DevOps vs. DataOps
DevOps methodology was introduced because people were looking for a technical solution to an operational problem. It was a ground-breaking set of practices that took down the barriers between developers and the operation/business. But matters got even more complicated when data and data science entered the business as a new stage in the production process. As a reaction, DataOps emerged as a solution for bringing the battle-proven combo of developers and operations into data processing and eliminating silos between developers, data scientists and operations.
In his effort to educate and demystify the relatively-new methodology, Lars Albertsson, founder of Scling, presented some DataOps practices and tools that are applied in everyday development and operational workflow.
Disrupted and disruptors
DataOps, agile and lean doesn’t mean doing risky operations of testing the code in production. This was a lesson learned from one of Lars’s earlier experiences. Risky operations had been just a disaster waiting to happen, and it finally did when he made one mistake in the deployment that affected 4 million users. For Lars, it was the worst day in his professional life, he remembers.
Operational risks, silos and friction are what separates disrupted companies from the company-disruptors in the digital revolution. The disruptors, in contrast, have taken steps to address them and move forward.
Lars explains that disruptors in the digital revolution share common properties that allow them to excel with DataOps:
- Data mature companies that can get a sustained ROI from machine learning.
- Quick in turning an idea into production.
- They keep their data into one homogenous data platform and work with data pipelines.
- Organisations are aligned by use cases rather than technology.
- Data internally democratised.
- Diverse teams of developers, data scientists and operators.
Big data – a collaboration paradigm
Although the hype around the term big data is somewhat dwindling, its main impact still resonates. And it’s not about technology, but about collaboration, Lars maintains. It’s a practice where the data of the company is put in one big data pool (lake) so it’s easily accessible and democratised across the whole organisation. So everyone with an idea can reach the data and build a data pipeline to realise the idea. Before big data and data lakes, there were point-to-point integrations with data, so if a person had in idea, they needed to go to 5 different people which slowed down the realisation, Lars points out.
Types of data integration
Lars explains the 3 main ways to integrate different systems:
- Online integration – Service-oriented architectures or microservices. This a point-to-point synchronous integration, with a very short latency.
- Offline integration – This type of integration is on the other side of the spectrum because of its high latency. It works with file storage (Hadoop) and it is processed with an asynchronous batch of hours or days of data.
- Nearline integration – The compromise in the middle of the spectrum. It is done with stream processing, where the data is processed in real-time, so it’s almost as fast as the online integration, but it’s still asynchronous as it’s broadcast.
Everyone would agree that faster integration is always better. But every luxury comes with a tradeoff, and integration is not an exception. This tradeoff is operations.
Operational manoeuvres – tradeoffs
Online point-to-point integration with microservices
The three operational scenarios Lars explains are
- Rolling out an upgrade – Should be carried out with the highest level of attention because it can affect millions of users. The upgrade should be done with proactive QA.
- Service failure – If you have services failure, it affects online users. Lars also states that cascading failures are not uncommon in this scenario where one service failure activates the next and the next one.
- Bug – The worst scenario is if you by chance introduce a business logic bug and emit incorrect data, which spreads across the system and causes data corruption. Unfortunately, there are no good patterns and tooling to recover from it. One option is to do a backup, but that inevitably will cause data loss.
The other side of the spectrum is batch processing. Lars explains how the process goes in a data mature environment. Data is dumped from online services to the data lake. The data is raw and unprocessed and is kept in a cold store for further processing. Each processing step is called a job and produces a new dataset (a new file or set of files). These datasets are immutable, if they are correct they are left as they are. At the end of the processing line – the pipeline – you get a data value like a recommendation index or a fraud detection model, which is pushed out to the online services to be used by the users. The component that ties this whole process together is called workflow orchestration. According to Lars, this technology is probably the most important for business processes because it enables collaboration.
Offline batch processing
Lars continues to explain the operational manoeuvres in an offline batch setting:
- Upgrade – In an offline batch setting, we can have an instant update rollout. There is again a certain risk because if the processing job crashes, the pipeline stops and we have a service failure. But users are not affected and since the harm is of a smaller scale, the process is done faster.
- Service failure – As mentioned above, when a service failure happens in an offline batch scenario, the pipeline stops. But no data is lost and online service runs on old recommendation indexes and old fraud models, which can be used in the meantime until it’s fixed.
- Bug – if we introduce a bug in our code in batch processing, downstream is affected and a faulty fraud detection model rolls out, which will be used in the online service. It is harmful, but it can be fixed by reverting back to old versions, finding and fixing the bug and removing all faulty datasets downstream. The fix is a quick operation.
Production critical upgrade
When there are critical pipelines to work with, like in Lars’s example financial calculation pipelines, the best practice is running parallel pipelines, comparing and understanding the differences. The parallel pipelines don’t disrupt each other so we only need a production environment, no development or staging environment, which reduces the cost of operations.
Nearline – streaming
The streaming operation shares the properties of both online point-to-point services and offline batch processing.
- Upgrade – If we have a new pipeline to roll out, we can run the streaming pipelines in parallel and compare them if the pipelines are critical.
- Service failure – In case of a failure, the pipeline stops and causes damage. But the stream storage keeps history. So when the bug is fixed and it’s up and running, we can replay and recover very easily.
- Bug – But introducing a bug and produce erroneous data is more complicated because it is fed into downstream services. Compared to the batch processing where we have 3 days of faulty datasets which can be easily managed, in streaming, we have 3 million faulty events and it’s extremely hard to track them because the dependency graph is implicit. Unfortunately, there are not toolings yet that can help recover.
As in all other processes, data processing needs to be monitored as well. Lars shares what aspects are important to monitor and what are the best tools for monitoring.
– Pipeline reliability – the best tool for measuring is Airflow. Although Airflow is not primarily a monitoring tool, but a workflow orchestration tool, it has a monitoring view. So if you use Airflow for workflow orchestration, it includes a pipeline availability dashboard.
– Data correctness (for one dataset) – best tools for measuring are processing tools counters (Spark, Hadoop) which you bump when you reach an unexpected path in your codebase. These counters are collected and pushed through the database and folded into the regular monitoring structure.
– Data correctness (comparing historical datasets and multiple datasets) – a dedicated pipeline quality assessment.
Building machine learning features
Only after companies are successful in all previous stages, can they focus on building machine learning features. Machine learning creates hype and everyone wants it, but not everyone can build it because it’s really complex, states Lars. A real machine learning feature consists of multiple pipelines that train models and measure user behaviour to see if the models are good, alternative models e.g., for the summer season, for Black Friday, for the Christmas season, and new and old models between which companies need to be able to alternate.
But in the end, “machine learning is less about models and more about engineering complexity,” argues Lars.
DataOps – a double-bladed sword
What we can learn from all the above is that DataOps connects development, data science and operations. But it also presents a constant tradeoff between data latency on the left part of the spectrum and innovation speed on the right side. For companies starting with DataOps, Lars highlights that it’s much better to be skewed towards innovation speed because you can focus on business logic, which is what users value.