Our mission as the forecasting team of Maersk is to deliver valuable forecasts to support and automate the planning and delivery of the physical products our customers book. We develop and operate several machine learning models that are crucial for our business. To deliver these forecasts reliably, we need good DataOps practices, and we need to improve them constantly.
As a team, we apply a Site Reliability Engineering mindset to our data development and operations: we aim to prevent operations issues, purposefully minimize operations work, and thereby scale what we can deliver as a whole. Refining our development and operations processes to deliver better and better outcomes is a continuous improvement process.
For me, as a Data Engineer, this means continuously improving how we develop and operate data products. Drawing inspiration from Site Reliability Engineering or Software Engineering in general is, of course, not a novel idea. However, I often see and hear the claim that because data is in the mix, we cannot apply software engineering practices.
Yes, data makes things different: data constantly changes, data is dirty, data is late, data is generally awfully behaved! But let me try to argue for applying software engineering practices to improve the way we work with data – and give some examples of how we handle DataOps challenges in our team.
Iteration cycles are arguably the most crucial factor for efficient development work. The faster we can iterate with the right feedback, the better. The same is true for operations: we want to be able to quickly solve problems that occur. When it comes to data, both the development work and the operations work can have iteration cycles that are just too long. The cause may be a data transform that takes a long time to run, a lengthy deployment process, things breaking downstream from our change and causing a lengthy rollback, or trying to figure out where the bad data got into our data hairball.
One way to develop fast is to develop locally – if the problem is large, breaking it down to fit on our machine is a great way of getting snappy feedback to start with. Then we want to deploy our code as fast as possible and run it against production data, so that we get the right feedback and can iterate if needed. Of course, lots of things can go wrong in that process, but again, this is something software engineers face day-to-day. We want to continuously integrate using tested code to have confidence that deploying our changes won’t break things, we want to automatically create disposable environments or namespaces to isolate our development when needed, and so on.
To resolve operations issues quickly, we want to have a system that is easy enough to understand. Yes, we want observability, but we need to be able to comprehend what is wrong and potentially where it is wrong. And once we figure out what is wrong, we want to add guard rails in our code, so it does not go wrong again. And then we want to take it a step further and prevent these problems from occurring in the first place – we want to detect problems as soon as they occur and stop operations and fix things proactively.
Here are some tangible examples of how my team tackles the above challenges to achieve short iteration cycles and prevent us from wasting time on operations. Mind you, we are not perfect, but we are improving 🙂
Continuous Integration and Change
There are many ways of working with CI in data. Automated testing plays a key role in avoiding bad changes, but more importantly, the concept of continuously integrating new code and jobs is a key enabler for the evolution of our data as well as for collaboration.
In my team, as probably in many others, the main branch is always deployed to production. We do not have any fixed lower environment, only disposable on-demand environments. Currently, we use feature branches when we want isolated environments. These branches automatically create isolated environments in our non-production environment, but with the capability to read data from prod. This is critical for fast iteration because we want to avoid managing multiple fixed lower environments, and we need to validate changes against the full production data to be confident about them.
This feature branch setup allows us to test the effects on all downstream jobs if needed and is therefore fantastic for avoiding downstream breakage. Essentially, we are taking the simple branching concept and applying it to data: we build the feature branch (automatically run all the jobs, including downstream ones) and validate that it builds (validate the changes in the data, including downstream) before we deploy to production. It is simple, easy to reason about, and therefore powerful.
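The mapping from branch to environment can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual setup: the `Environment` class, the `environment_for` function, and the `data/prod` and `data/dev/...` locations are all hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Where a pipeline run reads its inputs and writes its outputs."""

    read_root: str
    write_root: str


def environment_for(branch: str) -> Environment:
    """Map a git branch to a read/write location pair (hypothetical layout)."""
    prod = "data/prod"
    if branch == "main":
        # The main branch is production: it reads and writes prod data.
        return Environment(read_root=prod, write_root=prod)
    # Sanitise the branch name so it is safe to use as a directory or namespace.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    # Feature branches read prod data but only ever write to their own area.
    return Environment(read_root=prod, write_root=f"data/dev/{slug}")
```

A real setup would also need downstream jobs in the branch to read the branch's own outputs where they exist and fall back to prod otherwise; the sketch leaves that out to keep the core idea visible.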
Another practice we often use is building dark pipelines, i.e. building new data pipelines directly in production. This avoids spending time on merging branches and deploying to different environments, and it avoids errors when configuring different environments.
We will take this a step further soon by implementing feature toggles for our data development and ditching complications that come from using branches.
Testing, Testing & Testing
Testing is fundamental. There is a great recent post by Gergely Orosz on the value of unit testing in software, and it applies equally well to data: we validate our code and understanding, document complicated transforms, and create a refactoring safety net.
The good news with data pipelines is that they are easy to test. If we build functional data pipelines, we can simply generate some input data and test against the output. We can do that for one function, a step in a pipeline, or the whole pipeline to test our data flow end-to-end. We generally focus on two things: unit tests for complicated functions or transformations, and bigger end-to-end tests.
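To make the "generate input, test against the output" idea concrete, here is a minimal sketch of a pure transform and its unit test. The transform `weekly_totals` and the column names `route`, `week`, and `teu` are invented for illustration and are not taken from our pipelines.

```python
from typing import Iterable

Row = dict  # one record, e.g. {"route": "A-B", "week": 1, "teu": 10.0}


def weekly_totals(bookings: Iterable[Row]) -> list[Row]:
    """Pure transform: sum booked volume per (route, week)."""
    totals: dict[tuple[str, int], float] = {}
    for row in bookings:
        key = (row["route"], row["week"])
        totals[key] = totals.get(key, 0.0) + row["teu"]
    # Sort by key so the output order is deterministic and easy to assert on.
    return [
        {"route": route, "week": week, "teu": teu}
        for (route, week), teu in sorted(totals.items())
    ]


def test_weekly_totals_sums_per_route_and_week() -> None:
    # Small hand-written input that covers grouping across routes and weeks.
    bookings = [
        {"route": "A-B", "week": 1, "teu": 10.0},
        {"route": "A-B", "week": 1, "teu": 5.0},
        {"route": "A-B", "week": 2, "teu": 1.0},
        {"route": "C-D", "week": 1, "teu": 2.0},
    ]
    assert weekly_totals(bookings) == [
        {"route": "A-B", "week": 1, "teu": 15.0},
        {"route": "A-B", "week": 2, "teu": 1.0},
        {"route": "C-D", "week": 1, "teu": 2.0},
    ]
```

Because the function has no side effects, the same pattern scales from a single function to a whole pipeline: only the size of the generated input and the expected output changes.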
Apart from code testing, we also focus on data testing. Following DataOps principles, we try to fail our pipelines as early as possible when we detect bad data. For this, we use Great Expectations: we have optional steps in our orchestrator to run validation for any source or destination of a pipeline. If the validation succeeds, everything runs; if it fails, the pipeline run fails and triggers an alert for support. We treat data incidents the same as any other Ops-related incident: anyone on the support team picks it up and fixes it.
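The fail-early behaviour can be illustrated without any library. The sketch below is plain Python and deliberately does not use the Great Expectations API; `DataValidationError`, `not_null`, `min_rows`, `validate`, and `run_step` are hypothetical names that only show the shape of a validation gate in front of a pipeline step.

```python
from typing import Callable, Optional

Rows = list[dict]
Check = Callable[[Rows], Optional[str]]  # returns an error message, or None if OK


class DataValidationError(Exception):
    """Raised to fail a pipeline run before bad data is processed."""


def not_null(column: str) -> Check:
    """Build a check that fails if any row has no value in `column`."""
    def check(rows: Rows) -> Optional[str]:
        missing = sum(1 for row in rows if row.get(column) is None)
        return f"{column}: {missing} null value(s)" if missing else None
    return check


def min_rows(minimum: int) -> Check:
    """Build a check that fails if the dataset is suspiciously small."""
    def check(rows: Rows) -> Optional[str]:
        if len(rows) >= minimum:
            return None
        return f"expected at least {minimum} row(s), got {len(rows)}"
    return check


def validate(rows: Rows, checks: list[Check]) -> None:
    """Run every check and raise once, listing all failures together."""
    failures = [msg for check in checks if (msg := check(rows)) is not None]
    if failures:
        raise DataValidationError("; ".join(failures))


def run_step(rows: Rows, transform: Callable[[Rows], object], checks: list[Check]):
    """Validate the source first; the transform only runs on data that passed."""
    validate(rows, checks)
    return transform(rows)
```

In an orchestrated setting, the raised exception is what would fail the run and trigger the alert; collecting all failures before raising gives whoever picks up the incident the full picture in one message.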
Data testing has tremendous value for us, as we sit in a very heterogeneous landscape with external data dependencies. Since implementing it, we have been able to prevent approximately one serious incident per month where previously we would have produced garbage forecasts.
Orchestration & Observability
Workflow orchestration is a key component of our work. Not only does it need to reliably orchestrate our pipelines, but it also needs to work well in our development process – it must be easy to use for our engineers and scientists and easy to collaborate with.
At Maersk, several teams created a workflow orchestrator for batch workloads that we are using and developing today. While this software is outside the scope of this post, a key aspect is that it is essentially just relying on Kubernetes and some file systems (such as Azure Blob), meaning that interacting with the distributed pipelines is simply using the Kubernetes API. It self-heals with retries, scales and is data-driven (when an upstream dependency is refreshed, the job runs).
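The data-driven part (a job runs when an upstream dependency is refreshed) can be sketched as a simple comparison of snapshot identifiers. This is only an illustration of the idea, not how the orchestrator is implemented; the `Job` class, the `jobs_to_run` function, and the snapshot ids are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """A pipeline job and the upstream snapshots it consumed on its last run."""

    name: str
    upstream: list[str]
    consumed: dict[str, str] = field(default_factory=dict)


def jobs_to_run(jobs: list[Job], latest: dict[str, str]) -> list[str]:
    """Return the names of jobs whose upstream data has been refreshed.

    `latest` maps each dataset name to the id of its newest snapshot.
    """
    due = []
    for job in jobs:
        # A job can only run once every upstream dataset exists.
        available = all(name in latest for name in job.upstream)
        # It needs to run if any upstream snapshot differs from the one it last used.
        changed = any(job.consumed.get(name) != latest.get(name) for name in job.upstream)
        if available and changed:
            due.append(job.name)
    return due
```

Comparing against the snapshots a job actually consumed, rather than against timestamps, keeps the decision deterministic and makes reruns after a retry safe.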
We can observe the state of our pipelines and we can directly get the logs in Kubernetes or DataDog. And since we are also sending all our data quality metrics there, we can have a nice poor man’s version of data observability in the same stack as all our other metrics. Again, using simple software engineering tools and practices, we solve some essential DataOps challenges.
We follow several practices to create a resilient system and make our data engineering and science work reproducible.
One of the most important concepts is functional data engineering and, most importantly, the immutability of data. We simply use files in our object store, and each pipeline is a functional transform: it takes the data and produces new, transformed data in a different location. The dataset is never mutated.
This makes it very easy to reason about differences down the line. Every dataset we use is an immutable snapshot, so if I run the same model on two different snapshots, I can backtrack any possible differences. It also means that I can very easily experiment with the data because I have a reproducible base.
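A minimal sketch of the immutable-snapshot idea follows, using the local file system in place of an object store. The directory layout (`<dataset>/<snapshot_id>/data.json`) and the function names are hypothetical and chosen only to show the write-once, never-mutate behaviour.

```python
import json
from pathlib import Path
from typing import Callable


def write_snapshot(root: Path, dataset: str, snapshot_id: str, rows: list[dict]) -> Path:
    """Write a dataset snapshot exactly once."""
    target = root / dataset / snapshot_id / "data.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" creates the file and raises FileExistsError if it already exists,
    # so an existing snapshot can never be overwritten.
    with target.open("x", encoding="utf-8") as handle:
        json.dump(rows, handle)
    return target


def read_snapshot(root: Path, dataset: str, snapshot_id: str) -> list[dict]:
    """Read a previously written snapshot."""
    with (root / dataset / snapshot_id / "data.json").open(encoding="utf-8") as handle:
        return json.load(handle)


def run_transform(
    root: Path,
    source: str,
    target: str,
    snapshot_id: str,
    transform: Callable[[list[dict]], list[dict]],
) -> Path:
    """Functional step: read one snapshot, write the result to a new location."""
    rows = read_snapshot(root, source, snapshot_id)
    return write_snapshot(root, target, snapshot_id, transform(rows))
```

Since the source snapshot is untouched and the output lands under its own path, running the same model against two snapshot ids gives two results that can be compared side by side.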
Our orchestrator takes care of that – it does not let us mutate data, which is great for data engineering and machine learning alike. You can watch Maxime Beauchemin or Lars Albertsson talking a little more about functional data engineering if you want to dig into this.
All of the above-mentioned processes are only efficient if automated: We do not want to manually provision environments, run tests manually, or ask another person or even team to deploy changes. It all relies on automation, on applying software engineering to these development and operational challenges.
And all of these processes are important DataOps processes.
I hope I showed that we can indeed apply a lot of the same practices from software engineering to solve DataOps challenges. In our team, we make changes to production several times a day – be that hotfixes or big changes. Creating a new data pipeline in production takes minutes, iterating on datasets in production takes hours to days, and fixing data problems also only takes minutes to hours (if we can fix them internally). Our preventive data quality measures continuously deliver value to the company by preventing garbage forecasts from being published.
All in all, we are getting better and better at this! But there is still a way to go: we want to implement feature toggles to rapidly and easily test data changes or insert steps in data pipelines, we want to take data quality to the next level with anomaly detection, and much more!
Note: I purposefully left team and org topologies out of scope, but they are extremely important. Having silos and handovers inherently slows us down and degrades the quality of our work. So while not discussed here, that topic is at least as important as the technical topics described in this post, and it is part of DataOps.
Sources
https://blog.pragmaticengineer.com/unit-testing-benefits-pyramid/
https://greatexpectations.io/
https://www.youtube.com/watch?v=4Spo2QRTz1k
https://www.youtube.com/watch?v=eD1yF3fAZcY
Micha Ben Achim Kunze will be presenting "DataOps is a Software Engineering Challenge" at the Data Innovation Summit, on how viewing your data and operations challenges as software engineering challenges will make you orders of magnitude more effective.
About the author
Micha Ben Achim Kunze, Lead Data Engineer at Maersk
I started my career in science, and my passion and obsession with automating myself out of a job turned me into an engineer.
As a Lead Data Engineer at Maersk I focus on leveraging good engineering to solve hard problems: reliable data products, consistently high data quality, a high sustainable velocity of change, and maintainability.