Five years in data engineering is like an eternity. There have been some major developments in the field that contribute to more efficient data processing and streamlined collaboration between data scientists, data analysts and data engineers.
Michal Gancarski, Data Engineer at Zalando, shares with us his perspective on the data engineering developments of the past several years and brings us closer to the process of building data pipelines and serverless data infrastructure.
Hyperight: Hi Michal, we are happy to have you as a speaker representing Zalando at the 5th Celebrate edition of the Data Innovation Summit. It’s your first time with us, so please tell us a bit more about yourself and your role at Zalando.
Michal Gancarski: I started my career in software development as a web developer in Krakow, focusing mostly on smaller, freelance projects. After moving to Berlin over six years ago, I joined RapidApe, an ad-tech startup where I had the opportunity to tackle more complex issues like managing data models, maintaining and developing toolkits for building analytical dashboards, building an analytics API and, more importantly, diving into data processing workflows necessary to keep the operation running.
This experience allowed me to successfully apply for a backend engineering position at Zalando, where I quickly switched to what interested me the most – data engineering. Since then, I have spent most of my time at Zalando working on various subsystems of the Data Lake the company was building. I have taken part in diverse projects like a centralised collection of dataset metadata, pipelines delivering those datasets, access management for tens of engineering teams and others.
Currently, while still at Zalando, I am focusing less on data infrastructure and more on the development of data and machine learning pipelines. I am a member of a team that applies tools of data science to help Zalando automate and improve its buying decisions with respect to distributions of apparel sizes for various combinations of clothing categories and styles. There is an (as of yet) untapped potential there to reduce waste, optimise stock and, in the end, positively influence the bottom line of the company.
Hyperight: As 2020 is the year in which the Data Innovation Summit turns 5, could you point out what have been the most important developments with data and advanced analytics in the last five years according to you?
Michal Gancarski: Five years in data engineering seems like a long time. It is hard to believe, for example, that distributed data processing engines like Apache Spark and Apache Flink are only twice as old, even if we take into account early development periods in academia.
Anyway, there were many important developments in data and analytics over this period. Let me talk about the ones I consider most significant:
- The emergence of scalable and relatively inexpensive cloud storage, like AWS S3 or Google Cloud Storage. At first, those services had compatibility issues with common frameworks like Hadoop MapReduce. Fortunately, over time libraries were developed to handle those issues transparently. Nowadays, large object stores serve as a de facto replacement for clusters running distributed file systems like HDFS, acting as data sources and data sinks for computation, processing and query engines like Spark, Presto, Impala or Flink.
- Data Lakes as repositories of datasets complementary to traditional data warehouses and data marts. While the concept of the Data Lake is older than five years (it dates back to 2010), its adoption only accelerated more recently. This is partially a consequence of the previous trend. With operational simplification and the falling cost of data storage, companies preserve more and more datasets in their raw form, to process them differently depending on the use case: sometimes to train machine learning models (which requires deriving new attributes through feature engineering, combined with flat, denormalised schemas), sometimes for more traditional BI applications. The latter usually means normalised star and snowflake schemas.
- Notebooks as interactive gateways to data analysis, data science and engineering. The ability to mix different programming and query languages, visualise data and leave explanatory comments in one shareable environment has changed the way we work with data. For example, a data scientist can share her report or experiment (code, data, discussion of the methodology) with the rest of the team in one place. This notebook can then be used by other data scientists to validate the results, or by business analysts to generate a simplified report and communicate the impact of the results to decision-makers leading the organisation.
Even more, data engineers can use the same notebook to improve their understanding of what data is needed, and in what form, to turn an experiment into a production pipeline. In fact, I see a growing number of data engineers working with notebooks to prototype data pipelines and experiment with various ways of expressing the data transformations they want to implement.
- Python becoming the common language of data engineering and data science, slowly (but not completely) replacing Java, R or Matlab. A modern team consisting of data engineers and data scientists can perform most of its tasks using Python and libraries written for it. This includes building and scheduling data pipelines, interacting with cloud infrastructure, performing preliminary data analysis, prototyping or deploying machine learning models in production.
Even if the cores of some of these libraries and frameworks are implemented in more performant languages (like C++ in the case of TensorFlow, or Scala in the case of Spark), there is always a way of interacting with them from Python. We got to the point where we see job openings asking explicitly for “Python Data Engineers”.
To be clear, Python will never fully replace other languages but at the moment it is the safest bet for someone willing to start their career in a broadly understood field of data.
- Stream processing rising in popularity and enabling new applications of analytics and machine learning to problems like financial fraud detection, optimisation of online advertising and recommendations, but also IoT in general. Especially the last one looks significant: from electric scooters to the monitoring of industrial devices and public transit systems, we are seeing huge improvements in all of those areas.
Thanks to advances in engines like Apache Flink and other developments, like the introduction of Structured Streaming to Apache Spark, it is becoming easier to express correct, complex computations on streaming data and compose those into larger workflows. It is, in essence, an expansion of dataflow programming into the world of large-scale, distributed systems handling high-volume, real-time data streams.
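To make the idea concrete, here is a minimal, single-process sketch of a tumbling-window aggregation in Python. It is only a toy illustration of the kind of computation engines like Flink or Structured Streaming perform at scale; real engines add distribution, fault tolerance, event-time semantics and watermarks.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed-size windows, counting per key.

    A single-process toy version of the windowed aggregations that stream
    processing engines run over unbounded, distributed data streams.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align each event to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Clicks arriving at seconds 1, 2 and 61 for two users, 60-second windows:
events = [(1, "user_a"), (2, "user_a"), (61, "user_b")]
print(tumbling_window_counts(events, 60))
# {(0, 'user_a'): 2, (60, 'user_b'): 1}
```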
Hyperight: You are going to present at the Data Engineering Stage on how to design, build, deploy and monitor a serverless data infrastructure. As you yourself are a data engineer, your presentation is of a more technical nature. Could you explain to us how the initiative for a serverless data infrastructure began in Zalando and what are the benefits from it?
Michal Gancarski: When I joined Zalando, the company was already deploying nearly all of its microservices in the cloud (in this case – on AWS), using tooling built internally for this particular purpose. However, there was still no certainty on how to take advantage of serverless components in the context of data processing. While it was clear that S3, Amazon’s large-scale object store, is going to be the go-to location for Data Lake datasets, our thinking about data pipelines still gravitated towards traditional applications deployed on EC2 instances that, for all intents and purposes, are managed virtual machines.
This approach has proven to be of limited scalability in terms of engineering capacity available to the Data Lake team. Given the complexity of building and managing a Data Lake with thousands of datasets, we were looking for ways to offload as much operational burden as possible to the cloud vendor and let its infrastructure handle growing data volumes, traffic and density of scheduling.
Since essentially every data pipeline is a collection of queues, schedulers, workflow managers, processing engines and, last but not least, storage layers, we started looking into replacing more traditional tools with their serverless counterparts, like SQS (Simple Queue Service), and into using AWS Lambda (lightweight units of stateless computation) to compose more elaborate applications.
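As an illustration of this composition, a hypothetical AWS Lambda handler consuming a batch of SQS messages might look like the sketch below. The payload shape and field names are invented for the example, not taken from Zalando's actual pipelines; only the outer `Records`/`body` event structure follows what AWS delivers for an SQS trigger.

```python
import json

def handler(event, context=None):
    """A hypothetical Lambda handler triggered by an SQS queue.

    Each SQS record carries a JSON body naming a dataset partition;
    the handler parses the batch and reports what it processed.
    """
    processed = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        # In a real pipeline, this is where the partition would be
        # validated and registered in the Data Lake's metadata store.
        processed.append(body["partition"])
    return {"processed": processed}

# Shape of the event AWS delivers when SQS invokes the Lambda:
event = {"Records": [{"body": json.dumps({"partition": "dt=2020-01-01"})}]}
print(handler(event))  # {'processed': ['dt=2020-01-01']}
```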
The biggest breakthrough in this direction came when AWS Step Functions, Amazon’s serverless workflow offering, became available in Europe. We decided to try using them for a prototype version of one of our pipelines, and it worked out really well. The pipeline was put into production much faster than we otherwise could have managed. So far, it has been running without major incidents for over a year.
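For a flavour of what a Step Functions workflow looks like, here is a minimal Amazon States Language definition, expressed as a Python dictionary, for a hypothetical two-step pipeline. The function ARNs are placeholders, and a production workflow would add retries, error handling and parallel branches.

```python
import json

# A minimal Amazon States Language (ASL) definition: run a transformation
# Lambda, then a publishing Lambda. The ARNs are placeholders, not real
# functions; real definitions would add Retry and Catch clauses.
state_machine = {
    "Comment": "Illustrative dataset-delivery workflow",
    "StartAt": "Transform",
    "States": {
        "Transform": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:transform",
            "Next": "Publish",
        },
        "Publish": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-central-1:123456789012:function:publish",
            "End": True,
        },
    },
}
print(json.dumps(state_machine, indent=2))
```

This JSON document is what you would pass to Step Functions when creating the state machine; the service then schedules, retries and tracks each execution without any servers to manage.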
After this initial positive experience, we have decided to double down on the approach, not only for new pipelines, but also for rebuilding those that were already in place. This way a relatively small team was able to maintain and develop a petabyte-scale Data Lake and operate several large pipelines that not only deliver a fixed collection of datasets but also let other teams at Zalando add more of them using a self-service approach.
Hyperight: Is there anything you have to be careful about when building serverless data infrastructure?
Michal Gancarski: There are several aspects of serverless data infrastructure that companies need to be mindful of. They are mostly related to the way cloud vendors operate.
First of all, vendor lock-in. This may or may not be a significant issue, depending on how we look at it. However, mitigation strategies can be put in place in how, for example, data pipelines are built. If you use Google Cloud Functions or AWS Lambda, try to write as much of their code as possible in a way that is transferable to other platforms. More generally, when dealing with distributed computation, make sure you can express it in a portable way. For example, a stream processing job written for Apache Flink can be reused in Kinesis Data Analytics on AWS.
Second, mind the scalability of your budget and the projected cost of storage and computation. While cloud infrastructure promises rapid scalability without all the operational hassle usually associated with it, it will happily scale beyond the size you may want to pay for. It is easy to add data to an “unlimited” object store, but with every additional gigabyte you will pay more on an ongoing basis. To mitigate that, enforce a proper data retention policy that ensures rarely used or low-value data assets are deleted or moved to cheaper storage classes.
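As a sketch of such a retention policy, the following S3 lifecycle configuration (in the shape accepted by boto3's `put_bucket_lifecycle_configuration`) transitions a hypothetical raw-data prefix to a cheaper storage class after 90 days and deletes it after two years. The bucket name, prefix and thresholds are illustrative, not a recommendation.

```python
# An S3 lifecycle configuration that moves objects under a raw-data
# prefix to Glacier after 90 days and expires them after two years.
# All names and thresholds here are hypothetical examples.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-raw-events",
            "Filter": {"Prefix": "raw/events/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 730},
        }
    ]
}

# Applying it would look roughly like this (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
```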
Third, remember that cloud and serverless are just a layer (or several layers) of software sitting on top of physical data centres. This means that at some point you may reach soft or hard limits of your cloud provider. Soft limits are usually easy to handle by contacting customer support. However, before you adopt a service from your cloud vendor’s offering, check whether it has hard limits imposed on some of its dimensions (data and request throughput, scaling out, scaling up, etc.). This way you can avoid nasty surprises at a critical moment, when you are the most vulnerable, i.e. when you need to scale out further but cannot, or are slower to do so than expected.
Hyperight: What are some data engineering trends that would mark 2020 according to you?
Michal Gancarski: Apart from the continuation of what has been happening over the last five years, one additional trend comes to my mind.
In 2020 we will see growing popularity and adoption of table formats offering transactional guarantees on large datasets stored in cloud object stores. I am talking about solutions like Delta Lake, Apache Hudi or Apache Iceberg. Their biggest draw is that they bring back ACID properties to large datasets accessed and processed by a diverse ecosystem of computation frameworks. Working with storage formats that ensure snapshot isolation or non-conflicting, transactional writes originating in multiple sources, can greatly simplify (and sometimes even enable) many data engineering tasks.
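The core mechanism can be illustrated with a toy, file-based version of a transaction log: data files become visible to readers only once a numbered commit entry is written to the log, so every read sees a consistent snapshot. This is a drastic simplification of what Delta Lake, Hudi or Iceberg actually do (no schema tracking, file statistics or optimistic concurrency control), intended only to convey the idea.

```python
import json
import os
import tempfile

def commit(table, files):
    """Make a set of data files visible by appending a numbered log entry.

    In real table formats, the final step is a single atomic operation,
    which is what gives readers snapshot isolation.
    """
    log = os.path.join(table, "_log")
    os.makedirs(log, exist_ok=True)
    version = len(os.listdir(log))
    with open(os.path.join(log, f"{version:020d}.json"), "w") as f:
        json.dump({"add": files}, f)
    return version

def snapshot(table):
    """Readers reconstruct the table from committed log entries only."""
    log = os.path.join(table, "_log")
    visible = []
    for name in sorted(os.listdir(log)):
        with open(os.path.join(log, name)) as f:
            visible += json.load(f)["add"]
    return visible

table = tempfile.mkdtemp()
commit(table, ["part-0001.parquet"])
commit(table, ["part-0002.parquet"])
print(snapshot(table))  # ['part-0001.parquet', 'part-0002.parquet']
```

A half-written data file never appears in a snapshot because it is not referenced by any committed log entry, which is exactly the property that makes concurrent writers and readers safe on an object store.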
As a consequence of the above, we will see further unification of stream and batch data processing in terms of how we express data transformations but also how we store and transmit data in our daily workflows.
Current advances in this area include, among others, streaming support for Iceberg in Apache Flink, achieved by Netflix – the original creators of Iceberg. Another noteworthy development is continuously improved integration of Delta Lake (which originated at Databricks) not only with Apache Spark but also with other engines like Presto or Redshift Spectrum.
Hyperight: Some experts predict that the solution for the lack of data engineers would lead to “citizen data engineers” – employees outside of the data engineering team will oversee and manage data pipelines, as well as the overall data lifecycle in order to meet data engineering needs. Do you see this happening in 2020?
Michal Gancarski: We are already seeing this happening at Zalando to a certain extent, with some of the most important (and largest) data pipelines being co-managed by the teams interested in particular datasets. In this pattern, a central team builds a data pipeline framework of sorts that can be further configured by a stakeholder when needed.
A stakeholder determines the source of the data, the types of transformations that should be performed on it, and the location where the results are to be stored. In our case, this happens through pull requests to a central repository. After a PR is reviewed and merged, a continuous integration process is triggered and the pipeline framework adds the new dataset to the list of already processed ones.
This approach is not limited to ETL, though. At Zalando, we have deployed similar mechanisms for metadata management (mostly for making dataset schemas and security classifications of dataset attributes updatable by interested teams) and infrastructure management. Using a pull request, you can, for example, request a new Databricks Spark cluster, fix incorrect metadata if you find a mistake or request access to particular datasets.
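As an illustration of such a self-service configuration, a stakeholder-submitted dataset specification and a CI-style validation check could look like the snippet below. All field names, paths and values are invented for the example and do not reflect Zalando's actual framework.

```python
# A hypothetical dataset specification of the kind a stakeholder team
# might add via pull request. Every name and path here is invented.
dataset_spec = {
    "name": "sales_events_daily",
    "source": "s3://data-lake/raw/sales_events/",
    "transformations": ["deduplicate", "anonymise_customer_ids"],
    "sink": "s3://data-lake/curated/sales_events_daily/",
    "schedule": "daily",
    "owner": "team-sales-analytics",
}

REQUIRED_FIELDS = {"name", "source", "transformations", "sink", "schedule", "owner"}

def validate(spec):
    """The kind of check a CI step could run before the PR is merged."""
    missing = REQUIRED_FIELDS - spec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True

print(validate(dataset_spec))  # True
```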
Hyperight: Thank you for your time.
Michal Gancarski: Thank you as well. I am looking forward to presenting at the Data Innovation Summit this year!