The Data Mesh paradigm has huge potential to replace the centralised data lake and data warehouse as the dominant architectural patterns in data and analytics, says Max Schultze, Data Engineering Manager at Zalando.
Backed by his personal experience of applying the Data Mesh concept in practice and by dedicated field research, Max is joining us at the 6th edition of the Data Innovation Summit to reveal the most common pain points at different stages of the journey and battle-tested approaches to overcoming those challenges. He is also bringing technical and organisational insights from a range of companies: those that are just starting to promote a mindset shift in working with data, those that are already transforming their data infrastructure landscape, and advanced companies that are working on federated governance setups for a sustainable data-driven future.
As a segue to his talk, Max shared his knowledge of the core principles of Data Mesh, the idea behind domain-driven data products, his lessons learnt from applying Data Mesh and his advice for moving towards a Data Mesh architecture.
Hyperight: Hi Max, I’m very excited to welcome you as a speaker to the 6th edition of the Data Innovation Summit. What would you tell us about yourself as an intro to our discussion?
Max Schultze: Hi Ivana, thanks a lot for inviting me, I am very excited to be here this year. Data Innovation Summit has a reputation for bringing practice-proven ideas in the data space to a broader audience, and I am very happy to take part in its 6th edition to share more insights into the topic “Data Mesh in Practice”. I am currently a Data Engineering Manager at Zalando, Europe’s biggest online platform for fashion, and have had the opportunity to experience innovations and challenges in the data space first-hand by leading the team responsible for the storage layer of a multi-petabyte data lake.
Driven by that, I started to get involved with the Data Mesh idea at the end of 2019 and soon realized that many of the presented concepts are very close to the things we had discovered and tried to address on our own. That realization put me in a position to start talking publicly about the practical parts of the topic. Since then I have followed that up with several conference talks, an introductory O’Reilly training on the topic, and a soon-to-be-released industry report.
Hyperight: At the Data Innovation Summit 2021, you are going to present on Data Mesh in Practice: How to set up a data-driven organization. Data Mesh is one of the latest trends in data analytics promoting distributed domain-driven architecture that holds promise to replace centralised data lakes and data warehouses. What are the core principles of Data Mesh that make Data Mesh a better architecture than a centralised one?
Max Schultze: First and foremost, Data Mesh is trying to address the way we think about data. For many years data has been merely a side product of the production processes we operate in our companies. While we tried to address issues of data quality inside data warehouses, it was usually a few central teams that took care of them, and we had to realize that this approach ultimately does not scale with the ever-growing amount and variety of data we are producing today. The data lake seemed to be our saviour for a while, as new technologies and the shift to the cloud introduced us to virtually infinite storage and processing capacity. Unfortunately, the question quickly became “What data can we store?” instead of “What data should we store?”, and our ambitions to create well-maintained data lakes of high data quality quickly turned into data swamps of unclear ownership and responsibility.
This is where Data Mesh comes in and tries to address the mess many of us are facing right now. By introducing the idea of data products we attempt to turn previously unmaintained datasets into valuable assets with a clear purpose and defined stakeholders. At the same time we do so in a distributed, domain-driven way by ensuring that the ownership and responsibility of such data products lie with those who know the data best. To make such a distributed setup truly scalable, it becomes necessary to provide a self-serve, domain-agnostic data infrastructure platform. Lastly, to ensure distributed data products do not turn into disconnected domain silos, we are introducing the concept of federated computational governance.
Hyperight: Domain-driven data products are the key concept of Data Mesh. Could you please explain to us the idea behind them?
Max Schultze: Treating a dataset as a product means that a team developing such a data product needs to have product management that defines a roadmap for that dataset, manages requested features, and ultimately understands the requirements of the data product’s customers, i.e. its internal users. In return, the team also gets resources and management support based on the success of its data product. For instance, if more internal users are using the data product, or if more data products are built on top of this team’s data product, this is recognized in the same way as building a successful digital product for external customers.
To define decentralized ownership for such data products, Data Mesh applies domain-driven design. From an architectural perspective this means that instead of using systems, technologies, or process stages as the guiding criteria for structuring ownership, business domains or their subdomains should be used to define boundaries of ownership. The idea here is to build up domain expertise and then give domain experts both the authority to make the important decisions needed to generate the most value from the data that belongs to their domain and the capabilities to implement these decisions (and deal with the consequences).
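To make the two ideas above more concrete, here is a minimal Python sketch, not a description of Zalando's setup. It assumes a hypothetical catalog of data product descriptors: the class, field, domain, and team names are all invented for illustration. Each data product records the business domain and team that own it, and its "success" is approximated by counting internal consumers plus the data products built on top of it.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor of a data product (all fields are illustrative)."""
    name: str
    domain: str        # business domain that defines the ownership boundary
    owner_team: str    # team accountable for roadmap, features, and quality
    consumers: set[str] = field(default_factory=set)  # internal users
    upstream: set[str] = field(default_factory=set)   # data products this one builds on


def adoption(product: DataProduct, catalog: list[DataProduct]) -> int:
    """Internal consumers plus data products built on top of this product."""
    downstream = [p for p in catalog if product.name in p.upstream]
    return len(product.consumers) + len(downstream)


def ownership_by_domain(catalog: list[DataProduct]) -> dict[str, list[str]]:
    """Group data product names by the business domain that owns them."""
    grouped: dict[str, list[str]] = {}
    for p in catalog:
        grouped.setdefault(p.domain, []).append(p.name)
    return grouped


# Invented example data: two domains, one product building on the other.
orders = DataProduct(
    name="orders",
    domain="checkout",
    owner_team="checkout-data",
    consumers={"finance", "marketing"},
)
recommendations = DataProduct(
    name="recommendations",
    domain="personalization",
    owner_team="reco-team",
    upstream={"orders"},
)
catalog = [orders, recommendations]

print(adoption(orders, catalog))      # 2 consumers + 1 downstream product
print(ownership_by_domain(catalog))   # product names grouped per owning domain
```

The point of the sketch is that ownership is keyed by business domain rather than by system or pipeline stage, and that usage by others is a measurable property of the product itself.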
Hyperight: What are your lessons learnt from applying the Data Mesh concept in practice?
Max Schultze: Applying the Data Mesh concept in practice is a long and tedious journey. Ultimately we are trying to change the way we work with data in a broader organizational scope. Pushing for an organizational rethinking, however, does not mean that you cannot play your part. As with many big changes, the first steps are small and it is important to foster local culture and build the first successful MVPs before trying to attempt a company-wide scale.
Personally, my biggest lesson came from the data infrastructure platform side of things. While it is absolutely possible and even necessary to build the right tooling for your data mesh to scale, it is not about the technology itself. It is more important to follow the underlying principles of building self-service infrastructure in a domain-agnostic way. There are many tools to get you there, and the specifics will depend heavily on your company’s setup.
Hyperight: One of the key points in your talk will be the main pain points. What are some of the biggest challenges when implementing Data Mesh?
Max Schultze: Especially when getting started it is easy to hit early roadblocks that might seem insurmountable, but being aware of some of them can clear a path. Don’t overload your people. In many cases existing teams, and product managers in particular, already have the necessary skills to start building initial data products. However, it is important to factor in not only skills but also capacity. Don’t place the seed project that is supposed to change your company in a team that is already overloaded with its day-to-day business without allocating additional time and resources to take it on.
Another important challenge arises when taking on data infrastructure responsibilities. Without conscious decisions about which capabilities to provide, it is easy to take on central responsibility for data again and, with that, run into the same scalability issues that we were originally trying to overcome.
Hyperight: What would you advise companies that are thinking about starting their Data Mesh journey? What are some best practices to follow?
Max Schultze: Start small but with commitment. The companies that I have seen be most successful in moving towards a data mesh architecture neither planned a company-wide program to introduce a data mesh nor secured lots of resources for a big data infrastructure project. But they also did not decide on a whim to try out a little data mesh experiment in some lab. Instead, the most successful approach is to carefully select a meaningful use case with a limited but valuable impact and then to provide all the support you can to make this first data product a success that can be demonstrated and learned from.