A Data platform is increasingly discussed as what will distinguish organisations from their competition. If an organisation has a Data product platform, that says a lot about its business goals and about its maturity in Data management and advanced analytics. It means the organisation has built a tool with sustainability in mind.
Most organisations develop a Data platform to manage how Data is stored and transformed, and to ensure that Data products are distributed at scale. On that foundation, they can make fast, reliable Data-driven decisions and deliver value to the organisation.
What value did Adevinta, the global market leader in online classifieds, gain from its Data product platform? What does the infrastructure of this platform look like? The interview with Iker Martinez de Apellaniz, Product Lead at Data Foundations at Adevinta, provides the answers to these and many other questions. It also highlights the questions organisations should ask before starting the process of building a platform or scaling it.
Most importantly, this talk uncovers a new trend that is changing the mindset and approach of organisations when working on Data platforms.
Hyperight: Can you please tell us more about you? What is your professional background and current working focus?
Iker Martinez de Apellaniz: My background is in engineering. My first job in 2007 was already focused on creating reports with Excel macros for insights and analytics, but I didn’t know at that time that Data would become my career. After that, I worked on different variations of similar tools, all with a focus on delivering Data to the people who needed it. I saw how the market evolved from uploading Excel reports to FTP, to custom-built web tools for dashboarding, to the modern third-party tools we have today such as Tableau and Looker. Eventually, I moved deeper into the Data availability journey, not only in displaying Data, but also in producing it. Here, I discovered “Big Data” while setting up an Apache Nutch crawler and my first Hadoop cluster. The natural progression was then to start developing the tools for others to create these Datasets. And once there, I discovered that my strengths were not in creating the tool, but in discovering what the tool should look like, how to make it more user-friendly and, most recently, how to develop the platforms and services needed to change the behaviours of a company. This gives companies the opportunity to collectively mature their Data culture, enabling them to create top quality Data products.
Hyperight: During this year’s NDSML Summit, you will share more on “Rise of the Data Products Platform”. What can the delegates at the event expect from your presentation?
Iker Martinez de Apellaniz: The presentation will tell the story of how Adevinta, and many companies like it, have grown their Data operations beyond a point of sustainability. I will explain which bottlenecks we’ve faced and give some tips to mitigate them. And I’ll show (as much as NDAs allow me) our plans for a future where Data products become first-class citizens, so a company’s mindset can embrace the Data mesh way of working.
Hyperight: To start with, can we define what Data products are? Are there different types of Data products and what advantages do they deliver to organisations?
Iker Martinez de Apellaniz: “Data product” is a confusing term. Some say it’s a Dataset; others say it’s any product that needs Data to run. For the sake of simplicity, when I talk about the Data Products Platform, I’ll refer to a system that creates Datasets under a series of requirements that bring them up to the standards we use across Adevinta. A Data product in this context is therefore a Dataset of the highest possible quality that is easy to discover and consume, and that brings value to the company.
Hyperight: When building Data products, were there any specific criteria or methodology you used, and what challenges did you encounter during the process?
Iker Martinez de Apellaniz: The first challenge was to reverse the long-established status quo of how we have been doing things in the industry. We needed to start by explicitly stating what we were trying to do: going beyond the columns and fields we create in a Dataset. We needed to make Data Engineers and Analysts think first about regulations and GDPR, as well as asking questions including:
● How and when will this Data be consumed?
● Has anyone created this Dataset before?
● How can multiple tenants consume this event?
● How will I monitor quality evolution and usage?
● How can I report lineage?
These are not top of mind today, as the biggest challenge is not with technology, but in creating habits inside the company that sustain organisational changes.
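One way to make these questions top of mind is to encode them as required metadata on every Dataset. The sketch below is purely illustrative (the class and field names are assumptions, not Adevinta’s actual platform API): a descriptor that reports which of the checklist questions above are still unanswered.

```python
from dataclasses import dataclass, field

# Hypothetical Data product descriptor. Each field corresponds to one of
# the questions producers should answer before a Dataset is published.
@dataclass
class DataProductDescriptor:
    name: str
    owner_team: str
    consumers: list[str] = field(default_factory=list)          # how/when the Data is consumed
    upstream_sources: list[str] = field(default_factory=list)   # needed to report lineage
    quality_checks: list[str] = field(default_factory=list)     # monitored over time
    retention_days: int = 365                                   # regulations / GDPR
    contains_personal_data: bool = False

    def missing_requirements(self) -> list[str]:
        """Return the checklist questions that still lack an answer."""
        gaps = []
        if not self.consumers:
            gaps.append("How and when will this Data be consumed?")
        if not self.upstream_sources:
            gaps.append("How can I report lineage?")
        if not self.quality_checks:
            gaps.append("How will I monitor quality evolution and usage?")
        return gaps
```

A platform could refuse to register a Dataset as a Data product while `missing_requirements()` is non-empty, turning the cultural checklist into an enforced contract.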
Hyperight: To ensure Data products are built and distributed at scale, at Adevinta, you’ve created a platform. When did you decide that you needed this type of platform and how can it benefit organisations?
Iker Martinez de Apellaniz: Speaking with the multiple marketplaces that make up Adevinta, it was clear that creating Datasets was already a solved issue, but it was also clear that there were multiple ways of doing it. Because of this, we decided to create a higher level of abstraction, a tool that could elevate this collection of Datasets into high-quality Data products. The idea behind this was that we would be able to expose the different silos of Data into a catalogue, which would then help us find and access it all. But to enable this catalogue to find the Data, collect lineage and help Adevinta grow, we needed to have a platform that aligned processes. This platform would then enable us to embark on this Data mesh journey.
Hyperight: What does the infrastructure of the Data platform look like?
Iker Martinez de Apellaniz: The Data Products Platform is the glue on top of many components. On one side, we store Data products. We call this part of the platform the “Data Lifecycle Management Platform”, or, for a less fancy name, “DataLakeS-aaS”. It helps store Data, and also provides the services and features that Adevinta (and other European companies) need to stay compliant. This platform is based on AWS S3 and IAM roles, but in a way that allows access management to be self-served by our users via a UI. It is the same UI that helps to manage Datasets in terms of retention, metadata management, access management and compliance. In the case of retention and compliance, the platform applies these rules by deleting the Data when necessary or when a DDR request arrives at the platform. (There is no need to run your own jobs to delete or extract Data for GDPR requests.)
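As a rough illustration of the retention rule described above (the data model and names here are assumptions for the sketch, not the platform’s real implementation), a centralised deletion job only needs each Dataset’s retention window and its partition dates:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Illustrative model: a Dataset stored as daily partitions.
@dataclass
class Partition:
    dataset: str
    day: date

def expired_partitions(partitions: list[Partition],
                       retention_days: int,
                       today: date) -> list[Partition]:
    """Select partitions older than the retention window, i.e. the ones
    a platform-run cleanup job would delete on the user's behalf."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if p.day < cutoff]
```

Running such a job centrally, rather than asking every team to schedule its own deletions, is what lets the platform guarantee retention and GDPR compliance uniformly.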
This GDPR processing uses the same component that the Data Products Platform uses to run and schedule jobs: a Kubernetes cluster for job processing called Unicron. This is also the same platform that supports our ML platform. The same UI that is used for storing Data also runs the Data catalogue. It is based on LinkedIn DataHub for the backend, but we rebuilt the frontend to gain control over the interoperability of all these platforms. This way, we benefit from the connectors, metadata definitions, contracts and infrastructure, while customising the final user experience to our needs. And then there is the Datasets platform, which creates the definition of the Dataset according to a set of rules, as well as the orchestration to run it over Unicron. It then initialises the source code for the transformation that produces the actual Data product.
Note that all these components, except the Governance UI, are multi-tenant. There is only one entry point for Data management and governance in Adevinta’s Product and Tech teams. The rest of the platforms allow for all of Adevinta’s teams and marketplaces to have their own cluster, bucket or space to use as they see fit, under some overarching rules and best practices coded within them.
Hyperight: Can you share some examples of how the platform is used and who the typical users of the platform are?
Iker Martinez de Apellaniz: The platform aims to help Analysts and Data Engineers who are not as proficient in software development and coding. It fills the gap between analytical exploration or prototyping (with notebooks, for example) and advanced Data products like complex Data pipelines or Machine Learning models. That doesn’t mean it is only for non-advanced users. On the contrary, we believe we can help our colleagues who are building Datasets today by simplifying the repetitive parts of that task, and by providing off-the-shelf integration with other key components of the organisation, like the catalogue, access control systems and compliance jobs.
Hyperight: What conditions should be kept in mind if an organisation builds a platform for Data products? From your experience, are there any limitations with such a platform?
Iker Martinez de Apellaniz: I normally start my public talks with a slide that says: “It worked for me, it may not work for you”. Every company is different, so before investing in a platform like this, it’s important to understand the size, organisational structure, maturity and Data culture, as well as the skill set of the staff in the company. Adevinta is a company of companies, with many teams doing similar tasks every day. The Data Products Platform aims to reduce this duplication by showing the work other people are doing with Data and putting some order in the “chaos”.
So if your company has multiple teams using different technologies and strategies to create Datasets, you feel like there are too many sources of truth, there are debates on where the Data comes from or where it goes, or you want to embrace the Data mesh paradigm, then you might need to create some sort of Data Products Platform (or at least process) to consolidate it all. But remember that the first change you’ll need to make is not technical; it is in the mindset of the users, in the procedures, and even in the organisation.
Hyperight: What lessons have you and the company learned while building a Data products Platform?
Iker Martinez de Apellaniz: “It’s not that easy” and “It’s not only the platform”. You need to change many things before even starting to implement something like this. You need to understand the Data journey first: Where does it come from? Where does it go? Who is consuming it? Who is curating it? How does it change over time? You will need to familiarise yourself with the journey from the business need, to discovering where to get Data from, to developing a query to get some results, to prototyping, and finally to production. We understood the key parts of this journey and the friction points in our setup, so we were able to work on them. We decided to use a single Data catalogue for the whole company, to bet on standard Analytical Services, to create this specific platform and also to consolidate on one ML platform. The lesson, again, is that it’s not easy, especially to have all of these platforms linked, well connected and able to move from one to another seamlessly.
Hyperight: Knowing that every company is at a different stage in their Data journey, and based on the journey of Adevinta, what is your final advice and recommendations for organisations that are considering building or are already in the process of building a Data Products Platform?
Iker Martinez de Apellaniz: Start with the message. Get a consistent message on why you want Data products, what this changes in your company, how the company will change to facilitate this journey, what benefits you and your colleagues will get and how everybody can contribute. Once the need and the benefits of having a Data product approach instead of Datasets are identified, the qualitative jump in analytics and ML, and the simplification of daily work, become clear. You can then start sharing the message that a platform to consolidate the shape and requirements of any Data product will be required. Also, start small; you might not need a platform of this size or complexity.
Hyperight: Are there any trends you can see in the upcoming year or two when we speak about Data platforms and products and the synergy between them?
Iker Martinez de Apellaniz: I’ve started to observe Data mesh platforms coming onto the market and Data platforms changing their approach to support Data mesh. I see this as a risk, because Data mesh starts with a change of mindset and culture, and trying to buy Data mesh is not going to work well. But on the more positive side, I think using SLAs/SLOs for Data products is going to grow, and the complexity there will be how quality tools like Great Expectations or Deequ can work in these types of platforms, and how they can trigger alarms that cascade FROM the producing teams TO the consumers, and not in the opposite direction, as happens today.
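The producer-to-consumer cascade Iker describes could be sketched, assuming a simple lineage map from each Dataset to its direct consumers, as a walk downstream from the failing producer. (The lineage map and function names here are illustrative; in practice the failed-check signal would come from a tool like Great Expectations or Deequ, and lineage from the catalogue.)

```python
from collections import deque

def downstream_consumers(lineage: dict[str, list[str]], dataset: str) -> list[str]:
    """Breadth-first walk over lineage edges {producer: [consumers]} to
    find every Dataset affected by a quality failure in `dataset` —
    the set of teams to alert, producer-side first."""
    affected: list[str] = []
    seen: set[str] = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        affected.append(node)
        queue.extend(lineage.get(node, []))
    return affected
```

With this direction of flow, a failed check on a source Dataset alerts its dashboards and ML features before their owners ever notice stale numbers, instead of consumers reporting the problem back upstream.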