
Taming the reproducibility crisis in data science: Lars Albertsson


Reproducibility is a common challenge that plagues the scientific community, and data science is no exception. Everyone is talking and writing about how to overcome it, but few have made real progress.

For that reason, we reached out to Lars Albertsson, a data engineering entrepreneur and founder of Mimeria and Scling. Lars has been working with data-intensive environments since 2007 and is well placed to offer advice on the topic.

As Lars points out, the scientific part of data science is often forgotten, and popular tools and practices tend to deliver results that cannot be reproduced. After his presentation at the Nordic Data Science and Machine Learning Summit 2019, we talked to Lars about how data scientists can deal with this crisis and put the science back into data science.

[Photo of Lars Albertsson by Hyperight AB® / All rights reserved.]

Hyperight: Hi Lars, you are a regular speaker at our Summits. This time we had the pleasure of listening to your presentation on “Taming the reproducibility crisis in data science” at the Nordic Data Science and Machine Learning Summit 2019. A survey by Nature found that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. The lack of reproducibility has implications for data science as well. To start with, can we define what reproducibility is and why it is important in data science?

Lars: Thanks for having me at the NDSML Summit this year. It is always a pleasure to speak at your conferences, and meet so many enthusiastic practitioners!

Reproducibility is the ability to run a particular experiment again, and obtain the same results. In the case of data science, running an experiment usually means training a machine learning model and evaluating the results. If you cannot reproduce an experiment, it means that factors that you do not control affect the results. These factors can, for example, be new data arriving, or changes in the technical environment. While a data scientist rarely reproduces exactly the same experiment, they often make a change to the model or hyperparameters and then repeat the experiment. In such a scenario, if uncontrollable factors affect experiment results, the data scientist will not know whether their code change had a positive effect or not.
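
To make this concrete, here is a minimal illustrative sketch (ours, not from Lars’s talk) of a training run in which the uncontrolled factors he mentions are pinned down: the training data is an immutable, dated snapshot rather than “whatever has arrived so far”, and the random seed is fixed. The snapshot path, seed, and model choice are arbitrary for the example; the point is that a rerun yields the same score, so any change in the result can be attributed to a deliberate change in the code or hyperparameters.

```python
# Minimal sketch of a reproducible experiment, assuming pandas and scikit-learn.
# The snapshot path, seed and model are illustrative; the point is that data
# and randomness are pinned, so rerunning the same code gives the same score.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42                                     # fixed seed: randomness is no longer a factor
DATA_SNAPSHOT = "data/training/2019-10-01/"   # immutable, dated snapshot, never "latest"

def run_experiment(n_estimators: int) -> float:
    df = pd.read_parquet(DATA_SNAPSHOT)       # the same bytes on every run
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=SEED)
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=SEED)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Rerunning with the same argument reproduces the same score, so any difference
# between two runs can be attributed to the change in n_estimators.
baseline = run_experiment(n_estimators=100)
candidate = run_experiment(n_estimators=200)
print(f"baseline={baseline:.4f} candidate={candidate:.4f}")
```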


Hyperight: You talked about popular workflows, tools and practices in data science that don’t yield repeatable experiments. What are the implications of such non-repeatable experiments for a company?

Lars: While these workflows, tools, and practices have value, it is important not to rely on them for experimental evaluation. When starting out with a new model, any workflow will seem to work: during the first iterations of a model there are typically large improvements, visible even with noisy experiments. As the model matures, however, the ability to distinguish signal from noise becomes critical. As a rule of thumb, about 10% of ideas for improving data-driven products are good ideas. So unless companies have sharp measurement tools, they will essentially be changing their products at random, sometimes even making them worse.

Hyperight: What would be your advice for companies that are also dealing with these challenges? How can they make sure they enable reproducibility and iterative development?

Lars: Data science today is where software engineering was 20 years ago, with a large variation of heathen practices, which in a few years we will regard as obsolete and arcane. We have made the journey from manual practices to the DevOps culture we have today, and we can learn from that journey to speed up the transition to DataOps, and then MLOps, AIOps, and whatever comes after that. One thing we have learnt from the DevOps transition is to make quality assurance a first-class citizen: QA competence and automation must be present in product teams. Form teams that align with product value streams. In order to build data-driven products that provide value, it is necessary to have a mix of software engineering, data engineering, quality assurance, operations, as well as data science. Data science itself is typically only a tiny part of product efforts, as illustrated by Google in the paper “Hidden Technical Debt in Machine Learning Systems”.

There are still organisations out there that struggle to move to DevOps, e.g. organisations that have segregated responsibilities or change approval boards in the process. These are cultural barriers that need to be removed before one can hope to build effective machine learning products. I strongly recommend the book “Accelerate” (Forsgren, Humble, Kim) to leaders in such organisations.


Hyperight: What is your prediction in terms of tools and technology? Are we going to see tools that enhance rather than hinder reproducibility in the near future?

Lars: In the last year or two, we have seen an increase in the number of tools in the areas of provenance and data discovery, as well as workflow orchestration. The latter is essentially the control plane of scalable data processing. Much remains to be done in these areas, but I think our toolbox will grow. I am concerned, however, that many of these tools are heavy-weight, enterprise-style tools, which do not compose well or integrate easily into a larger context, and that makes reproducibility difficult.
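
As a small, hypothetical illustration of what that control plane looks like, the sketch below uses Luigi, an open-source workflow orchestrator, to produce dated, immutable outputs with explicit dependencies between tasks. The task and path names are invented for the example; the point is that the orchestrator knows which dated outputs already exist, which still need to be built, and how they depend on each other, which is exactly the kind of information that lineage and reproducibility rely on.

```python
# Hypothetical sketch of a daily batch pipeline under a workflow orchestrator
# (Luigi). The orchestrator acts as the control plane: it knows which dated
# outputs exist, which are missing, and in which order tasks must run.
import datetime
import luigi

class CleanEvents(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # One immutable, dated output per day; a rerun for the same date is
        # skipped because the target already exists.
        return luigi.LocalTarget(f"data/clean_events/{self.date:%Y-%m-%d}/events.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("user_id,event\n")       # placeholder for the real transformation

class TrainingDataset(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return CleanEvents(self.date)          # explicit lineage from input to output

    def output(self):
        return luigi.LocalTarget(f"data/training/{self.date:%Y-%m-%d}/dataset.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as out:
            out.write(src.read())              # placeholder feature extraction

if __name__ == "__main__":
    luigi.build([TrainingDataset(date=datetime.date(2019, 10, 1))], local_scheduler=True)
```

Because every output is tied to a date and never overwritten, re-running the pipeline for an old date reproduces the same dataset, which is what makes downstream experiments repeatable.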


Hyperight: And lastly, what are the challenges when establishing technical environments that support reproducibility?

Lars: Keeping things simple and slow. Many data platforms grow too quickly in terms of complexity and heterogeneity; too often, problems are solved by adding a new component. When data is spread over many components, in many formats, keeping track of metadata and lineage becomes difficult.

I also think that the trend towards real-time processing is harmful to reproducibility. It is easier to keep track of your training and evaluation data as lumps of million-record daily datasets than as unbounded streams of millions of records per day. Stream processing has its uses, but it is a choice with pros and cons, and I rarely see a good understanding of the tradeoffs involved. There is a tradeoff between data speed and innovation speed of which most practitioners are unaware.
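
One way to picture the difference, as an illustrative sketch rather than anything from the interview: with dated daily datasets, the exact training input of a model can be recorded as a short, explicit list of partitions that can be replayed later, whereas an unbounded stream offers no such natural unit to pin.

```python
# Illustrative sketch (not from the interview): recording exactly which daily
# datasets a model was trained on. With dated batch partitions the training
# input is a short, explicit list that can be replayed; an unbounded stream
# offers no such natural unit to pin. All names below are hypothetical.
import json
from datetime import date, timedelta

def daily_partitions(start: date, days: int) -> list:
    """Paths of the dated daily datasets used for training, one per day."""
    return [f"data/training/{start + timedelta(days=d):%Y-%m-%d}/" for d in range(days)]

manifest = {
    "model": "churn-model-v7",                 # hypothetical model name
    "training_data": daily_partitions(date(2019, 9, 1), days=30),
}

# Stored next to the model artefact, this manifest is enough to rerun the
# same experiment later, as long as the daily datasets remain immutable.
with open("churn-model-v7.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```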

