
Taming the reproducibility crisis in data science: Lars Albertsson


Reproducibility is a common challenge that plagues the scientific community, and data science is no exception. Everyone is talking and writing about how to overcome it, but few have made real progress.

For that reason, we reached out to Lars Albertsson, a data engineering entrepreneur and founder of Mimeria and Scling. Lars has been working with data-intensive environments since 2007 and is well placed to offer advice on the topic.

As Lars points out, the scientific part of data science is often forgotten, and popular tools and practices tend to deliver results that cannot be reproduced. After his presentation at the Nordic Data Science and Machine Learning Summit 2019, we talked to Lars about how data scientists can deal with this crisis and put the science back into data science.

[Photo of Lars Albertsson by Hyperight AB® / All rights reserved.]

Hyperight: Hi Lars, you are a regular speaker at our Summits. This time we had the pleasure of listening to your presentation on “Taming the reproducibility crisis in data science” at the Nordic Data Science and Machine Learning Summit 2019. A survey by Nature found that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. The lack of reproducibility has implications for data science as well. To start with, can we define what reproducibility is and why it is important in data science?

Lars: Thanks for having me at the NDSML Summit this year. It is always a pleasure to speak at your conferences, and meet so many enthusiastic practitioners!

Reproducibility is the ability to run a particular experiment again, and obtain the same results. In the case of data science, running an experiment usually means training a machine learning model and evaluating the results. If you cannot reproduce an experiment, it means that factors that you do not control affect the results. These factors can, for example, be new data arriving, or changes in the technical environment. While a data scientist rarely reproduces exactly the same experiment, they often make a change to the model or hyperparameters and then repeat the experiment. In such a scenario, if uncontrollable factors affect experiment results, the data scientist will not know whether their code change had a positive effect or not.
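
To make this concrete, here is a minimal illustrative sketch (ours, not from Lars’s talk) of a training run in which the uncontrolled factors he mentions are pinned down: the training data is an immutable, dated snapshot rather than “whatever has arrived so far”, and the random seed is fixed. The snapshot path, seed, and model choice are arbitrary for the example; the point is that a rerun yields the same score, so any change in the result can be attributed to a deliberate change in the code or hyperparameters.

```python
# Minimal sketch of a reproducible experiment, assuming pandas and scikit-learn.
# The snapshot path, seed and model are illustrative; the point is that data
# and randomness are pinned, so rerunning the same code gives the same score.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42                                     # fixed seed: randomness is no longer a factor
DATA_SNAPSHOT = "data/training/2019-10-01/"   # immutable, dated snapshot, never "latest"

def run_experiment(n_estimators: int) -> float:
    df = pd.read_parquet(DATA_SNAPSHOT)       # the same bytes on every run
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=SEED)
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=SEED)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Rerunning with the same argument reproduces the same score, so any difference
# between two runs can be attributed to the change in n_estimators.
baseline = run_experiment(n_estimators=100)
candidate = run_experiment(n_estimators=200)
print(f"baseline={baseline:.4f} candidate={candidate:.4f}")
```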


Hyperight: You talked about popular workflows, tools and practices in data science that don’t yield repeatable experiments. What are the implications of such non-repeatable experiments for a company?

Lars: While these workflows, tools, and practices have value, it is important not to rely on them for experimental evaluation. When starting out with a new model, any workflow will seem to work: during the first iterations of a model there are typically large improvements, visible even with noisy experiments. As the model matures, however, the ability to distinguish signal from noise becomes critical. As a rule of thumb, about 10% of ideas for improving data-driven products are good ideas. So unless companies have sharp measurement tools, they will essentially be changing their products at random, sometimes even making them worse.

Hyperight: What would be your advice for companies that are also dealing with these challenges? How can they make sure they enable reproducibility and iterative development?

Lars: Data science today is where software engineering was 20 years ago, with a large variation of heathen practices, which in a few years we will regard as obsolete and arcane. We have made the journey from manual practices to the DevOps culture we have today, and we can learn from that journey to speed up the transition to DataOps, and then MLOps, AIOps, and whatever comes after that. One thing we have learnt from the DevOps transition is to make quality assurance a first-class citizen: QA competence and automation must be present in product teams. Form teams that align with product value streams. In order to build data-driven products that provide value, it is necessary to have a mix of software engineering, data engineering, quality assurance, operations, as well as data science. Data science itself is typically only a tiny part of product efforts, as illustrated by Google in the paper “Hidden Technical Debt in Machine Learning Systems”.

There are still organisations out there that struggle to move to DevOps, e.g. organisations that have segregated responsibilities or change approval boards in the process. These are cultural barriers that need to be removed before one can hope to build effective machine learning products. I strongly recommend the book “Accelerate” (Forsgren, Humble, Kim) to leaders in such organisations.


Hyperight: What is your prediction in terms of tools and technology? Are we going to see tools that enhance rather than hinder reproducibility in the near future?

Lars: In the last year or two, we have seen an increase in the number of tools in the areas of provenance and data discovery, as well as workflow orchestration. The latter is essentially the control plane of scalable data processing. Much remains to be done in these areas, but I think our toolbox will grow. I am concerned, however, that many of these tools are heavy-weight, enterprise-style tools, which do not compose well or integrate easily into a larger context, and that makes reproducibility difficult.
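
As a small, hypothetical illustration of what that control plane looks like, the sketch below uses Luigi, an open-source workflow orchestrator, to produce dated, immutable outputs with explicit dependencies between tasks. The task and path names are invented for the example; the point is that the orchestrator knows which dated outputs already exist, which still need to be built, and how they depend on each other, which is exactly the kind of information that lineage and reproducibility rely on.

```python
# Hypothetical sketch of a daily batch pipeline under a workflow orchestrator
# (Luigi). The orchestrator acts as the control plane: it knows which dated
# outputs exist, which are missing, and in which order tasks must run.
import datetime
import luigi

class CleanEvents(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # One immutable, dated output per day; a rerun for the same date is
        # skipped because the target already exists.
        return luigi.LocalTarget(f"data/clean_events/{self.date:%Y-%m-%d}/events.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("user_id,event\n")       # placeholder for the real transformation

class TrainingDataset(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return CleanEvents(self.date)          # explicit lineage from input to output

    def output(self):
        return luigi.LocalTarget(f"data/training/{self.date:%Y-%m-%d}/dataset.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as out:
            out.write(src.read())              # placeholder feature extraction

if __name__ == "__main__":
    luigi.build([TrainingDataset(date=datetime.date(2019, 10, 1))], local_scheduler=True)
```

Because every output is tied to a date and never overwritten, re-running the pipeline for an old date reproduces the same dataset, which is what makes downstream experiments repeatable.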


Hyperight: And lastly, what are the challenges when establishing technical environments that support reproducibility?

Lars: Keeping things simple and slow. Many data platforms grow too quickly in terms of complexity and heterogeneity; too often, problems are solved by adding a new component. When data is spread over many components, in many formats, keeping track of metadata and lineage becomes difficult.

I also think that the trend towards real-time processing is harmful to reproducibility. It is easier to keep track of your training and evaluation data as lumps of million-record daily datasets than as unbounded streams of millions of records per day. Stream processing has its uses, but it is a choice with pros and cons, and I rarely see a good understanding of the tradeoffs involved. There is a tradeoff between data speed and innovation speed of which most practitioners are unaware.
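
One way to picture the difference, as an illustrative sketch rather than anything from the interview: with dated daily datasets, the exact training input of a model can be recorded as a short, explicit list of partitions that can be replayed later, whereas an unbounded stream offers no such natural unit to pin.

```python
# Illustrative sketch (not from the interview): recording exactly which daily
# datasets a model was trained on. With dated batch partitions the training
# input is a short, explicit list that can be replayed; an unbounded stream
# offers no such natural unit to pin. All names below are hypothetical.
import json
from datetime import date, timedelta

def daily_partitions(start: date, days: int) -> list:
    """Paths of the dated daily datasets used for training, one per day."""
    return [f"data/training/{start + timedelta(days=d):%Y-%m-%d}/" for d in range(days)]

manifest = {
    "model": "churn-model-v7",                 # hypothetical model name
    "training_data": daily_partitions(date(2019, 9, 1), days=30),
}

# Stored next to the model artefact, this manifest is enough to rerun the
# same experiment later, as long as the daily datasets remain immutable.
with open("churn-model-v7.manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```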

