Statistics in the Wild: How to Keep Calm and Carry On

American statistician Richard Royall once said, “Statistics today is in a conceptual and theoretical mess”. Combine that with the reproducibility crisis in science and the lack of proper statistical training among both data scientists and product managers, and you will understand how hard it is to make sense of data. Yet confident inferences and clear recommendations are exactly what is expected from data scientists. The good news is that there is a way to build teams and processes that get things done. Disputed statistical techniques should be abandoned, context should become the focal point of all research, and data scientists should be a natural part of planning and decision-making. Below we briefly discuss each of these points.

Disputed Statistical Techniques Should be Abandoned

It is fair to say that statisticians do not agree on how statistics should be done. There are various camps (frequentists, Bayesians, likelihoodists, etc.) and plenty of criticism and counter-criticism in every direction. While the situation might seem like a deadlock, the good thing is that we can choose techniques in light of that criticism and our specific needs. Most statisticians, hopefully, follow the same “eclectic” approach. Statistics is a tool for making sense of data, and our recommendation is to be skeptical about any technique and, at the very least, to abandon techniques that might get you into trouble. An outline is as follows:

Avoid p-values when possible, especially when working directly with the business. It is perfectly fine to use them when analyzing coefficients of regression models or running ad-hoc statistical tests, but do not communicate them to the business. If your stakeholders do not know what p-values represent, you will have a hard time explaining them, and if they do, you will have an even harder time arguing that changing thresholds after seeing the data is poor scientific practice. Anyone without rigorous statistical training is susceptible to the inverse fallacy, that is, to thinking that “a p-value of 0.05 means that there is a 95% chance that the treatment works”. You might jump from one Zoom call to another doing something similar to what the American Statistical Association did with its 2016 statement on p-values, but it feels like a waste of everyone’s time.
Avoid freedom to choose prior distributions. When dealing with your stakeholders, do not talk about priors. You will know things have gone wrong when you see a long queue of data scientists and stakeholders waving various “Neyman type A distributions” at you as their priors. Freedom to choose priors leads to people spending time on choosing them, and there are much better ways to keep data scientists occupied (as Stanford statistician Steven Goodman said, “The numbers are where the scientific discussion should start, not end.”).
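
To see why the inverse fallacy matters, here is a minimal simulation sketch. Every number in it is an assumption made up for illustration: the share of treatments that truly work, the size of the true effect and the 0.05 threshold. It also draws z-statistics directly instead of simulating raw player data. The function names are hypothetical.

```python
import math
import random


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value of a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def share_of_true_effects_among_significant(
    n_tests: int = 20000,
    share_true: float = 0.1,  # assumed: only 10% of tested ideas really work
    effect_z: float = 2.0,    # assumed: mean z-statistic when the idea works
    alpha: float = 0.05,
    seed: int = 0,
) -> float:
    """Among tests with p < alpha, return the fraction where the effect is real."""
    rng = random.Random(seed)
    significant = 0
    significant_and_true = 0
    for _ in range(n_tests):
        is_true = rng.random() < share_true
        # The z-statistic is centred on 0 under the null, on effect_z otherwise.
        z = rng.gauss(effect_z if is_true else 0.0, 1.0)
        if two_sided_p_value(z) < alpha:
            significant += 1
            if is_true:
                significant_and_true += 1
    if significant == 0:
        return float("nan")
    return significant_and_true / significant
```

Under these made-up assumptions, only about half of the “significant” results correspond to treatments that really work, which is a long way from 95%. The exact share depends entirely on how often ideas work and how strong their effects are, and that is precisely the information a p-value does not carry.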

Some context is necessary, as one might argue that not much is left if we abandon all of the above. I am writing from the point of view of running statistics in the online mobile gaming industry. This industry is highly competitive: we run a lot of statistical tests, and most of them are unique in the sense that we usually do not test or retest the same product many times. This puts us in a situation where we need to blend approaches developed for continuous testing (like the one by Neyman and Pearson) with approaches tailored to specific tests (like subjective Bayes). We ended up using non-parametric Bayes and some bits and pieces of Bayes factors here and there. In a nutshell, the fast-paced environment of online mobile games requires automating and streamlining inferences wherever possible. It also requires building useful, user-friendly tools, but that is a topic of its own.
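
As an illustration of what a non-parametric Bayesian analysis with no prior to argue about can look like, here is a minimal sketch of the Bayesian bootstrap for comparing two groups. It is one well-known technique of this kind and is not a description of the exact method used in our tools. The function names and the choice of the mean as the metric are assumptions.

```python
import random


def bayesian_bootstrap_means(values, n_draws, rng):
    """Draw posterior samples of the mean using Dirichlet(1, ..., 1) weights."""
    draws = []
    for _ in range(n_draws):
        # Normalised exponential draws are distributed as Dirichlet(1, ..., 1).
        weights = [rng.expovariate(1.0) for _ in values]
        total = sum(weights)
        draws.append(sum(w * v for w, v in zip(weights, values)) / total)
    return draws


def prob_treatment_beats_control(control, treatment, n_draws=2000, seed=0):
    """Posterior probability that the treatment mean exceeds the control mean."""
    rng = random.Random(seed)
    control_means = bayesian_bootstrap_means(control, n_draws, rng)
    treatment_means = bayesian_bootstrap_means(treatment, n_draws, rng)
    wins = sum(t > c for c, t in zip(control_means, treatment_means))
    return wins / n_draws
```

Calling prob_treatment_beats_control(control, treatment) with two lists of per-player values returns a single number between 0 and 1 that can be read as “how likely is it that the treatment is better”. That is usually the question stakeholders are asking in the first place.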

Context Should Become the Focal Point of all Research

These days, everyone talks about the importance of understanding business or scientific context. A lot of effort goes into learning the methods and techniques of data science and statistics, so many students end up knowing regression and t-tests without knowing that both are meaningless on their own. First and foremost, data scientists should focus on the situation at hand and on a deep understanding of the relevant factors and processes. Our advice is to create a culture and environment where data scientists want to learn the context of your business. Without understanding how the business operates and how its economy works, data scientists might quickly turn into “data monkeys” who report point estimates and endlessly create dashboards. Here are the main ideas:

Hire curious, science-oriented researchers who want to know how the world works. Build a robust interviewing process, test their programming skills and knowledge, but most importantly, ask them “research questions without answers” and see how they react.
Monitor what motivates people: ask them what they do, how they do it and why. Help them find their passion and become curious about business processes.
Embed data scientists into business teams. They should work closely with those who run the business and have direct access to information and people.

Data Scientists Should be a Natural Part of Planning and Decision-Making

When data scientists are a natural part of planning and decision-making, when their opinion matters and they can influence decisions, they feel ownership and responsibility, which boosts their commitment and helps design better products. Here are the main ideas:

Projects should start with data scientists present in the room. This helps business teams avoid mistakes such as launching products that were doomed to fail from the very beginning or running poorly designed experiments. A clear separation of responsibilities is needed: for example, data scientists can model the signal-to-noise ratio for proposed products and help design experiments, while business teams can be in charge of designing products, strategic planning and making final decisions.
Data scientists should become thinking partners of product managers. The times of “I-know-everything” experts are long gone. There are just too many aspects to any business decision or process. Diversity of ideas and perspectives helps to improve and innovate, and data scientists should be part of discussions about products, features, offers, etc. They can bring scientific thinking to the table, help distill vague ideas into actionable insights and brainstorm ways to simulate or check assumptions.
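
Modelling the signal-to-noise ratio of a proposed product before launch can be sketched with a small Monte Carlo simulation. Everything here is an assumption made for illustration: the observed difference between groups is treated as approximately normal, the decision rule (“act if the observed lift exceeds two standard errors”) is only a placeholder, and the function names are hypothetical.

```python
import math
import random


def standard_error(sd: float, n_per_group: int) -> float:
    """Standard error of the difference between two group means."""
    return sd * math.sqrt(2.0 / n_per_group)


def detection_rate(
    lift: float,                # assumed true difference between the groups
    sd: float,                  # assumed per-player standard deviation
    n_per_group: int,
    threshold_se: float = 2.0,  # placeholder rule: act if lift > 2 standard errors
    n_sims: int = 10000,
    seed: int = 0,
) -> float:
    """Fraction of simulated experiments in which the lift would be detected."""
    rng = random.Random(seed)
    se = standard_error(sd, n_per_group)
    hits = 0
    for _ in range(n_sims):
        # Observed difference = true lift plus sampling noise.
        observed = rng.gauss(lift, se)
        if observed > threshold_se * se:
            hits += 1
    return hits / n_sims
```

Comparing detection_rate(0.1, 1.0, 200) with detection_rate(0.1, 1.0, 2000) shows how the same proposed lift goes from being mostly lost in the noise to being detected most of the time. This is the kind of input a business team can use when deciding whether an experiment is worth running at all.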

We discussed how to run statistics in the wild and how to keep calm. The key takeaway is this: equip your data scientists with powerful tools and make them equal to the rest of the business. It will give you a great competitive advantage, as you will optimize processes faster, fail quickly and learn from your mistakes.

About the Author

Aleksandrs Gehsbargs is Director of Games Data Science at Product Madness. He will speak at the Data Innovation Summit 2023. This is what he has to say about himself: “Since I was a teenager, I was passionate about mathematics and helping people learn and improve their skills. I studied in-depth mathematics and worked as a teacher for younger folks in parallel. After university, I became excited about machine learning and worked in the field for many years, building various predictive and descriptive models. It was a great journey into the world of using advanced machine learning methods to improve business processes. With time I got more interested in statistics and was surprised to discover that statistics is both extremely useful and does not make any sense. Today, I am leading a team of data scientists whose goal is to understand player behaviour by running A/B tests, simulating it using Monte-Carlo modelling and diving into depths of data.”

The views and opinions expressed by the author do not necessarily state or reflect the views or positions of Hyperight.com or any entities they represent.

Featured image: Pressmaster at Envato Elements
