From Data Orientation into a Data Culture: The Preply Story

Featured image: seventyfourimages at Envato Elements

Since I joined Preply about two years ago, the company has gone through a process of deep transformation. We doubled in size, both in terms of revenue and headcount. We raised Series B and Series C rounds. We transitioned to a subscription business model. We iterated continuously on our product and rethought our strategy while scaling acquisition to grow our customer base in a sustainable fashion.

During this time, the company was in desperate need of exhaustive, reliable information. And the biggest challenge has been sailing the ship (or, better, riding the roller coaster) while in the process of building it. 

Lots of critical questions needed clear answers, and quickly. Although the basics weren’t entirely in place, we had to find ways to meet the most pressing needs and inform the company strategy while building the team and infrastructure required to do so.

This implied scaling the Data Chapter to 30+ members, building a management layer, deploying several new tools, reimplementing product tracking, and resolving governance issues. All at once.

From a data perspective, Preply’s strongest asset was, and still is, our orientation to data. I’ve been in companies where the leadership team ran out of the room, pulling their hair, each time I showed a chart. That is not the case here. ‘Preplers’ at any level are eager to base their decisions on data and demand more of it every week.

The challenge was turning such data orientation into a data culture.

Problem Identification and the Need for a Change

The team had already consolidated most data sources in a data warehouse and established a measurement framework. The biggest blocker, and an often overlooked one, was data accessibility.

Preply was stuck in the sadly common scenario in which people have no direct, flexible access to data, hence relying on Data Analysts to write SQL and build complex ad-hoc dashboards to answer their queries.

The problem with this service model is that it quickly saturates. For each question an analyst answers, the requestor will come back with ten more. If you added more headcount, you’d probably be answering ten and receiving one hundred back.

Aside from not scaling, this approach is both frustrating for the stakeholder and demeaning for the analyst. The former sees their requests accumulating in a growing, slow-moving backlog and won’t get their answer on time. The latter ends up producing a stream of data points while lacking the context of why they’re needed and what their impatient stakeholders are trying to achieve. A recipe for burnout.

This problem was exacerbated by the cross-functional nature of our organization, which translates to over thirty teams to serve, each with a separate backlog and a unique set of priorities. Plus, of course, leadership and the board of directors. Hundreds of dashboards.

The solution was building a self-service layer to unlock data accessibility.

Steps for Transition from Data Orientation to Data Culture

First, we deployed Looker as our Business Intelligence tool for its self-service data exploration philosophy. Unlike traditional dashboarding tools, which require prior work by data specialists, it makes end users autonomous.

Looker provides a semantic layer that allows for defining the underlying data warehouse tables and relationships, as well as the KPI definitions and specific business rules required to exploit the data. In other words, it allows modelling the information and domain expertise otherwise stored in a data analyst’s mind through a relatively simple language (LookML).

As the user drags and drops concepts in a familiar environment (similar to a pivot table), Looker generates and runs the required SQL queries, then returns the data and enables its visualization. The centralized logical model provides a single source of truth, enforcing data governance and a consistent view of the business.
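To make the idea concrete, here is a toy Python sketch of what a semantic layer does. This is not LookML or Looker’s actual internals; the model, table, and field names are hypothetical. It simply shows how one centralized set of definitions can turn a user’s field selection into SQL, so every consumer gets the same KPI logic.

```python
# Illustrative sketch (NOT LookML): how a semantic layer turns a
# user's field selection into SQL from one centralized model.

# Hypothetical model: friendly field names mapped to SQL expressions,
# mirroring how LookML dimensions/measures wrap warehouse columns.
SEMANTIC_MODEL = {
    "table": "analytics.subscriptions",
    "dimensions": {
        "signup_month": "DATE_TRUNC('month', created_at)",
        "country": "country_code",
    },
    "measures": {
        "subscribers": "COUNT(DISTINCT user_id)",
        "revenue": "SUM(amount_usd)",
    },
}

def build_query(dimensions, measures, model=SEMANTIC_MODEL):
    """Generate SQL from the shared definitions, so every user
    gets the same KPI logic (a single source of truth)."""
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)}\nFROM {model['table']}"
    if dimensions:
        # Group by positional references to the selected dimensions.
        sql += f"\nGROUP BY {', '.join(str(i + 1) for i in range(len(dimensions)))}"
    return sql

print(build_query(["signup_month"], ["subscribers", "revenue"]))
```

Because the `COUNT(DISTINCT user_id)` definition lives in one place, changing what “subscribers” means updates every query and dashboard at once, which is the governance benefit described above.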

Finally, Looker provides embedding capabilities and an API layer for building data applications, all of it fully managed and integrated with Git for code forking and versioning.

The semantic layer concept isn’t new: multiple BI vendors (Microstrategy, Business Objects, IBM, and more) provided similar functionality years ago. Yet the versatility of LookML and the huge leap in performance made by the latest cloud data warehousing technologies allow for far greater speed and flexibility. Although other vendors are trying to catch up (Microsoft, Thoughtspot and, more recently, dbt), they cannot compare in terms of completeness of vision and maturity.

Then, we introduced Snowflake for data warehousing. That choice was dictated by its superior UX, the separation of storage and compute, support for virtually unlimited, almost-linear scaling (with the Enterprise edition) and, especially, how gracefully it handles concurrency. That’s crucial for Looker customers, as Looker often generates a daunting number of concurrent queries that other technologies (AWS Redshift, to name one) struggle to process.

We rely on Monte Carlo and its data observability platform for optimal reliability and lineage. Its out-of-the-box philosophy makes it painless to deploy, and it comes with automatic anomaly detection and lineage, along with support for more complex custom rules.
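To illustrate the kind of check an observability platform automates, here is a minimal, self-contained sketch of a volume-anomaly rule. This is generic illustrative code, not Monte Carlo’s API; the sample numbers are invented. It flags a table whose daily row count drifts too many standard deviations from its recent history, a simple z-score rule similar in spirit to automated freshness and volume monitors.

```python
import statistics

def is_anomalous(history, todays_count, threshold=3.0):
    """Flag today's row count if it deviates more than `threshold`
    standard deviations from the historical mean (a simple z-score
    rule, in the spirit of automated volume checks)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return todays_count != mean
    return abs(todays_count - mean) / stdev > threshold

# Invented daily row counts for a hypothetical table.
daily_rows = [10_200, 9_950, 10_080, 10_310, 9_890, 10_150, 10_020]
print(is_anomalous(daily_rows, 10_100))  # ordinary day
print(is_anomalous(daily_rows, 2_300))   # sudden volume drop
```

Observability platforms layer learned thresholds, lineage, and alert routing on top of rules like this, which is why an out-of-the-box deployment pays off quickly.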

With a state-of-the-art BI stack in place, we could then deal with Data Science. We selected and deployed Databricks on top of Delta Lake: a convenient, fully managed cloud production environment featuring Spark clusters and Python notebooks (among other components), accessible through an excellent interface with quality UX.

As we’re currently facing some limitations when training data-intensive ML algorithms, such as Learning to Rank, we’re exploring the latest innovations in the field. We have high hopes for newcomer QBeast, which leverages sophisticated sampling and indexing to parse only a fraction of the data, drastically reducing both processing and training times while maintaining full compatibility with Spark.

Lastly, we deployed Amplitude for self-service product analytics and integrated it with the existing data platform.

Preply has a beautiful experimentation culture, and we run hundreds of AB tests each quarter. We are the proud creators of an in-house experimentation platform, which allows us to modify the user experience and measure the impact of our initiatives.
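The statistics behind such AB tests can be sketched compactly. The code below is not Preply’s in-house platform, just a standard two-proportion z-test on invented conversion numbers, the kind of significance check an experimentation platform runs to decide whether a variant truly moved a metric.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test: did variant B's conversion
    rate differ significantly from control A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Invented example: 4.8% vs 5.6% conversion on 10k users each.
z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With hundreds of tests per quarter, automating this check (plus guardrails like minimum sample sizes) is what makes an experimentation platform worth building in-house.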

Tracking data was fit for experimentation but hadn’t been designed with analytics in mind. We found ourselves with a daunting 500+ undocumented events, plenty of product dependencies and little or no governance in place.

We resolved this by introducing a governance layer so that only approved events would reach Amplitude users through our integration. We would whitelist new, clean events while sanitizing the legacy ones. The data model and taxonomy ownership are now centralized, ensuring consistency.
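A whitelist of this kind can be sketched in a few lines. The event names and registry below are hypothetical, not Preply’s actual taxonomy or Amplitude’s API; the point is the governance pattern: only events in a centrally owned registry pass through, and everything else is captured for review.

```python
# Hypothetical registry of approved, documented events; anything
# outside it is dropped before reaching the analytics tool.
APPROVED_EVENTS = {
    "lesson_booked": {"owner": "marketplace", "since": "2022-01"},
    "subscription_started": {"owner": "payments", "since": "2022-03"},
}

def filter_events(raw_events):
    """Let only whitelisted events through; collect the rest so the
    taxonomy owners can review, document, or retire them."""
    approved, rejected = [], []
    for event in raw_events:
        target = approved if event["name"] in APPROVED_EVENTS else rejected
        target.append(event)
    return approved, rejected

stream = [
    {"name": "lesson_booked", "user_id": 42},
    {"name": "legacy_click_v3", "user_id": 42},  # undocumented legacy event
]
approved, rejected = filter_events(stream)
print(len(approved), len(rejected))  # → 1 1
```

Keeping the registry in one place, with a named owner per event, is what makes it possible to sanitize 500+ legacy events incrementally instead of all at once.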

With this architecture in place, Preplers can resolve most of their data needs autonomously. When anything requires a deeper look, they can count on a team of motivated Data Analysts and Scientists, eager to tackle complex problems and bring value to the business now that they’re relieved from pulling data for other people, day in and day out.

With the basics taken care of, the team is now focused on delivering business value through analytics and data science. We’re aiming to be innovators in marketing measurement, ranking and pricing, to name a few areas. Besides, of course, diving into customer behaviour to unlock business opportunities. Meanwhile, we’ve started a Data Academy program to increase data literacy across the company.

If you ever considered joining us, brace yourself. This is a rocket ship.

About the Author

Alessandro Pregnolato - VP of Data at Preply

Alessandro Pregnolato is the VP of Data at Preply, the online language tutoring marketplace. Best known for building and scaling the data function at Typeform, Marfeel, and Preply, he has also been advising several tech unicorns such as Moonpay and Paack. He teaches SaaS product analytics at EADA Business School in Barcelona, where he is currently based. His previous career was as a professional musician. Alessandro’s experience in Business Intelligence and Data Science dates back twenty years, when he fell into the data world by pure chance, joining Adobe as a production planner. Since then, he has refocused his career multiple times. Alessandro Pregnolato will speak at the 2023 edition of the Data Innovation Summit.

The views and opinions expressed by the author do not necessarily state or reflect the views or positions of Hyperight.com or any entities they represent.

