How VR Group uses machine learning to predict passenger number

Passenger numbers are a crucial figure in railway operations, affecting how commuter rail traffic functions overall. For VR Group, Finland’s government-owned railway company, which carried over 14 million passengers on long-distance rail services alone in 2019, predicting passenger numbers calls for a helping hand from machine learning.

Heikki Pulkkinen, Lead Data Scientist at VR Group, gave a first-hand account at the Data Innovation Summit 2019 of how VR Group created and deployed a machine learning model to predict passenger counts, and of the value they got out of it.

Why VR uses machine learning for passenger prediction

As mentioned, the goal was to predict the number of passengers on each of VR Group’s trains. Heikki states that the predictions are needed for several purposes, such as:

Having the right number of conductors on the train
Shift planning
Ensuring the right number of wagons
Revenue management – raising ticket prices when demand increases and releasing more tickets for sale closer to the departure time
Pricing
Train timetable planning.

The number of conductors can be increased as late as 2 days before the train departs, but adding one more wagon requires knowing the prediction about 2 months beforehand, which is why accurate predictions are so crucial.

Two approaches for predicting passenger number

Heikki mentions two main approaches VR Group uses to look at their data.

Predictions based on the sales history of trains, where they detect patterns for the same days of the week: for example, increased demand for tickets on Fridays and Sundays and lower demand on Saturdays.
Looking at the booking data, that is, the number of tickets already booked for a train that departs soon.
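The two approaches can be illustrated with a minimal Python sketch. It is not VR Group's code: the function names, the four-week look-back window, the idea of scaling current bookings by a typical booked share, and all numbers are assumptions made up for illustration.

```python
from datetime import date, timedelta
from statistics import mean


def comparable_day_baseline(sales_history, departure, weeks_back=4):
    """Sales-history approach: average passengers on the same weekday
    over the previous `weeks_back` weeks (hypothetical window)."""
    past = []
    for w in range(1, weeks_back + 1):
        day = departure - timedelta(weeks=w)
        if day in sales_history:
            past.append(sales_history[day])
    return mean(past) if past else None


def booking_curve_estimate(booked_so_far, typical_share_booked):
    """Booking-data approach: scale current bookings by the share of
    tickets that is typically sold this close to departure (assumed known)."""
    return booked_so_far / typical_share_booked


# Made-up history for one train: Fridays are busier than Saturdays.
history = {
    date(2019, 3, 1): 180, date(2019, 3, 8): 170,
    date(2019, 3, 15): 190, date(2019, 3, 22): 180,   # Fridays
    date(2019, 3, 2): 90, date(2019, 3, 9): 100,
    date(2019, 3, 16): 95, date(2019, 3, 23): 95,     # Saturdays
}

friday_baseline = comparable_day_baseline(history, date(2019, 3, 29))
saturday_baseline = comparable_day_baseline(history, date(2019, 3, 30))
curve_estimate = booking_curve_estimate(booked_so_far=75, typical_share_booked=0.5)
print(friday_baseline, saturday_baseline, curve_estimate)
```

The first function only needs past departures, so it works months ahead; the second needs live bookings, so it becomes useful closer to departure.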

Sales history of VR Group — Image credits: Heikki Pulkkinen’s Data Innovation Summit PDF presentation

Booking curve at VR Group — Image credits: Heikki Pulkkinen’s Data Innovation Summit PDF presentation

When thinking about the data and the kind of model to use, Heikki considered different options, such as ANOVA models and s2s neural network models. But he concluded that they would get lost if they used many different, complicated models. So he took a step back and decided to KISS – “keep it simple, stupid”.

Heikki served the predictions in a simple table format which contained the train name and number, departure date, and the number of passengers between station intervals.

He based the passenger number calculations on historical datasets containing data on regular tickets, serial tickets, season tickets, and trips that are exempt from ticket purchase for various reasons. To calculate the predictions, he also needed to include the prediction date and related features, such as days to departure, the number of tickets sold until that date, the number of tickets sold a week before, etc.

Also, features from both approaches can be combined in the calculation, which is a big benefit.
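As a rough illustration of how such a feature row might be assembled, here is a minimal Python sketch. The function `build_features`, the dictionary layout of `bookings`, and the sample numbers are hypothetical; only the feature ideas (days to departure, tickets sold until the prediction date, tickets sold a week before, a comparable-day figure from the sales history) come from the talk.

```python
from datetime import date, timedelta


def build_features(departure, prediction_date, bookings, comparable_day_passengers):
    """Build one feature row for a train departure.

    bookings: dict mapping booking date -> tickets sold on that date
              (hypothetical layout).
    Only bookings made on or before `prediction_date` are counted, so the
    row contains nothing that would be unknown at prediction time.
    """
    week_before = prediction_date - timedelta(days=7)
    sold_to_date = sum(n for d, n in bookings.items() if d <= prediction_date)
    sold_week_before = sum(n for d, n in bookings.items() if d <= week_before)
    return {
        "days_to_departure": (departure - prediction_date).days,
        "weekday": departure.weekday(),           # 0 = Monday ... 6 = Sunday
        "sold_to_date": sold_to_date,             # booking-data feature
        "sold_week_before": sold_week_before,     # booking-data feature
        "comparable_day_passengers": comparable_day_passengers,  # sales-history feature
    }


# Made-up bookings for a Friday departure, predicted two weeks ahead.
bookings = {
    date(2019, 3, 1): 10,
    date(2019, 3, 8): 15,
    date(2019, 3, 14): 20,
    date(2019, 3, 20): 30,   # after the prediction date, so it is ignored
}
features = build_features(
    departure=date(2019, 3, 29),
    prediction_date=date(2019, 3, 15),
    bookings=bookings,
    comparable_day_passengers=180,
)
print(features)
```

Putting the booking-based and history-based values in the same row is what lets a single model use both approaches at once.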

Model selection for predicting passenger number

The model Heikki used is a basic regression model with extreme gradient boosting, chosen because it needed no hyperparameter optimisation and because the more estimators he added, the more accurate it became.

The most important features in the model were the number of reservations made by the prediction date (from the booking data) and the number of reservations on a comparable day (from the sales history).
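To make the idea of boosting concrete, the sketch below implements a deliberately simplified gradient boosting regressor in plain Python, using one-split decision stumps fitted to the residuals. It is not XGBoost, which adds regularisation and other refinements, and not VR Group's model. The two input columns mirror the two features named above, but the data, the learning rate and all function names are made up for illustration.

```python
def fit_stump(rows, residuals):
    """Find the one feature/threshold split that minimises squared error."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [res for r, res in zip(rows, residuals) if r[j] <= threshold]
            right = [res for r, res in zip(rows, residuals) if r[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    return best[1:]


def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_boosting(rows, targets, n_estimators, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(targets) / len(targets)
    preds = [base] * len(rows)
    stumps = []
    for _ in range(n_estimators):
        residuals = [t - p for t, p in zip(targets, preds)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, r)
                 for p, r in zip(preds, rows)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


def mean_abs_error(model, rows, targets):
    return sum(abs(t - predict(model, r)) for r, t in zip(rows, targets)) / len(rows)


# Made-up rows: (reservations by prediction date, reservations on a comparable day)
rows = [(40, 120), (55, 150), (30, 100), (70, 180),
        (45, 130), (60, 160), (35, 110), (65, 170)]
targets = [125, 155, 98, 185, 135, 162, 108, 175]   # final passenger counts

mae_5 = mean_abs_error(fit_boosting(rows, targets, 5), rows, targets)
mae_50 = mean_abs_error(fit_boosting(rows, targets, 50), rows, targets)
print(f"training MAE with 5 estimators: {mae_5:.2f}, with 50: {mae_50:.2f}")
```

On this toy data the training error falls as estimators are added. Whether the same holds on unseen departures has to be checked on held-out data.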

Apart from these features, Heikki mentions one interesting feature they included: the share of work and business travel. It is the output of a different machine learning model, based on questionnaire data about the purpose of the trip.

To test the accuracy of the model, they split the data into three sets: a training set, a test set and a future set. As expected, the further away the departure date is, the more inaccurate the model becomes, Heikki states, as there’s less data available. The error was between 10 and 20 passengers, which is a good result: with around 140 passengers per train, it amounts to an error of roughly 10%.
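The split and the error arithmetic can be sketched as follows. This is a minimal Python illustration: the talk summary does not say how the three sets were defined, so the chronological split by departure date, the cut-off dates and all sample values here are assumptions.

```python
from datetime import date


def split_by_departure(records, test_start, future_start):
    """Chronological split (assumed): older departures for training,
    later ones for testing, and the latest as a 'future' set."""
    train = [r for r in records if r["departure"] < test_start]
    test = [r for r in records if test_start <= r["departure"] < future_start]
    future = [r for r in records if r["departure"] >= future_start]
    return train, test, future


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up departures, one per month.
records = [{"departure": date(2019, m, 1)} for m in range(1, 7)]
train, test, future = split_by_departure(
    records, test_start=date(2019, 4, 1), future_start=date(2019, 6, 1)
)

# Made-up actual vs predicted passenger counts for four trains.
actual = [140, 150, 130, 140]
predicted = [126, 165, 144, 127]
mae = mean_absolute_error(actual, predicted)   # (14 + 15 + 14 + 13) / 4
relative = mae / 140                           # relative to ~140 passengers per train
print(len(train), len(test), len(future), mae, relative)
```

An absolute error of 14 passengers on a train carrying about 140 corresponds to the roughly 10% figure quoted above.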

Deploying the model in production

VR Group’s data architecture consisted of several tools:

EC2 as the data science platform
Snowflake data warehouse, from where they transferred the data to a Jupyter notebook
S3 where they stored the trained models
Git repository where they stored the code which calculates the features
AWS CloudWatch for running predictions in batch processing
AWS Batch to launch a container instance and calculate the predictions. It takes input data from Snowflake and writes the predictions back.
Power BI, where the predictions were served to the business people who decided on adding extra wagons and conductors, and where they tracked model performance.

Data architecture at VR Group — Image credits: Heikki Pulkkinen’s Data Innovation Summit PDF presentation

The takeaway

Summarising his presentation, Heikki highlights some lessons he learned along the way. He advises against getting stuck on which model to use in order to decrease the error rate. Instead of spending too much time in this loop, Heikki states that it’s important to move to production as soon as possible, because this stage can provide real insight into problems that didn’t appear during development.

After deploying the model, it’s crucial to monitor its performance. In VR Group’s case, Heikki discovered a data leak. His training and test errors were good, but the serving error was far from it, and performance monitoring helped him detect the problem.
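A check of this kind can be very small. The Python sketch below compares the error observed in serving against the error measured on the test set and raises a flag when the gap is large. The function name, the tolerance factor of 1.5 and the numbers are hypothetical, not taken from VR Group's monitoring.

```python
def check_serving_error(test_mae, serving_errors, tolerance=1.5):
    """Return the serving MAE and whether it exceeds the test MAE
    by more than `tolerance` times (an arbitrary, assumed threshold)."""
    serving_mae = sum(abs(e) for e in serving_errors) / len(serving_errors)
    return serving_mae, serving_mae > tolerance * test_mae


# Made-up numbers: predicted minus actual passengers for served predictions.
serving_mae, suspicious = check_serving_error(
    test_mae=14.0, serving_errors=[35, -40, 28, -37]
)
print(serving_mae, suspicious)
```

A large gap does not prove a leak by itself, but it is a signal to check whether any training feature used information that is unavailable at prediction time.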

And lastly, it is important to gather feedback from the end-users as early as possible. In this case, major changes to the output column definition had to be made after the initial version was complete.

Watch Heikki Pulkkinen’s Data Innovation Summit 2019 presentation

Watch Heikki Pulkkinen’s Data Innovation Summit 2019 interview
