Working as a data scientist at a data-first company brings many interesting challenges. It is not only about building music recommendations, but also about performing advanced analytics and machine learning at petabyte scale.
- What does Spotify use all those petabytes of data for?
- Isn't it sufficient to take a sample and train models on a single machine?
- Is Apache Spark a silver bullet for distributed computing?