Big Data Processing at Spotify: The Road to Scio (Part 2)

In this part we’ll take a closer look at Scio, including basic concepts, its unique features, and concrete use cases here at Spotify. Basic Concepts Scio is a Scala API for Apache Beam and Google Cloud Dataflow. It was designed as a thin wrapper on top of Beam’s Java SDK, while offering an easy way […]


TC4D: Data Quality By Engineers, For Engineers

Changing an engineering culture is one of the biggest challenges for any organization. It requires challenging an existing way of working, and introducing compelling improvements that are adopted by individuals as well as the departments. Tackling tech debt instead of amassing it, using new tooling and infrastructure, or in this case, increasing test coverage. After […]


Big Data Processing at Spotify: The Road to Scio (Part 1)

This is the first part of a 2 part blog series. In this series we will talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how we built the majority of our new data pipelines on Google Cloud with Scio. Scio > Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] > Verb: […]


Commoditizing Music Machine Learning : Services

Five years ago, music personalization at Spotify was a tiny team. The team read papers, developed models, wrote data pipelines and built services. Today personalization involves multiple teams in New York, Boston & Stockholm producing datasets, feature engineering and serving up products to users. Features like Discover Weekly and Release Radar are but the tip of […]


Personalization at Spotify using Cassandra

  By Matt Brown and Kinshuk Mishra At Spotify we have have over 60 million active users who have access to a vast music catalog of over 30 million songs. Our users have a choice to follow thousands of artists and hundreds of their friends and create their own music graph. On our service they also […]


How Spotify Scales Apache Storm

Spotify has built several real-time pipelines using Apache Storm for use cases like ad targeting, music recommendation, and data visualization. Each of these real-time pipelines have Apache Storm wired to different systems like Kafka, Cassandra, Zookeeper, and other sources and sinks. Building applications for over 50 million active users globally requires perpetual thinking about scalability […]


Data Processing with Apache Crunch at Spotify

All of our lovely Spotify users generate many terabytes of data every day. All the songs that are listened to, all the playlists you make, all the people you follow, and all the music you share. Somehow we need to organise, process and aggregate all of this into meaningful information out the other side. Here […]