Understanding the Spotify Web API

Six months ago, when we launched our Web API, we provided twelve endpoints through which developers could retrieve Spotify catalog data. Today the API has 40 distinct endpoints and more are being added all the time. In this post, I’d like to take you on a brief tour of the API and show you some of the programs that have already been developed with it. Continue reading

Diversify – Creating a Hackathon with 50/50 Female and Male Participants

There have been many efforts during the past few years to raise awareness of gender equality in the IT industry. But it’s been slow progress – and we don’t like slow! So we decided to do things differently. Instead of just hoping to achieve gender equality, we made it a requirement for our hackathon “Diversify”.
Continue reading

Personalization at Spotify using Cassandra


By Matt Brown and Kinshuk Mishra

At Spotify we have over 60 million active users who have access to a vast music catalog of over 30 million songs. Our users can follow thousands of artists and hundreds of their friends and create their own music graph. On our service they also discover new and existing content by experiencing a variety of music promotions (album releases, artist promos), which get served over our ad platform. These options have empowered our users and made them really engaged. Over time they have created over 1.5 billion playlists and just last year they streamed over 7 billion hours' worth of music. Continue reading

How Spotify Scales Apache Storm

Spotify has built several real-time pipelines using Apache Storm for use cases like ad targeting, music recommendation, and data visualization. Each of these real-time pipelines has Apache Storm wired to different systems like Kafka, Cassandra, Zookeeper, and other sources and sinks. Building applications for over 50 million active users globally requires perpetual thinking about scalability to ensure high availability and good system performance. Continue reading

Solving MapReduce Performance Problems With Sharded Joins

Sometimes the answer to a sluggish data pipeline isn’t more power in the Hadoop cluster, but a shift in technique. We hit one of these moments recently at Spotify. One of our critical ad analysis pipelines had issues. First it was slow. Then a few days later it was dead, unrunnable with less than 20 GB of memory per reducer.

We traced the problem back to a single bottleneck: one expensive join and a handful of overloaded reducers. We solved things by switching up our join strategy, in the process cutting memory usage by over 75%. Here’s how. Continue reading
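The excerpt above does not spell out the join strategy, so the following is only a rough sketch of one common way to shard a skewed join, not the pipeline's actual code. It simulates reducer groups in plain Python. The names `build_groups`, `join_groups`, and `largest_group`, the shard count, and the index-based shard assignment are all assumptions made for illustration; a real job would more likely pick the shard with a hash or a random salt.

```python
from collections import defaultdict

def build_groups(left, right, num_shards):
    """Simulate the shuffle: group records by (key, shard).

    Records from the large, skewed side (left) are spread over num_shards
    shards per key. Records from the small side (right) are replicated to
    every shard so each left record still meets its matches.
    With num_shards=1 this is an ordinary reduce-side join.
    """
    groups = defaultdict(lambda: ([], []))
    for i, (key, value) in enumerate(left):
        # Assumption: shard chosen by record index; a hash or salt also works.
        groups[(key, i % num_shards)][0].append(value)
    for key, value in right:
        for shard in range(num_shards):
            groups[(key, shard)][1].append(value)
    return groups

def join_groups(groups):
    """Simulate the reducers: emit the cross product inside each group."""
    return sorted(
        (key, lval, rval)
        for (key, _shard), (lvals, rvals) in groups.items()
        for lval in lvals
        for rval in rvals
    )

def largest_group(groups):
    """Number of records the most heavily loaded reducer group must hold."""
    return max(len(lvals) + len(rvals) for lvals, rvals in groups.values())

# Hypothetical data: one "hot" key dominates the left side.
left = [("hot", i) for i in range(1000)] + [("cold", 0)]
right = [("hot", "a"), ("cold", "b")]

plain = build_groups(left, right, num_shards=1)
sharded = build_groups(left, right, num_shards=4)
```

Both variants produce the same joined rows, but the largest group any single reducer has to hold shrinks roughly in proportion to the shard count, at the cost of replicating the small side.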

Date-Tiered Compaction in Apache Cassandra

For my master’s thesis, I developed and benchmarked an Apache Cassandra compaction strategy optimized for time series. The result, the Date-Tiered Compaction Strategy (DTCS), has recently been included in upstream Cassandra. We now use it in production at Spotify.

Marcus Eriksson has written another blog post about this feature on the DataStax Developer Blog.

What is a compaction strategy?

The data files that Cassandra nodes store on disk are called sorted string tables (SSTables). They are essentially plain sorted arrays of data. Cassandra’s superior performance lies in its log-structured storage; it recognizes just how much more expensive random seeks are compared to sequential operations on modern hardware. Granted, this difference is larger on hard disk drives than on solid-state drives. Still, keeping data from fragmenting holds fundamental importance to Cassandra’s overall performance. Continue reading
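To illustrate why sorted files suit sequential I/O, here is a simplified sketch of merging several sorted tables into one in a single pass, keeping only the newest value per key. It is not Cassandra's implementation: the `(key, timestamp, value)` row layout and the `compact` function are assumptions made for this example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def compact(*sstables):
    """Merge tables that are each sorted by key into one sorted table.

    Each row is a hypothetical (key, timestamp, value) tuple. Because the
    inputs are already sorted, heapq.merge reads each one front to back,
    which models sequential reads rather than random seeks.
    """
    merged = heapq.merge(*sstables, key=itemgetter(0))
    compacted = []
    for _key, rows in groupby(merged, key=itemgetter(0)):
        # Keep only the most recently written row for this key.
        compacted.append(max(rows, key=itemgetter(1)))
    return compacted

# Hypothetical tables: the newer one overwrites key "a".
older = [("a", 1, "x"), ("c", 1, "y")]
newer = [("a", 2, "z"), ("b", 2, "w")]
result = compact(older, newer)
```

Here `result` holds one row per key, in key order, with the overwritten value for "a" dropped.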