There have been many efforts during the past few years to raise awareness of gender equality in the IT industry. But it’s been slow progress – and we don’t like slow! So we decided to do things differently. Instead of just hoping to achieve gender equality, we made it a requirement for our hackathon “Diversify”.
By Matt Brown and Kinshuk Mishra
At Spotify we have over 60 million active users who have access to a vast music catalog of over 30 million songs. Our users can follow thousands of artists and hundreds of their friends and create their own music graph. On our service they also discover new and existing content by experiencing a variety of music promotions (album releases, artist promos), which are served over our ad platform. These options have empowered our users and made them really engaged. Over time they have created over 1.5 billion playlists, and just last year they streamed over 7 billion hours' worth of music.
Spotify has built several real-time pipelines using Apache Storm for use cases like ad targeting, music recommendation, and data visualization. Each of these real-time pipelines has Apache Storm wired to different systems like Kafka, Cassandra, Zookeeper, and other sources and sinks. Building applications for over 50 million active users globally requires perpetual thinking about scalability to ensure high availability and good system performance.
Sometimes the answer to a sluggish data pipeline isn’t more power in the Hadoop cluster, but a shift in technique. We hit one of these moments recently at Spotify. One of our critical ad analysis pipelines had issues. First it was slow. Then a few days later it was dead, unrunnable at less than 20 GB of memory per reducer.
We traced the problem back to a single bottleneck: one expensive join and a handful of overloaded reducers. We solved things by switching up our join strategy, in the process cutting memory usage by over 75%. Here’s how.
For my master’s thesis, I developed and benchmarked an Apache Cassandra compaction strategy optimized for time series. The result, the Date-Tiered Compaction Strategy (DTCS), has recently been included in upstream Cassandra. We now use it in production at Spotify.
Marcus Eriksson has written another blog post about this feature on the DataStax Developer Blog.
What is a compaction strategy?
The data files that Cassandra nodes store on disk are called sorted string tables (SSTables). They are essentially plain sorted arrays of data. Cassandra’s superior performance lies in its log-structured storage; it recognizes just how much more expensive random seeks are compared to sequential operations on modern hardware. Granted, this difference is larger on hard disk drives than on solid-state drives. Still, keeping data from fragmenting holds fundamental importance to Cassandra’s overall performance.
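To make the idea a bit more concrete, here is a minimal sketch of what compacting sorted files amounts to. It assumes each SSTable can be modeled as a Python list of `(key, timestamp, value)` tuples sorted by key; the function name `compact` and the sample data are made up for illustration, and this is not how Cassandra itself is implemented.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def compact(sstables):
    """Merge several key-sorted runs into one, keeping the newest value per key.

    Each run is a list of (key, timestamp, value) tuples sorted by key.
    The merge reads every input front to back, so it only needs
    sequential access to each run.
    """
    # Lazily merge the already-sorted runs by key.
    merged = heapq.merge(*sstables, key=itemgetter(0))
    result = []
    # Rows with the same key are now adjacent; keep the most recent one.
    for _, versions in groupby(merged, key=itemgetter(0)):
        result.append(max(versions, key=itemgetter(1)))
    return result


# Hypothetical data: an older run and a newer run that overwrites key "a".
old = [("a", 1, "x"), ("c", 1, "y")]
new = [("a", 2, "z"), ("b", 2, "w")]
compacted = compact([old, new])
```

Because every input is already sorted, the merge never has to jump around within a file, which is the property the paragraph above describes. A compaction strategy is then, roughly, the policy for choosing which runs to merge and when.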
All of our lovely Spotify users generate many terabytes of data every day: all the songs they listen to, all the playlists they make, all the people they follow, and all the music they share. Somehow we need to organise, process and aggregate all of this into meaningful information. Here are just a few of the things we need to get out of the data:
- Reporting to record labels and rights holders so we can make sure everyone gets paid
- Creating toplists of what is the most popular music right now
- Getting feedback on how well different aspects of the product are working so we can improve the user experience
- Powering our intelligent radio and discovery features
To store and process all this data we use Hadoop, a framework for distributed storage and processing of huge amounts of data.
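The toplist item in the list above gives a feel for the map-then-reduce shape such jobs tend to take. The following is only a rough sketch in plain Python with no Hadoop involved; the event format and the names `map_plays`, `reduce_counts` and `toplist` are assumptions made for illustration, not our actual job code.

```python
from collections import Counter


def map_plays(events):
    """Map step: emit a (track, 1) pair for every play event."""
    for event in events:
        yield event["track"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per track."""
    totals = Counter()
    for track, n in pairs:
        totals[track] += n
    return totals


def toplist(events, size):
    """Return the `size` most played tracks as (track, plays) pairs."""
    return reduce_counts(map_plays(events)).most_common(size)


# Hypothetical play events; real logs would carry far more fields.
events = [
    {"user": "u1", "track": "song_a"},
    {"user": "u2", "track": "song_b"},
    {"user": "u3", "track": "song_a"},
    {"user": "u1", "track": "song_c"},
    {"user": "u2", "track": "song_a"},
    {"user": "u3", "track": "song_b"},
]
top = toplist(events, 2)
```

In a distributed setting the map step would run on many machines in parallel and the framework would group the emitted pairs by track before the reduce step; here everything runs in one process to keep the example small.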
Here’s part 2 of the animated video describing our engineering culture. Check out part 1 first if you haven’t already seen it!
This is a journey in progress, not a journey completed, so the video is somewhere between “How Things Are Today” and “How We Want Things To Be”.
Here’s the whole drawing:
(Tools used: Art Rage, Wacom Intuos 5 drawing tablet, and ScreenFlow)