Monitoring at Spotify: Introducing Heroic


This is the second part in a series about Monitoring at Spotify. In the previous post I discussed our history of operational monitoring. In this part I’ll be presenting Heroic, our scalable time series database which is now free software. Heroic is our in-house time series database. We built it to address the challenges we […]

Monitoring at Spotify: The Story So Far


This is the first in a two-part series about Monitoring at Spotify. In this post, I’ll be discussing our history, the challenges we faced, and how they were approached. Operational monitoring at Spotify started its life as a combination of two systems: Zabbix and a homegrown RRD-backed graphing system named “sitemon”, which used Munin for collection. […]

Improving the accessibility on our iOS client

Story: Lots of the UI of our iOS application is rendered through an internal framework called Ceramic. It’s a tool that allows us to stitch together collection views with different layouts while keeping it memory efficient and covering the usual meta tasks like logging, loading and error handling. It was first used in the New Releases […]

Oh IPv6, Where Art Thou


Illustrations by Jonas Ekman. Since the dawn of time, man has used 32-bit addressing. When the first Homo Erectus crawled out of the sea 6000 years ago, IPv4 infrastructure was already installed and the savannah was teeming with spam, flames and lewd ascii-art. Back in 1992, the Chief Architect, a man named Greg Internet, started […]

How we do large scale retrospectives

Big Retro with everyone in one room

Foreword: This post was initiated by Andy Park, former agile coach here at Spotify. For years we’ve been experimenting with how to do “big retrospectives”. That is, how to capture and spread learnings from big complex multi-site efforts involving dozens of teams. We used to do the traditional “get everyone into one big room” version, but now with […]

Designing the Spotify perimeter

Do not lean

In this blog post we focus on the web load balancers and various proxy systems across the Spotify perimeter. We go through the ways we expose our service network to the Internet and the challenges we faced while automating that. Team autonomy is a big part of the Spotify culture and in this blog post […]

Cassandra: Data-Driven Configuration

Spotify currently runs over 100 production-level Cassandra clusters. We use Cassandra across user-facing features, in our internal monitoring and analytics stack, paired with Storm for real-time processing, you name it. With scale come questions. “If I change my consistency level from ONE to QUORUM, how much performance am I sacrificing? What about a change to […]

Dealing with Java linking problems

Dependency Hell: Most Java developers have probably run into problems where their code throws a NoSuchMethodError or a NoClassDefFoundError at runtime, despite compiling perfectly well. These issues can be very frustrating and hard to solve. This post tries to explain how they happen and explores some things that can be done to fix them. Why […]

Underflow bug

All of us are familiar with overflow bugs. However, sometimes you write code that counts on overflow. This is a story where overflow was supposed to happen but didn’t, hence the name underflow bug. Round-robin: In our Java implementation of the round-robin algorithm, we store the number of connections in the variable size and then we call index() % size to […]
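
The excerpt stops before it describes the bug itself, so the following is only a general sketch of round-robin selection over a wrapping int counter, not the code from the post. The class RoundRobin and the methods naiveNext and safeNext are hypothetical names chosen for illustration. The sketch shows why code that relies on counter overflow has to be careful with %: in Java the remainder takes the sign of the dividend, so the index turns negative once the counter wraps past Integer.MAX_VALUE.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not the implementation described in the post.
class RoundRobin {
    private final AtomicInteger counter;
    private final int size; // number of connections to rotate over

    RoundRobin(int start, int size) {
        this.counter = new AtomicInteger(start);
        this.size = size;
    }

    // Naive selection: % keeps the sign of the dividend, so the result
    // is negative after the counter wraps to Integer.MIN_VALUE.
    int naiveNext() {
        return counter.getAndIncrement() % size;
    }

    // Wrap-safe selection: Math.floorMod always returns a value in [0, size)
    // for a positive size, even when the counter is negative.
    int safeNext() {
        return Math.floorMod(counter.getAndIncrement(), size);
    }

    public static void main(String[] args) {
        // Start at the edge so the wraparound happens on the second call.
        RoundRobin naive = new RoundRobin(Integer.MAX_VALUE, 3);
        System.out.println(naive.naiveNext()); // 1
        System.out.println(naive.naiveNext()); // -2, not a valid index

        RoundRobin safe = new RoundRobin(Integer.MAX_VALUE, 3);
        System.out.println(safe.safeNext()); // 1
        System.out.println(safe.safeNext()); // 1, still a valid index
    }
}
```

Note that floorMod keeps the index in range but does not preserve a perfectly even rotation across the wrap unless size divides 2^32, which is usually an acceptable trade-off for load distribution.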