Mapping DevOps learnings to management

There are many blog posts, articles and tweets about DevOps out there on the internet. Some of them discuss the pros/cons, some the consequences of its introduction, while others discuss how it was implemented.

Although this article refers to some DevOps adoption aspects, its main focus is on applying DevOps principles to a different area: engineering leadership.

In this article, I’ll refer to “us” every now and then, and to avoid confusion, here’s a quick intro:

  • Ingrid Franck is the engineering team’s agile coach.
  • Ramon van Alteren is the engineering team’s product owner.
  • Mattias Jansson (yours truly) is the engineering team’s chapter lead, sometimes lazily called a team lead, like I do in this article.

The engineering team itself consists at the time of writing of nine people.

DevOps culture at Spotify

Many parts of DevOps culture have pervaded Spotify from its early beginnings.

The first six people employed by Spotify were engineers, one of whom had an Operational role. This was back when Spotify was just another startup in an apartment. This background and admission of the importance of operational thinking from the very beginnings of Spotify history has heavily influenced the relationship between Dev and Ops.

We’ve come a long way since then: there are hundreds of engineers at Spotify now, spread out over four cities and three time zones. Although the DevOps mentality does not permeate the hearts and souls of every individual in the engineering team, and though it is not actually mentioned by name anywhere, one can see it show up everywhere in the day-to-day workflow as well as in conversations by the coffee machine.

Many startups are staffed by developers and the odd business guy. In those firms the operations engineer is hired once the code has been written and there is a need for someone to deploy and maintain the system. At Spotify, the two camps are overlapping in responsibilities, skill sets and interests. We have some uncommonly opsish developers here, and likewise many of our ops engineers have a strong developer background. The advantage of having this overlap is immense in the day-to-day, solving potential blockers long before they arise.

Backend developers deploy their code in production by themselves, with or without an ops engineer to hold their hand. This in turn, more often than not, encourages the dev in question to think seriously about traditionally operations-focussed problem areas such as monitoring, logging, packaging, and availability.

Since we have thousands of servers in production, our ops engineers have moved from thinking in terms of individual servers to thinking in terms of clusters of servers. One-off manual fixes on individual servers are avoided when possible. Instead, the ops engineer modifies, through code, the state of the authoritative data model of the backend, which in turn is reflected onto reality via our configuration management system, backed by Puppet.
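To make the idea of "modify the data model, let the tooling converge reality" a bit more concrete, here is a minimal Python sketch. It is not our actual data model and it is not Puppet's API: `ServerState`, `plan_changes`, the host names and the package names are all hypothetical, chosen only to illustrate comparing a desired state with an observed one.

```python
# Hypothetical sketch: names are illustrative, not Spotify's or Puppet's API.
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerState:
    role: str
    packages: frozenset


def plan_changes(desired, actual):
    """Return the per-host actions needed to move `actual` towards `desired`."""
    actions = {}
    for host, want in desired.items():
        have = actual.get(host)
        if have == want:
            continue  # already converged, nothing to do
        have_packages = have.packages if have is not None else frozenset()
        actions[host] = {
            "install": sorted(want.packages - have_packages),
            "remove": sorted(have_packages - want.packages),
        }
    return actions


# The desired state is what an engineer edits; hosts are never touched by hand.
desired = {
    "web1": ServerState("web", frozenset({"nginx", "app"})),
    "web2": ServerState("web", frozenset({"nginx", "app"})),
}
# The observed state, as a configuration management run might report it.
actual = {
    "web1": ServerState("web", frozenset({"nginx", "app"})),
    "web2": ServerState("web", frozenset({"nginx", "oldapp"})),
}

changes = plan_changes(desired, actual)
print(changes)
```

The point of the sketch is that the fix for `web2` is expressed as a change to data, so the same change applies to every server in the cluster that drifts in the same way.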

Backend services typically have two so-called System Owners – one from Dev and one from Ops. Their core responsibilities reside in their respective dominions – the dev system owner owns the code, design and architecture, while the ops system owner owns the service once life is blown into it when it is deployed and is serving traffic. However, these two areas have great overlap, and thus the two system owners have regular checkups to discuss scalability, changes in neighbouring backend topology, coming new products which will affect service behaviour, etc.

A final example is how we have a dedicated team working on automation tools and services. Developer and Operations staff work side-by-side for weeks at a time solving specific problems, raised and prioritised by Ops themselves.

All this being said, our organisation still has a long way to go. The numbers of customers, servers, data centers, services, offices, and staff are growing all the time, and thus yesterday’s solutions have a marked tendency of scaling badly into the present. On top of this, the ratio of Dev to Ops staff has changed in a way that has diluted the DevOps mentality in some ways.

Lessons learned from DevOps

The lessons the DevOps movement has taught us are many, but one of the most important is the value of aligning the goals of Dev and Ops. Get them to work side-by-side, and give them space to learn from each other. By getting the two groups to communicate regularly, the developer will have a chance at understanding the reasons why Ops need to act as a blocker at times, and will learn how to plan ahead and produce changes in alignment with the requirements of the Operational environment. Also, once Ops start to see their hardware and the services running on it as malleable data structures to which one can apply code, the developer suddenly has a different reach: he/she will be able to affect not just the code as it exists in packages, but will have much more flexibility in how the packages are applied in production.

Likewise, by injecting operational thinking into the development process, the frequency with which Operations engineers need to spend time on interruptions and clean-up is lowered, and their time can be spent on longer-term projects.

Generally, the DevOps work methods have helped both sides of the organisation think of the entire system (the value stream), not just the components they are traditionally responsible for.

Problems in Management-land

In many modern tech companies, one can find three distinct responsibilities that touch on an engineering team. In some smaller firms, and indeed even in larger ones, these responsibilities are typically gathered in one or two roles. At Spotify, we try to separate them so that three separate people own these distinct responsibilities. So what responsibilities do these roles entail?

  • The Product owner is accountable for delivering products to one or more stakeholders in a timely fashion. (PO)
  • The Team lead’s mission is to maintain the team so its members are fit for the challenges expected of them, and is responsible for the architectural soundness of the internals of the product. (TL)
  • The Agile coach is dedicated to nurturing an environment of engaged and healthy team members who continuously improve themselves, their product deliveries and their team collaboration. (AC)

These three roles are at times at odds with each other. In most organisations, the people who have these roles have differing missions and can pull the team in different directions.

Have you ever experienced a conflict between the Product owner, pushing for an essential new feature, and the Team lead, who is concerned with the team’s frustrations over the mountain of technical debt in the existing codebase? Or the Agile coach who feels that the team needs to stop and reflect more often in order to figure out how to improve themselves, but the Product owner hesitates and seems worried that the rhythm of the team will be disrupted? Or when the Team lead feels that the team is agile enough and actively blocks the increasingly agitated attempts by the Agile coach to help the team help themselves?

The problems we mention above are in many ways similar to the conflicting goals between Dev and Ops groupings in archetypical firms.

Companies new to DevOps often discover blockers (structural, social, cultural, etc) inside their firm, making adoption difficult. Even firms where DevOps prevails will find plenty to disagree upon. It is often the case that the entire problem set resembles two people in a single bed with a blanket that is too small to cover both. The result is a lot of pulling/shuffling of this blanket to try and cover all the exposed parts.

It is not unusual to see old-school Operations engineers who refuse to see infrastructure as code, preferring to manually modify configuration files on target machines; Developers who look down on operational work, who feel that their job is done once the build completes and that whatever happens when the code hits bare metal is someone else’s problem; and Dev and Ops leadership who are at odds with each other because of a mismatch of missions (the number of features shipped vs. keeping downtime at a minimum).

The thing is, what makes DevOps so attractive is that it’s all about encouraging developers and Ops engineers to talk and learn from each other. At the end of the day, it’s all about communication. About aligning goals. Once we start listening to each other, we will have taken the first step towards some sort of DevOps synergy.

So… what happens when you do the same thing with an engineering team’s closest leadership figures?

What if we get the PO, TL and the AC to regularly talk about their concerns, their short and long-term goals for the team, and to teach each other the realities within which they live? Will we find similar synergy effects in these three roles? Will we not only eliminate conflicts between them, but also find something… more?

Potlac

That’s what we did, six months ago. The three of us had never worked together in this particular constellation before, and we were willing to do some serious experimentation. Our aim was to try to minimise misunderstandings and to possibly get some sort of synergy effect.

So what did we do? How did we apply the lessons of DevOps into our work?

Weekly sync meeting

Half an hour every week, we discussed the current state, from each of our perspectives. Each brought at least one topic to the session, which we then discussed and digested together. Example topics would include increasing stakeholder involvement, upcoming conferences, or the theme of the next retrospective.

Quick chats and sync before key meetings

Before one of us held a critical meeting, we would have a quick chat with the others to get last-minute feedback. One would reiterate the purpose of the meeting, if the goal(s) were realistic, or what one should do to truly get the involvement of the meeting attendees.

Regular one-on-ones

Each of us also has 1:1s with each of the others once a week, either at the office, in a quick phone call in the evening, or over lunch. By reducing the number of people in the discussion by one, the tone of the conversation and the problems raised became more personal, while still orbiting our common goals.

Mock meetings

If a meeting was special, or the agenda experimental, we would hold mock meetings with each other, practicing the tricky parts to check for holes in reasoning or for discovering unexplained assumptions.

These are the four most obvious ways in which we worked together. The emergent property of this group of three was that we started to think in each other’s shoes, and in some sense each of us was suddenly wearing all three hats – albeit our original one still larger than the other two. It made us stronger as a group, and the more we discussed the further we deepened our cooperation.

A critical success factor for the setup described is that we shared a strong commitment to the team functioning as opposed to a single engineer functioning. In our opinion, this is a fairly large part of what made this work; the shared idea that the team is bigger than the sum of its parts.

Some time after we had worked in this manner, we accidentally came upon a name for ourselves. I had, at some point, placed photos of us on a board, with our role initials under the pics- it spelled “PO”, “TL”, “AC”. It sounded like something pronounceable, and when Ingrid realised that Potlac can be pronounced Potluck, the name fastened itself and never came off. (As you might or might not know, a potluck is a meal to which everyone brings their own food for sharing with the others. A successful potluck requires some sort of coordination between the people who come to the meal, otherwise one will have all dessert and no mains.)

Benefits of Potlac

“That’s all very fine, and sounds nice. But what do you get out of it?”, you might ask. Well, it depends a bit on what responsibilities you have. Below we will each state the main unforeseen benefits we gained by working together in this fashion.

Mattias: The Team lead

Management can be a lonely job. While engineers can swarm around a problem, I often cannot do this. I can ask the team to do many things for me, but some things simply cannot be delegated or shared (I’m thinking career goals, personal confidences, salaries, etc). Though I cannot share these things with my Potlac colleagues either, there are other topics I can and do share. Examples include discussing new and different ways of solving conventional problems, or discussing how to scale our team in a sustainable way. Often, it was through these discussions that I got a piece of the puzzle which helped me understand some problem I was working on.

Since we have an ongoing dialogue in Potlac, we have grown to know each other’s visions and our respective views on the state and history of the team. Through this, together with our different networks (both inside and outside the company), I get an advantage in that I see things on the horizon long before I would have otherwise. I can then prepare in time, and snuff out many problems before they become big ones.

Ingrid: The Agile coach

The Potlac gave me an opportunity and platform to have discussions around agile. A stage to dialog about servant leadership. A forum to find a consensus on what it means. It became an interface for the leadership team to focus on results and to hold each other accountable. It soon was a sandbox where conversations of empowerment, impediments and conflict were hashed out. It was also a classroom where we talked about approaches to stakeholder meetings, planning meetings, retrospectives and one-on-ones. It also developed into coaching sessions where we talked about our failures and what we learned. Instead of three individuals, each working toward our respective goals, we became a team- a leadership team with a united mission of supporting our engineers.

Ramon: The Product owner

Pushing for delivery can be just as lonely as managing a team. By working so closely together with both a Team lead and an Agile coach I gained a multitude of benefits. One of the most important ones is focus: I can concentrate on delivering enhancements to the products I am responsible for because I know that the other two equally important aspects of team leadership are covered by my two colleagues. Mystifying incidents of the past, such as a sudden lack of commitment by an engineer for some time, became a lot clearer with the added information from Mattias on the personal situation of that engineer. Ingrid opened up entirely new ways of handling typical issues with the team, which helped me a great deal.

The second most important benefit I see is the typical thorny problem of avoiding (or repaying) technical debt. An open discussion between people representing the different interests involved makes it easier to approach this problem. Otherwise, it’s just an internal debate in a single person’s head.

We have seen how our initial experiments with Potlac brought new insights into our day-to-day work dynamics; unforeseen, and yet somehow expected. This way of putting ourselves in each other’s shoes has broadened our horizons, and given each of us more context when considering a problem.

At the end of the day, it’s really all about exposing context. Context around why a product is necessary now rather than later; Context on the background to a conflict in the team; Context to help select the right combination of agile methodologies for this particular team at this particular time.

DevOps helps the engineers practicing it to better understand the points of view of tangential groups of people. It exposes their needs, and the requirements set upon them. In a similar way, Potlac has helped the three of us by giving us context in a focussed high-bandwidth channel.

So… what now?

Though this has been an amazing ride, the environment within which we have worked is changing. And with it, we will need to adapt our methods in some way. The team, which consisted of nine people (plus us) will probably double in size during the coming year. The company is, as always, expanding and with this comes changing focus. Ingrid has been asked to work as an Agile coach elsewhere, and Ramon has taken on broader PO responsibilities. Mattias will have a flurry of new engineers in his team to manage. Each of us will need to form brand new Potlac groups in our new surroundings.

One question we ask ourselves is how Potlac will scale with the growth of an engineering team. With a larger number of people in the team, we will most likely have more products to develop and maintain. This implies more product owners. The Team lead will not be able to manage this many people- some sort of team split is on the horizon. Finally, we might need two Agile coaches, if the team grows to this size. Will Potlac function if its members grow from three to six?

Another important question we have considered is how difficult it would be to try to duplicate this leadership model. In software- if a hack helps solve an immediate problem, it is a good thing. However, to really get true value from the hack, it must be documented and portable. Is Potlac portable?

It would be convenient if we could produce a puppet recipe covering how to deploy Potlac in other teams. Alas, puppet does not cover this particular feature, and until that time comes, perhaps this article will help iron out what we did to get the results we described above.

Happy org-hacking!

PS: If you want to know more about how we organise our whole tech organisation, see Henrik Kniberg and Anders Ivarsson’s paper on Scaling Agile at Spotify.

Analytics at Spotify

At the heart of Spotify lives a massive and growing data-set. Most data is user-centric and allows us to provide music recommendations, choose the next song you hear on radio and many other things.  We do our best to base every decision, programmatic and managerial, on data and this extends into the culture.

At my previous job, I developed software for Ad Agencies in the Digital Asset Management space, so you can say I was relatively new to “Big Data” as it were. New engineers at Spotify will notice that the culture has a way of engulfing you in a data-driven mindset. After working at Spotify for only a few months, I was talking about term weighting and signing up for internal courses on the R programming language.

I also participated in a hackathon where I developed a Spotify App code-named Genderify that tapped into our massive data-set to determine exactly how “manly” a playlist is. It was mostly a joke, but utilized listening data to provide an accurate statistical map of a playlist and displayed a result of 0-100, 100 representing an extreme edge case where a person registered as female had never listened to any tracks on your playlist.

Our Analytics Pipeline powers far more than satirical apps. It allows us to recognize trends, discover bugs, and analyze the effect of an event on a user and the entire ecosystem.

Analytics Tools

Internally, everyone (not just engineers) has access to three tools: Dashboards, Data Warehouse, and Luigi. Dashboards provides an interface similar to Google Analytics and allows users to create their own custom screens containing data they are interested in from our pipeline. For instance, we have dashboards that show us user growth in particular regions, or user engagement, or even the number of emails we deliver.

Data Warehouse is a more complex system that allows you to access our data-set directly. You can query the data, create map/reduce jobs using Hive, and even create mini data pipelines if that’s the kind of thing you’re into. For more complex operations, we have Luigi at our disposal, governing a zoo of Python, Pig and other animals which can be made to talk to any storage systems, run machine learning algorithms and even provide daily reports.

So what do we do with all this data? Pretty much everything. An example of an entirely data-driven decision would be our choice of a music recommendation algorithm that powers Spotify Radio.

Analytics Infrastructure

Most of our recurring data is added to our analytics pipeline by a set of daemons that constantly parse the syslog on production machines looking for messages we have defined along with the associated data for each message. Matching data is compressed and periodically synced to HDFS.  Typically data is available in our Data Warehouse and Dashboards within 24 hours, but in some cases data is available within a few hours or even instantly through tools like Storm.

So all this sounds… complicated. And I assure you, to build a pipeline and infrastructure like we have, it is. But to make use of it is actually really easy.  Engineers can easily add data to our analytics pipeline by adding a new message to our log parser and simply logging information to syslog using the correct format.
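As a rough illustration of what "adding a new message to the log parser and logging in the correct format" could look like, here is a minimal Python sketch. The real message format and parser are not shown in this post, so everything here is an assumption: the tab-separated layout, the `UserLogin` message, its fields, and the `format_message` / `parse_line` helpers are all hypothetical.

```python
import re

# Hypothetical message definitions: message name -> ordered field names.
MESSAGE_FIELDS = {
    "UserLogin": ("username", "timestamp"),
}

# Assumed line layout: "<syslog prefix>: <MessageName>\t<field>\t<field>..."
LINE_RE = re.compile(r": (?P<name>\w+)\t(?P<payload>.*)$")


def format_message(name, *values):
    """Build the payload a service would hand to syslog."""
    fields = MESSAGE_FIELDS[name]
    if len(values) != len(fields):
        raise ValueError("%s expects fields %s" % (name, ", ".join(fields)))
    return "\t".join([name, *map(str, values)])


def parse_line(line):
    """Return (message name, field dict) for known messages, else None."""
    match = LINE_RE.search(line)
    if match is None or match.group("name") not in MESSAGE_FIELDS:
        return None  # not one of the messages we have defined
    fields = MESSAGE_FIELDS[match.group("name")]
    values = match.group("payload").split("\t")
    if len(values) != len(fields):
        return None  # malformed record, skip it
    return match.group("name"), dict(zip(fields, values))


syslog_line = "Mar 26 12:00:00 host svc[123]: " + format_message(
    "UserLogin", "alice", 1364299200
)
parsed = parse_line(syslog_line)
print(parsed)
```

In a setup like this, adding a new kind of data means adding one entry to the message definitions and one logging call in the service; unrelated syslog lines are simply ignored by the parser.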

Becoming Data Driven

My experience at Spotify is a perfect example of how simple this is and shows how any engineer can make a meaningful impact.

Shortly after I joined Spotify, we decided as a company that we wanted to send users emails telling them if their friends joined and if new songs were added to a playlist they subscribed to.  The hypothesis we wanted to test was that sending these emails would have a positive impact on user engagement and help more users come back to the app more often.

So… we needed a transactional email system.  I took this project on as an opportunity to learn Python. With the help of a few other engineers, we built a fairly simple system that had the ability to deliver a lot of emails and also provided a way for people to create new email templates and A/B test different versions of an email template.

Within a few weeks we knew which email templates worked best and, more importantly, we could see the impact these email campaigns had on our users.  We could clearly see that these emails were having a positive effect on user engagement.

So, how did we know the effect these emails had on users?

This backend system for sending emails would simply log a message every time an email was sent with the fields (username, timestamp, email-campaign, campaign-version).

Once this data made its way into HDFS, we had all the data we needed to determine the best performing email template for a campaign and we could track the effect a single email had on a user’s experience. We were able to see if an email had any effect on your listening habits, your account status and so on.
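A simplified Python sketch of this kind of comparison is shown below. It uses the four logged fields from above, but the rest is assumed for illustration: the sample records, the `returned_users` set (standing in for whatever engagement signal is joined in from other data), and the `return_rate_by_version` helper are hypothetical, and the real analysis ran as jobs over HDFS rather than in-memory Python.

```python
from collections import defaultdict


def return_rate_by_version(sent_records, returned_users):
    """Share of emailed users who came back, per (campaign, version)."""
    sent = defaultdict(int)
    returned = defaultdict(int)
    for record in sent_records:
        key = (record["email-campaign"], record["campaign-version"])
        sent[key] += 1
        if record["username"] in returned_users:
            returned[key] += 1
    return {key: returned[key] / sent[key] for key in sent}


# Made-up sample data, one dict per logged "email sent" message.
sent_records = [
    {"username": "alice", "timestamp": 1, "email-campaign": "friend-joined", "campaign-version": "A"},
    {"username": "bob", "timestamp": 2, "email-campaign": "friend-joined", "campaign-version": "A"},
    {"username": "carol", "timestamp": 3, "email-campaign": "friend-joined", "campaign-version": "B"},
    {"username": "dave", "timestamp": 4, "email-campaign": "friend-joined", "campaign-version": "B"},
]
# Hypothetical engagement signal: users seen in the app after the email.
returned_users = {"alice", "carol", "dave"}

rates = return_rate_by_version(sent_records, returned_users)
best = max(rates, key=rates.get)
print(rates, best)
```

With real data one would also want a control group and a significance test before declaring a winner; the sketch only shows the grouping step.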

Powerful stuff.  This data is very much still in use today.

Remove Bias, Acquire Data

Spotify strives to be entirely data driven. We are a company full of ambitious, highly intelligent, and highly opinionated people and yet as often as possible decisions are made using data. Decisions that cannot be made by data alone are meticulously tracked and fed back into the system so future decisions can be based off of it.

How fantastic is that?  Sounds robotic, but humans cannot be trusted so it’s cool.

So the conclusion is to rely on data whenever possible.  Don’t have enough data?  Get more.  Make data the most important asset you have because it is the only reliable decision maker that can scale your company.

Snakebite: a pure Python HDFS client

As we all know, Hadoop is great and here at Spotify we are big fans of it. We use it to process data for a lot of different purposes like business intelligence, recommendations and reporting. But even though Hadoop is great at crunching data, interacting with it can be hard sometimes. For example, creating complex data pipelines is non-trivial and for that we created luigi.

Another annoyance we had with Hadoop (and in particular HDFS) is that interacting with it is quite slow. For example, when you run `hadoop fs -ls /`, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the communication with the NameNode is done, before the result is displayed. This takes at least a couple of seconds and can become slightly annoying. This gets even worse when you do a lot of existence checks on HDFS; something we do a lot with luigi, to see if the output of a job exists.

On the programmatic side, there are a few workarounds that we can take. One is using HttpFs. This allows you to make REST calls over HTTP to retrieve information from HDFS, but this involves having yet another service running. And there is no nice command line interface for it either.
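For reference, HttpFs speaks the same REST dialect as WebHDFS, so a request is just a URL of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The Python sketch below only builds such a URL and does not send it. The host name and user are made up, `httpfs_url` is a hypothetical helper, and port 14000 is assumed as the usual HttpFs default; check your own deployment.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, port=14000, user=None):
    """Build a WebHDFS-style URL for an HttpFs gateway (no request is sent)."""
    query = {"op": op}
    if user:
        query["user.name"] = user  # pseudo-authentication parameter
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(path), urlencode(query)
    )


# Hypothetical host and user, for illustration only.
status_url = httpfs_url(
    "httpfs.example.com", "/data/logs", "GETFILESTATUS", user="wouter"
)
print(status_url)
```

An existence check then amounts to issuing a GET on that URL and looking at the HTTP status, which is still one network round-trip plus an extra service to operate.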

Another option is to use libhdfs, a C API for Hadoop, but the downside is that it still starts a JVM process. And if you want to use this from a language other than C (in our case Python), you’ll have to write bindings for it.

So, to circumvent slow interaction with HDFS and to have a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to Open Source it at http://github.com/spotify/snakebite.

To show that it’s (much) faster, I ran a simple test against our production cluster:

wouter@foo:~$ time for i in {1..10}; do hadoop fs -ls / > /dev/null; done

real	0m14.464s
user	0m21.761s
sys	0m1.148s

wouter@foo:~$ time for i in {1..10}; do snakebite ls / > /dev/null; done

real	0m1.639s
user	0m1.072s
sys	0m0.160s
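To put those wall-clock numbers in perspective, here is the arithmetic as a few lines of Python. The only inputs are the two `real` timings above and the loop count of 10; the variable names are just for illustration.

```python
RUNS = 10  # each loop above issued ten listings

hadoop_total = 14.464    # "real" seconds for `hadoop fs -ls /`
snakebite_total = 1.639  # "real" seconds for `snakebite ls /`

hadoop_per_call = hadoop_total / RUNS        # about 1.45 s per listing
snakebite_per_call = snakebite_total / RUNS  # about 0.16 s per listing
speedup = hadoop_total / snakebite_total     # roughly 8.8x

print(round(hadoop_per_call, 2), round(snakebite_per_call, 2), round(speedup, 1))
```

This is a single run against one cluster, so treat the ratio as indicative rather than as a benchmark.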

Snakebite currently contains a Python library (client.py), a command line client (bin/snakebite) and a mini cluster wrapper (minicluster.py). Since we wanted to have real integration tests, we wrote a wrapper around Hadoop’s minicluster that is started before tests are executed, but it might be useful in other scenarios as well.

Snakebite currently supports only actions that involve just the NameNode (like ls, rm, mv, stat, etc), but there are plans to implement actions that involve interaction with the DataNode as well.

The Snakebite repository can be found at http://github.com/spotify/snakebite and documentation at http://spotify.github.io/snakebite/

sthlm.js #7 @ Spotify

We recently hosted the seventh sthlm.js meetup at our office and Paul Lewis of Google Chrome, Robert Nyman of Mozilla and our very own Mattias Petter Johansson graciously agreed to give talks about topics they each feel passionate about.

At Spotify we are all about openness and sharing, so we recorded the talks for you. Now those of you who couldn’t make it can still take part in the meetup retroactively.

We who work here make use of JavaScript and web technologies in all kinds of environments, so for us it was great to learn about what the future holds in these areas. Hope the rest of you will enjoy it as much as we did.

Without further ado – here are the recordings. Enjoy!

Functional reactive programming (FRP) is a declarative approach to GUI design. The term declarative makes a distinction between the “what” and the “how” of programming. A declarative language allows you to say what is displayed, without having to specify exactly how the computer should do it.

Cheers!
Mattias Petter Johansson

Tools not rules. Really, just that. As Jake Archibald said to me when I was preparing the talk, many people want a quick fix to the problem of performance, just like a weight loss pill, but in reality what they really need is exercise. I’m advocating for the practice of using profilers, understanding what our code is doing, and fixing our actual bottlenecks.

Feel free to check out the slides, but you will need to watch the video to see the “how to” bits.

Cheers!
Paul Lewis

We live in a world of walled gardens in the mobile sector, with a few major players having control over a vast majority of it. What Mozilla aims to do with Firefox OS is bring a low-cost smartphone to emerging markets and to people who have mostly only had feature phones before. It’s also about offering an option to developers and empowering them to reuse their existing HTML5 skills and mobile web apps on mobile phones, without the need to learn another programming language or environment.

Then, if you want to optimize your apps on Firefox OS, we are working on Open Web Apps [1] and a lot of different WebAPIs [2] to make the web layer a lot more powerful, and with the aim of getting them standardized and implemented by all players. To test all of this, we have a Firefox OS Simulator [3], in the form of an extension to Firefox.

[1] https://hacks.mozilla.org/2013/02/getting-started-with-open-web-apps-why-and-how/
[2] https://hacks.mozilla.org/2013/02/using-webapis-to-make-the-web-layer-more-capable/
[3] https://hacks.mozilla.org/2013/03/firefox-os-simulator-previewing-version-3-0/

Cheers!
Robert

How we use Python at Spotify

The most frequent question we heard at PyCon this weekend was how we use Python at Spotify. Hopefully this post answers the question!

At Spotify the main two places we use Python are backend services and data analysis. Python has a habit of turning up in other random places, as most of our developers are happy programming in it.

Backend services

Spotify’s backend consists of many interdependent services, connected by our own messaging protocol over ZeroMQ. Around 80% of these services are written in Python.

The non-Python services are typically written in Java, although we do have a few using C or C++.

Speed is a big focus for Spotify. Python fits well into this mindset, as it gets us big wins in speed of development. We also make heavy use of Python async frameworks to help services that are IO bound. Earlier services were written using Twisted, and in the last few years we’ve preferred gevent.
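To illustrate why an async framework helps an IO-bound service, here is a small self-contained sketch. It deliberately uses the standard library's asyncio as a stand-in, not Twisted or gevent, so it is not how our services are written. The backend names and the 0.2 second delay are made up, and `asyncio.sleep` stands in for a network round-trip.

```python
import asyncio
import time


async def call_backend(name, delay):
    """Pretend to call another service; the sleep stands in for network IO."""
    await asyncio.sleep(delay)
    return name + ": ok"


async def handle_request():
    # The three calls wait on IO concurrently instead of one after another.
    return await asyncio.gather(
        call_backend("playlist", 0.2),
        call_backend("user", 0.2),
        call_backend("metadata", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(handle_request())
elapsed = time.perf_counter() - start

print(results, round(elapsed, 2))
```

Run sequentially, the three calls would take about 0.6 seconds; run concurrently they take roughly as long as the slowest one, which is the win for services that spend most of their time waiting on other services.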

Some services are compute bound, and we’ve tried a range of strategies for how to handle this in Python. This has included performance testing, profiling, cython, and native libraries.

Data analysis

Spotify teams make heavy use of analytics, both in decision making and within the product itself. To simplify interactions with Hadoop, we use our Luigi package.

Luigi allows you to quickly build complex pipelines of batch jobs from your own machine. It handles the bundling of required libraries, and brings back any error logs to your local machine. This means you can quickly prototype complex data jobs.
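The core idea behind such a pipeline (each job declares what it requires and what it outputs, and a job is skipped if its output already exists) can be sketched in plain Python. To be clear, this is not Luigi's API: the `Task` class, the `build` function and the job names are hypothetical, and local temp files stand in for HDFS.

```python
import os
import tempfile


class Task:
    """Minimal stand-in for a batch job; not Luigi's actual classes."""

    def __init__(self, name, workdir, requires=()):
        self.name = name
        self.requires = list(requires)
        self.output = os.path.join(workdir, name + ".out")

    def complete(self):
        # The existence check: a job counts as done if its output is there.
        return os.path.exists(self.output)

    def run(self):
        inputs = []
        for dep in self.requires:
            with open(dep.output) as handle:
                inputs.append(handle.read())
        with open(self.output, "w") as handle:
            handle.write(self.name + "(" + ",".join(inputs) + ")")


def build(task, executed):
    """Run `task` after its dependencies, skipping anything already complete."""
    if task.complete():
        return
    for dep in task.requires:
        build(dep, executed)
    task.run()
    executed.append(task.name)


workdir = tempfile.mkdtemp()
logs = Task("parse_logs", workdir)
plays = Task("count_plays", workdir, requires=[logs])
toplist = Task("toplist", workdir, requires=[plays])

first_run = []
build(toplist, first_run)   # runs all three jobs in dependency order
second_run = []
build(toplist, second_run)  # nothing to do: the output already exists

print(first_run, second_run)
```

Because every step begins with an existence check on its output, a pipeline with many jobs performs many such checks, which is why the speed of those checks matters in practice.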

We use Luigi, along with a range of machine learning algorithms, to power our Radio and Discover features, as well as recommendations for people you may want to follow. Simpler jobs power things like our top lists.

Around 90% of our map reduce jobs are written in Python. When it’s going all out we have seen over 6000 Python processes running over the hundreds of nodes in our Hadoop cluster.

Other uses

Spotify squads often use GraphWalker to do model based testing of both user facing clients as well as some APIs. To simplify the integration with our Python services, we ported the GraphWalker runner to Python.

Python is also used for prototyping services, quick scripts, build processes and more. There is even a Django app or two!

Community

Part of what makes Python so special is the community around it. Spotify is involved in the community in a number of ways.

We sponsor conferences such as PyCon and EuroPython, provide support to local groups such as the Stockholm Python User Group and NYC PyLadies, host hackathons and contribute back to open source projects.

We are always interested in doing more for the community, so please get in touch if there is something we can help with.

Our team had a heap of fun at the recent PyCon and PyData. It was my first, and I had an amazing time meeting so many people and learning from both the talks and the hallway track!

If you’d like to work with Python at Spotify, we’re hiring in New York, Stockholm, San Francisco and Gothenburg. Or just drop by one of our offices and say hi!

Agile à la Spotify

Summary

At Spotify we have our own “Agile à la Spotify” manifesto to create alignment and direction for our improvement work. This blog post describes the background for creating the document, what it is and how we are using it.

Background

A little over a year ago my colleague Karin Björkén woke up early one morning and started writing “an agile manifesto for Spotify”. After observing and talking to different teams in different locations, it had become clear that we did not have an aligned view on what agile means to us as a company. We prided ourselves on our agile culture, but at the end of the day we weren’t really sure that we all understood it to mean the same thing. Other agile coaches had made similar observations, so we loved the idea and wanted to work together on it.

We felt that the original Agile Manifesto wasn’t concrete enough, didn’t really paint an inspiring picture, and was not specific enough to our current context. We wanted something that could be used for guidance in our daily work, whether you’re an engineer or a manager, and that could be used in onboarding of new employees. We also wanted it to resolve some of the specific confusion over agile we had identified at the start.

With all this in mind we felt it had to be short and to the point, but still concrete enough for a n00b to understand what it meant in practice. After a couple of workshops and a few rounds of feedback, we came up with one short version (“the skinny”) and a kind of appendix on what it means for the Spotify culture. To avoid confusion we decided to call it “Agile à la Spotify”.

“Agile à la Spotify” is only one way of expressing our agile values and our culture. We also have an organizational design with the autonomous squad as a core concept. We do quarterly and bi-annual surveys with each squad to learn how we can support them. Some of the early suggestions for what to include in “Agile à la Spotify” were removed because they were already expressed in this survey. More on that in another blog post.

Agile à la Spotify (the skinny)

Continuous improvement

At Spotify, part of my work is to look for ways to continuously improve, both personally, and in the wider organisation.

Iterative development

Spotify believes in short learning cycles, so that we can validate our assumptions as quickly as possible.

Simplicity

Scaling what we do is key to Spotify’s success. Simplicity should be your guidance during scaling. This is as true for our technical solutions, as for our methods of working and organising the organisation.

Trust

At Spotify we trust our people and teams to make informed decisions about the way they work and what they work on.

Servant leadership

At Spotify managers are focused on coaching, mentorship, and solving impediments rather than telling people what to do.

What does this mean for the Spotify culture?

Continuous improvement

We can all contribute by:

  • Taking part in regular squad and project retrospectives
  • Being willing to experiment and trying new things
  • Promoting a culture of no blame and no fear
  • Seeing each change as an opportunity for improvement

Iterative development

We will:

  • Identify the smallest, simplest step that will help us learn
  • Demo and release often
  • Use A/B testing and other ways of gaining data-driven insight to verify our assumptions
  • Hypothesize, measure, analyze, learn, adjust

Simplicity

Remember that:

  • Simplicity allows for transparency, reuse and easy knowledge transfer
  • Removing complexity takes time and needs to be done iteratively
  • It’s important to have the hard discussions on how to do things simply
  • We favor direct communication and avoid unnecessary layers in communication
  • We don’t over-engineer and we don’t cut corners
  • We strive to avoid single points of failure

Trust

People are trusted to:

  • Support and make decisions to make Spotify a success
  • Politely question each other to improve Spotify
  • Figure out the best way to work with each other (which process to use)
  • Find their own solutions to complex problems
  • Identify and attack problems themselves, rather than seeing them as someone else’s responsibility

Servant leadership

We would like to see that:

  • Decisions are transparent
  • Managers encourage collaboration to solve problems rather than dictating a solution
  • Managers help to address impediments that the squad/chapter cannot solve themselves
  • You have regular one-on-one coaching and mentorship time with your manager
  • People development happens alongside Spotify’s success.

Impact

Agile à la Spotify has since been discussed in our tribes and tribe management groups, endorsed by our CTO (who was also active in developing it), used in onboarding, and discussed by new agile coaches and other colleagues. I don’t have the data to claim that it has made an enormous impact, but I believe the work of developing the document created alignment and an important communication tool for many culture ambassadors within Spotify. I believe that this, together with numerous other efforts, has helped bring Spotify closer to the vision outlined in Agile à la Spotify. And I hope we will continue to use it for this purpose.

Feedback

We would love to hear your thoughts on this. Maybe you have done something similar in your workplace or are thinking about doing it? Please share your thoughts and comments below.

Backend infrastructure at Spotify

Introduction

In this blog post I will give an overview of how we are building our backend infrastructure at Spotify. Our backend infrastructure is very much a work in progress – in some areas we have come a long way and in others we have just started.

In order to understand why we are building this infrastructure we need to cover some background on how Spotify’s development organization works. We are currently around 300 engineers at Spotify – and we are growing rapidly.

Background

Growth – At Spotify, things grow all the time. The number of daily users, the number of backend nodes that power the service, the number of hardware platforms our clients run on, the number of development teams that work with our products, the number of external apps we host on our platform, the number of songs we have in our catalogue.

Speed – As we grow we have to navigate carefully around a lot of things that could bring our development speed down. We make great efforts to eliminate dependencies between teams, and remove unnecessary complexity from our architecture.

Autonomous squads – One key concept at Spotify is that each development team should be autonomous. A development team (or ‘squad’ in Spotify lingo – see http://blog.crisp.se/2012/11/14/henrikkniberg/scaling-agile-at-spotify) should always be able to move independently of other squads. Even if there is a dependency between two squads there is always a way for the dependent squad to move forward. To enable all squads to make progress even though they have dependencies on another squad we have a few strategic ideas that we try to apply everywhere: our transparent code model and our self service infrastructure.

Transparent code model – All Spotify code is available to all developers in a transparent code model. This means that all code in the Spotify client, Spotify backend and Spotify infrastructure is available to all the developers at Spotify to read or change. If a squad is blocked waiting for another squad to change some code, it always has the option to go ahead and make the change itself.

In practice, Spotify’s transparent code model works because all teams share the same centralized git server. Each git repo has a dedicated system owner that takes care of the code and makes sure it does not rot. The transparent code model makes sure that everyone can make progress all the time, and that everybody has access to everybody’s code. This keeps Spotify going forward all the time and gives a positive and open work environment.

Self service infrastructure – All infrastructure that is needed should be available as a self service entity. That way, there is no need to wait for another team to get hardware, set up a storage cluster or make configuration changes. The Spotify backend infrastructure is built from several layers of hardware and software, ranging from physical machines to messaging and storage solutions.

Open source – We try to use open source tools whenever possible. Since Spotify is constantly pushing the scalability limits of the software we are using in our backend, we need to be able to improve the software we use in critical areas. We have contributed to many of the open source projects we use, for example Apache Cassandra and ZMQ. We use almost no proprietary software, simply because we cannot trust that we will be able to tailor it to our ever growing needs.

Culture – At Spotify we believe strongly in empowered individuals. We reflect this in our organization with autonomous teams. For engineers there are many possibilities to move and try working in other areas inside Spotify to ensure that everybody stays passionate about their work. We have hackdays regularly where people can try out pretty much any idea they have.

Architecture

Any architecture that needs to handle the volume of users that Spotify has needs to partition the problem. The Spotify architecture partitions the problem in several different ways. The first is partitioning by feature. A slightly oversimplified description is that all the physical screen area of all the pages and views in our clients is owned by some squad. All of the features in the Spotify clients belong to a specific squad. The squad is responsible for that feature across all platforms – all the way from how it appears on an iOS device or a browser, via the real time requests handled by the Spotify backend, to the batch oriented data crunching that takes place in our Hadoop cluster to power features like recommendations, radio and search.

If one feature fails, the other features of our clients are independent and will continue to work. If there is a weak dependency between features, failure of one feature may sometimes lead to degradation of service of another feature, but not to the entire Spotify service failing.
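
As a rough illustration of what a weak dependency can look like in code, here is a small sketch. The function names and the page layout are invented for this example and do not describe any real Spotify service. The page treats playlists as a strong dependency and recommendations as a weak one, so a failure in the recommendations feature degrades the page instead of breaking it.

```python
from typing import Callable, Dict, List


def render_start_page(fetch_playlists: Callable[[], List[str]],
                      fetch_recommendations: Callable[[], List[str]]) -> Dict[str, object]:
    """Build a (hypothetical) start page from two independent features."""
    # Strong dependency: if this raises, the caller sees the failure.
    page: Dict[str, object] = {"playlists": fetch_playlists(), "degraded": False}

    # Weak dependency: on failure, fall back to an empty section and carry on.
    try:
        page["recommendations"] = fetch_recommendations()
    except Exception:
        page["recommendations"] = []
        page["degraded"] = True
    return page


def working_playlists() -> List[str]:
    return ["Morning run"]


def failing_recommendations() -> List[str]:
    raise RuntimeError("recommendations backend is down")


page = render_start_page(working_playlists, failing_recommendations)
```

Here the page still contains the playlists, the recommendations section is empty, and the degraded flag records that something was left out.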

Since all the users are not using all the features at the same time, the number of users that has to be handled by the backend of a particular feature is typically much smaller than the number of users of the entire Spotify service.

Since all the knowledge around one particular feature is concentrated to one squad it is very easy to A/B test features, look at the data collected and take an informed decision with all the relevant people involved.

Feature partitioning gives scalability, reliability and an efficient way of focusing team efforts.

Backend Infrastructure

After partitioning our problem by feature, and giving a highly skilled cross functional squad the mission to take care of and work with that feature, the question now becomes: how do we build infrastructure that supports that squad efficiently?

How can we make sure that the team can develop their features at breakneck speed without risking being blocked by other teams? How can our infrastructure solve the hard problems around scaling globally? I’ve already talked about our transparent code model, which always allows a team to move forward, but there are other parts of the organization apart from the feature development squads.

In many organizations you have database administrators that take care of databases and their schemas, and you typically have to go through an operations department to get hardware allocated in data centers, etc. These special functions in the organization become bottlenecks when there are 100 squads simultaneously demanding their services. To solve this we are developing a backend infrastructure at Spotify that is fully self service. Fully self service means that any squad can start developing and iterating on a service in the live environment without having to interact with the rest of the organization.

To achieve this, we’ve needed to solve a range of problems across several different areas. I will cover a few important ones here.

Provisioning – When developing a new feature a squad typically needs to deploy its service in several locations. We are building infrastructure to enable the squad to decide for itself whether the service should be deployed in Spotify’s own data centers or if the feature can use a public cloud offering. The Spotify infrastructure strives to minimize the difference between running in our own data centers and on a public cloud. In short, you get better latency and a more stable environment in our own data centers. On a public cloud you get much faster provisioning of hardware and much more dynamic scaling possibilities.

Figure: Spotify clients connecting to their closest datacenter.

Storage – Most features require some sort of storage, obvious examples being playlists and the “follow” feature. Building a storage solution for a feature that millions of people will use is not an easy task, and there are a lot of things that have to be considered: Access patterns, failover between sites, capacity, consistency, backups, degradation in the case of a net split between sites etc. There is no easy way to fulfill all those requirements in a generic way. For each feature the squad will have to create a storage solution that fits the needs of that particular service. The Spotify infrastructure offers a few different options for storage: Cassandra, PostgreSQL and memcached.

If the feature’s data needs to be partitioned, the squad has to implement the sharding themselves in their services. Many services, however, rely on Cassandra fully replicating data between sites. Setting up a full storage cluster with replication and failover between sites is complicated, so we are building infrastructure to set up and maintain multi-site Cassandra or PostgreSQL clusters as one unit. For people building apps on the Spotify API there will be a storage-as-a-service option that will not require setting up any clusters. The storage-as-a-service option will be limited to a very simple key-value store.
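
For readers who have not implemented sharding in a service before, here is a minimal sketch of one common approach: hashing a key to pick a shard. The host names and function names are hypothetical, and this is not a description of how any particular Spotify service partitions its data.

```python
import hashlib
from typing import List

# Hypothetical shard hosts for an imaginary playlist service.
SHARDS: List[str] = ["playlist-db-0", "playlist-db-1", "playlist-db-2", "playlist-db-3"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings, so a
    digest from hashlib is used instead.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def host_for(user_id: str) -> str:
    """Return the shard host that stores data for this user."""
    return SHARDS[shard_for(user_id, len(SHARDS))]


host = host_for("user-12345")
```

A plain modulo scheme like this moves most keys when the number of shards changes, which is one reason to consider consistent hashing if the cluster is expected to grow.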

Messaging – Spotify clients and backend services communicate using the following paradigms: request-reply, messaging and pubsub. We have built our own low latency, low overhead messaging layer and are planning to extend it with high delivery guarantees, failover routing and more sophisticated load-balancing.
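
To show the difference between request-reply and pubsub, here is a tiny in-process sketch. It is only an illustration of the two paradigms: the Broker class, the service name and the topic name are invented, and it says nothing about how our actual messaging layer is implemented.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class Broker:
    """Toy in-process broker supporting request-reply and publish-subscribe."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    # Request-reply: exactly one handler answers, and the caller gets the reply.
    def register(self, service: str, handler: Callable[[str], str]) -> None:
        self._handlers[service] = handler

    def request(self, service: str, payload: str) -> str:
        return self._handlers[service](payload)

    # Pubsub: any number of subscribers receive the message, nobody replies.
    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        callbacks = list(self._subscribers[topic])
        for callback in callbacks:
            callback(message)
        return len(callbacks)  # number of subscribers the message was delivered to


broker = Broker()
broker.register("greeter", lambda name: "hello " + name)
reply = broker.request("greeter", "squad")

received: List[str] = []
broker.subscribe("playlist-updated", received.append)
broker.subscribe("playlist-updated", lambda msg: received.append(msg.upper()))
delivered = broker.publish("playlist-updated", "p1")
```

The request goes to a single named service and returns its answer, while the published message fans out to every subscriber of the topic.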

Capacity planning – The growth of Spotify drives a large amount of traffic to the backend. Each squad has to make sure that their features always scale to the current load. The squad can choose to keep track of this manually by monitoring traffic to their services, identifying and fixing bottlenecks, and scaling out as needed. We are also building an infrastructure that allows squads to scale their services automatically with load. Automatic scaling typically only works for bottlenecks that you are aware of, so there is always a certain level of human monitoring that the squad needs to handle. Our infrastructure allows the easy creation of graphs and alerts to support this.
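
The arithmetic behind scaling with load can be sketched in a few lines. This is a simplified example under stated assumptions, not our actual scaling logic: the function name, the 70% target utilization and the minimum of two instances are all made up for illustration.

```python
import math


def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      target_utilization: float = 0.7,
                      min_instances: int = 2) -> int:
    """Return how many instances are needed to serve the load with headroom.

    capacity_per_instance is the request rate one instance can sustain;
    target_utilization is the fraction of that capacity we are willing to use.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_capacity = capacity_per_instance * target_utilization
    needed = math.ceil(requests_per_second / usable_capacity)
    # Never go below the floor, so a quiet period does not remove redundancy.
    return max(min_instances, needed)


peak = desired_instances(requests_per_second=1000, capacity_per_instance=100)
quiet = desired_instances(requests_per_second=50, capacity_per_instance=100)
```

Note that the calculation only covers the bottleneck it models (here, request rate per instance), which is why the human monitoring mentioned above is still needed.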

Insulation from other services – As new features and services are developed they tend to call each other in non-trivial ways. It is very important that all squads feel that they can run at full speed while minimizing the risk of negatively affecting other parts of Spotify. To limit that risk, our messaging layer has a rate limit and permissions system. Rate limits have a default threshold, which allows squads to call other services to try something out. If exceptionally heavy traffic is anticipated, the squads involved need to coordinate and agree on how to handle it together. Different features are always run on separate servers or virtual machines so that one misbehaving service cannot take down another.
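
One common way to combine a default rate limit with a permissions check is a token bucket per caller. The sketch below uses that technique purely as an illustration: this post does not say which algorithm our messaging layer uses, and the class names, caller names and default values are all hypothetical.

```python
import time
from typing import Callable, Dict, Set


class TokenBucket:
    """Allow `rate` calls per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CallGuard:
    """Permission check plus a default per-caller rate limit for one service."""

    def __init__(self, allowed_callers: Set[str], default_rate: float = 5.0,
                 default_burst: int = 5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.allowed = set(allowed_callers)
        self.default_rate = default_rate
        self.default_burst = default_burst
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, caller: str) -> bool:
        if caller not in self.allowed:
            return False  # no permission to call this service at all
        if caller not in self.buckets:
            self.buckets[caller] = TokenBucket(self.default_rate,
                                               self.default_burst, self.clock)
        return self.buckets[caller].allow()
```

A squad trying something out stays within the default threshold without asking anyone, while a caller that exceeds it is rejected until its bucket refills. Raising the limit for a specific caller would be the point where squads coordinate.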

Wrap up

As I mentioned in the beginning of this post, a lot of this is work in progress, and there are a lot of very interesting challenges coming our way. The view I’ve presented here represents a snapshot of how we see things at Spotify right now, and since we are addicted to continuous improvement, tomorrow we may well have changed some things…

And of course, if you feel this is interesting, have a look at our open positions.