Generating test cases so you don’t have to

In my spare time, I have developed a C++ framework for property based testing called RapidCheck. As my Hack Week project, I wanted to see if it could be a applied to the code I deal with at work: the Spotify player code. “But hey, aren’t there lots of C++ testing frameworks out there already” you say, “Do we really need another one?”. This one is a bit different and to explain why, let’s talk about tests.

“But I already have unit tests!”

Do your unit tests look anything like this?:

BOOST_AUTO_TEST_CASE(correctlyParsesNumbers) {
  BOOST_CHECK(parseSignedInteger("0") == 0);
  BOOST_CHECK(parseSignedInteger("3") == 3);
  BOOST_CHECK(parseSignedInteger("15") == 15);
  BOOST_CHECK(parseSignedInteger("1337") == 1337);
}

This is a typical unit test that tests a simple unsigned integer parser. It has four test cases that test that we get the correct output given a particular input. First we test with "0" which makes sense because that might be seen as the trivial case, gotta get that right. We then then we test with "3" because that’s a simple but non-trivial input. We also add a test for "15" since we need to ensure that we treat numbers in different positions correctly. However, only three test cases seem a bit weak so we add a fourth one for good measure. But why only four test cases? Is that enough? How much is enough? Shouldn’t we at least cover all the powers of ten that fit into the target data type? What about overflow? Are there any optimizations that introduce weird corner cases? How much testing is enough? If we assume that we are dealing with a 32-bit unsigned integer datatype, there are 4,294,967,296 valid inputs so exhaustive testing is not an option. We could double the number of test cases but that’s not really much of an improvement in the grand scheme of things. Besides, we all know that writing unit tests is boring. There, I said it. When reaching over 10 test cases, most of us will probably say “Whatever, it seems to work, let’s Ship It™”. It will probably pass code review and the code coverage report will likely be as good as ever.

Property based testing

In 2000, John Hughes and Koen Clasessen invented the phenomenal QuickCheck, a framework to solve this exact problem for the Haskell programming language. In QuickCheck, you specify properties of your code that should hold regardless of input. QuickCheck then randomly generates the input data to try to find a case for which the property fails. When such a case is found, it tries to find a minimal test case by shrinking the input data. This minimal test case is then reported for easy reproduction. This form of testing became known as property based testing and is actually the dominant form of testing in the Haskell ecosystem. Fascinated by the concept and surprised that nobody had made a worthy alternative for C++, I decided to create RapidCheck. RapidCheck is a complete implementation of property based testing for C++11 that implements all of the significant features of the original Haskell version and even more. Using RapidCheck we can rewrite the example above:

rc::check("correctly parses numbers", [](unsigned int x) {
  RC_ASSERT(parseUnsignedInteger(std::to_string(x)) == x);
});

Assuming we have a bug in parseUnsignedInteger, RapidCheck might print the following:

- correctly parses numbers
Falsifiable after 35 tests and 3 shrinks

std::tuple<unsigned int>:
(1000)

../examples/mapparser/main.cpp:25:
RC_ASSERT(parseUnsignedInteger(std::to_string(x)) == x)

Expands to:
0 == 1000

Apparently, we should have tried 1000 as input in our original unit test! This form of testing has several advantages:

  • It frees you from having to manually specify the test cases, the framework generates them for you.
  • It can find bugs for inputs that you didn’t even consider, it is not biased. – It can give more coverage with less code.
  • Encourages you to think about the complexity of its behavior – simple code, simple properties.
  • Encourages totality/robustness – Instead of filtering out unwanted input, write your code so that it handles everything.

“Sure, but my types are not just ints!”

In the example above, the generated input is an unsigned integer, a very simple value. However, real world code does not deal with just integers. Also, you will probably want more control over what input is generated in many cases, not just any integer will do. RapidCheck fortunately has a solution for all of this through the concept of “generators”. Generators are the source of all generated input in RapidCheck. A generator is an object of type Gen<T> where T is the type of the values that it generates. RapidCheck comes packed out-of-the-box with generators for most common types in C++ and the standard library. It also provides combinators that allow you to build complex generators from simples ones. To start with an example, rc::gen::inRange(0, 2000) returns a generator of integers (i.e. Gen<int>) between 0 and 2000. We can pick a value from this generator using the *-operator like this:

rc::check("correctly parses numbers", [] {
  const auto x = *rc::gen::inRange(0, 2000);
  RC_ASSERT(parseUnsignedInteger(std::to_string(x)) == x);
});

We can also create an std::vector of integers between 0 and 2000:

const auto ints = *rc::gen::container<std::vector<int>>(
                       rc::gen::inRange(0, 2000));

Or how about a generator for our own domain specific object type using the gen::build combinator:

using namespace rc;
const auto person = *gen::build(
    // Construct with ID, constructor takes std::size_t
    gen::construct(
        gen::arbitrary<std::size_t>()),
    // Pick a first name, Person class has a setter
    gen::set(&Person::setFirstName,
        gen::element("Joe", "John", "Jane", "Kate")),
    // Pick an age between 0 and 100
    gen::set(&Person::setAge, gen::inRange(0, 100)));

RapidCheck allows you to create generators for very complex types and you also get the test case shrinking for free when you use the built in combinators.

“Right, but my code has tons of state!”

In our first example, we were testing a pure function, a function that takes some input and produces output that only depends on the input. Unfortunately, real world code does not always consist of pure functions, it is littered with hidden state that is hard to manipulate directly. But even if our code is not a function, our test can still be. We only have to reconsider our notion the input is. RapidCheck includes a model based testing framework inspired by that of the Erlang version of QuickCheck. When using this, the generated input is not data but a sequence of operations to perform on the System Under Test. We also build a simplified model of the System Under Test that we can verify against while performing these operations. Thus, what the developer needs to implement are essentially two things: the model and the operations that can be performed on the model and the System Under Test. RapidCheck will then generate a valid (according to the model) sequence of commands and try to apply it to the System Under Test. If the post-conditions of any command fails, RapidCheck will, just like above, try to find minimal test case consisting of as few commands as possible with as simple parameters as possible.

rapidcheck

Thanks to Mats Lindelöf for the image

This type of testing is what I applied to the Spotify player.

Let there be bugs!

Spotify clients have a shared module that handles the sequencing of tracks (not the actual audio playback) including shuffling, repeating tracks, repeating playlists and so for. For the purposes of this article, let’s call it “the player”.

When it comes to the player, it turns out that the basic functionality does not require a lot of state to model. Here is the model that I used:

struct SessionModel {
  Context context;
  std::vector<ContextTrack> tracks;
};

struct PlayerModel {
  bool is_playing;
  bool is_paused;
  bool repeating_context;
  bool repeating_track;
  bool first_repeat;

  using SessionModelSP = boost::shared_ptr<const SessionModel>;
  std::vector<SessionModelSP> sessions;
  int session_index;
  int track_index;
};

Of course, the actual player has a lot more state than this but it also does a lot of things that we did not model here. The other part that we need to provide is all of the operations that we want to test. In RapidCheck, the operations are implemented as subclasses of the Command template. Here is the implementation of the preparePlay operation which prepares playback of a playlist (possibly prefetching etc.):

using PlayerCommand = state::Command<PlayerModel, PlayerSystem>;

struct PreparePlay : PlayerCommand {
  Context context =
    *gen::build<Context>(gen::set(&Context::entity_uri),
        gen::set(&Context::metadata),
        gen::set(&Context::pages,
            gen::map(genContextPage(gen::arbitrary<ContextTrack>()),
            [](ContextPage &&page) {
              return std::vector<ContextPage>{std::move(page)};
            })));

  PlayOrigin play_origin = *gen::arbitrary<PlayOrigin>();
  PreparePlayOptions options = *genPreparePlayOptions(context);

  void apply(PlayerModel &s0) const override {
    s0.sessions.push_back(boost::make_shared<SessionModel>(context));
  }

  void run(const PlayerModel &s0, PlayerSystem &sys) const override {
    sys.sessions.push_back(
        sys.player->preparePlay(context, play_origin, options));
  }
};

As we can see above, a command must implement two operations: apply and run. apply performs the operation on the model and run applies the operation on the actual System Under Test. In this case, the system is a fixture that has the player and all the related objects and mocks required for it to function. This operation also has three parameters provided as member variables of the command, context, play_origin, and options. These are all generated by RapidCheck and if the test case is shrunk, they will be shrunk too. Here is another command:

struct Pause : PlayerCommand {
  void apply(PlayerModel &s0) const override {
    RC_PRE(s0.is_playing);
    RC_PRE(!s0.is_paused);
    s0.is_paused = true;
  }

  void run(const PlayerModel &s0, PlayerSystem &sys) const override {
    EXPECT_CALL(*sys.track_player, pause()).Times(1);
    sys.player->pause(PauseOptions());
  }
};

This command uses the RC_PRE macro to assert preconditions for the command in the apply method. If the preconditions are not met, RapidCheck will discard the command and try another one. We also set a Google Mock expectation in the run method. If this expectation fails, RapidCheck will detect that as a failure and then try to find a minimal command sequence that still fails.

…and there were bugs

It’s not always that you’re happy to find bugs in production code but in this case, I made some very pleasant discoveries. I found a total of three bugs but one of them really stood out. RapidCheck printed the following:

Using configuration: seed=11317088442510877731
Falsifiable after 76 tests and 30 shrinks

std::vector<std::shared_ptr<const Command<spotify::player::PlayerModel, spotify::player::PlayerSystem>>>:
ToggleRepeatingContext()
PreparePlay({
  "tracks": [{
      "uid": "a",
      "uri": "spotify:track:aaaaaaaaaaaaaaaaaaaa"
  }]
})
Play(0)
AddTrack(0, 0, {
  "uid": "b",
  "uri": "spotify:track:aaaaaaaaaaaaaaaaaaaa"
})
SkipToNextTrack()

../spotify/player/cpp/properties/main.cpp:94:
RC_ASSERT(track == expected_track)

Expands to:
{
  "uid": "a",
  "uri": "spotify:track:aaaaaaaaaaaaaaaaaaaa"
} == {
  "uid": "b",
  "uri": "spotify:track:aaaaaaaaaaaaaaaaaaaa"
}

RapidCheck first prints out the counterexample which in this case is a textual representation of the command sequence that caused the failure. Then comes the assertion that triggered the failure. For someone not familiar with the player API, these were the reproduction steps:

  1. Turn on playlist repeat.
  2. Prepare playback of a playlist with a single track, let’s call this track A.
  3. Play the prepared playlist.
  4. Add a single track to the start of the playlist.
  5. Skip to the next track.

Since we have repeat enabled, we expect to wrap around to the newly added track when we advance from the first track. However, instead we get the same track again. This was even easily reproducible in both the iOS and Android clients!

Final thoughts

This was the first time I actually tried RapidCheck on a more complex real world system so I had no idea what to expect. I was very happy to see that it was a great success. While writing these tests, I did make a few observations:

  • Developing a model of the player made me really think about how the player is supposed to work since every misinterpretation results in a test failure. Because of this, every design decision and every peculiarity in the behavior is very evident in the model.
  • Simple bugs are almost always found after very few tests.
  • Test case shrinking is a really important feature. If shrinking is disabled, the counterexample for the bug above is huge and there is no way to know what is wrong just by looking at it.

With its roots in pure functional programming, one could get the impression that property based testing is merely a toy for academics. On the contrary, my experience during Hack Week is that it is an excellent tool to find complex bugs in complex real world code, even when that code is written in a programming language as inflexible as C++. This also echoes the takeaway other people at Spotify have had when out trying property based testing for other programming languages (such as the developers of the Spotify playlist system).

What remains to be seen is how this fits into our team’s existing testing workflow and this is something we want to experiment with. The next step is to enable this test in Continuous Integration and expand it to model more of the player state and operations. Regardless, if you care about the quality and correctness of your code, property based testing is definitely a tool that I think you should evaluate.

Emil Eriksson, The Players squad at Spotify

Switching user database on a running system

Switching PostgreSQL with Cassandra

Switching databases, Indiana Jones style.

Introduction

All Spotify users are now stored in a Cassandra database instead of Postgres. The final switch was made on May 11th, and we, the team responsible for user login at Spotify, would like to tell you a little bit about this endeavour.

Ode to PostgreSQL

In the past year Spotify has added more than 35 million new active users (20 Million Reasons to Say Thanks). User details, such as username, country, and email, are stored in a user database. Every time a user logs in that database is queried. Every time a user is created, upgrades to premium, accepts a license or connects to Facebook, this user database is accessed. This means that it’s a busy database and a core piece of the Spotify Infrastructure.

Our trusty Postgres database has been serving us well for years but it is now handling a few times the number of records it was originally designed for. And the dataset grows, every day, by an ever increasing rate.

Changing such a piece of infrastructure is scary. At the same time, we really didn’t know for how long Postgres would keep working.

Single point of failure

Our Postgres setup also suffered from another issue. Reads were distributed over all data centers, but writes only took place in London. On a single machine. To quote one of our network engineers, Loke Berne: “From a network operations perspective it’s so nice to get rid of the horror rack once and for all”. He was referring to the rack containing the user database write master. To have a single machine being the master of the user database is not fun. All changes made to user information, such as new accounts being created and users upgrading to premium for 75 million active users was handled by that single poor machine. A failure of that machine would have resulted in all those operations failing until the hot standby machine was promoted to master and traffic was redirected.

The inevitable

As the team responsible for the login service we had known for quite some time that the existing solution would not scale with our user growth. We could feel it in our bones that the old mare might last the winter, but next one could get rough. We all knew it was time to put it down but pulling the trigger was scary.

"You want to do what!?!"

“You want to do what!?!”

The shark attack that started it all

In September 2013 the Atlantic cable between our datacenter in London and the one in Ashburn broke. Rumors said that a shark ate the cable. Regardless of the cause, it resulted in a major drop in new users during a week. If new users could have been created in other sites, this network problem would have been much less of an issue.

We already knew that having a single point of failure in London was a problem, but that week in September made it very clear that it was not only a theoretical failing of our design. The single point of failure had a tangible business cost that could easily be measured in euros and dollars. We had thought about using Cassandra for a long time but had never gotten around to focus on it. It got very clear that it was time to find a solution.

Shark biting cable

Note: Blaming the netsplit on a shark may or may not be purely fictitious. Any connection with this shark and these events are purely coincidental.

Changing Engines on the Running Car

“When you do things right, people won’t be sure you’ve done anything at all.”

– God entity from the “Godfellas” episode of Futurama.

This quote applies quite well when performing infrastructure changes. It’s not always feasible but it’s something to aim for. Data migration can be especially tricky. We didn’t want to shut down the entire system while we migrated as that would break login and account creation for our users. Instead we had to do a seamless transition. This meant running with both storage solutions for a while, using Postgres as master storage while darkloading the Cassandra storage. By darkloading we mean handling production requests in parallel, ignoring the results.

This had multiple benefits:

  • We made sure our new storage solutions had the right capacity for handling the load. We knew the capacity needs for Postgres already, but Cassandra is different so it needed to be measured on its own.
  • We were actually running the Cassandra storage code, so we could find all the bugs, robustness and scalability issues before we made the switch.
  • Since we darkloaded, it was ok for Cassandra to fail. The main process would log the error but ignore the results.
  • We could keep the data in sync between the old and the new storage system.

Migrating existing accounts

In addition to darkloading writes we also had to migrate all existing users. To do this we ran a job that iterated over all users and copied them from Postgres to Cassandra. We had to make sure to minimize race conditions, a darkload write could happen at the same time as an account was migrated.

Because of how Postgres works (replication stops during a long running query) we had to migrate in a special way. We made sure that all writes, both new accounts and account updates, were properly darkloaded before we ran the migration script.

  1. Create a stream of usernames by doing a long running query against a read slave.
  2. For each username:
    1. If the username exists in Cassandra don’t migrate. Remember we assume approximately 100% darkloading.
    2. If the usernames doesn’t exists we want to migrate it.
    3. Query another read slave to get the user data for this username (Remember that replication stopped during a long running query, so querying the first slave would mean that we would risk getting stale data if the user had mutated after the long running query started)
    4. Fetch data from read slave 2 and insert it into Cassandra (from this point and on all mutations that are aimed for this username will end up in both Postgres and Cassandra)
Migration in the good old days.

Migration in the good old days.

Verifying correctness

We also needed a script to verify that the storage systems were in sync. This did a similar iteration of users, fetched from both storages and compared and spat out useful statistics about which kinds of differences we saw along with their frequencies. This turned out to be very useful for finding bugs.

Switching

When doing the actual switch we were helped by the microservices architecture that Spotify employs throughout it’s backend. The idea is that actual calls to the user database is wrapped in a RESTful service. That way only a single service needs have knowledge of the specifics of the storage layer. The actual switch then simply meant:

  1. A configuration change to switch the master / darkload role.
  2. Making sure this new configuration has updated on all service machines.
  3. Restarting all the service instances at the same time.

Restarting all the instances at the same time is important as this minimizes the risk of conflicts. What if we would have some services running with Postgres as master and some with Cassandra as master? We might get conflicting accounts created. The same username could point to different accounts!

Restarting all instances at the same time does have some downsides. During the restart, which lasts for a couple of seconds, it was impossible to create accounts or login. Fortunately our clients are clever and will automatically retry login.

Doing the switch this way also gave us a good rollback plan. If something went wrong we could fairly easily revert to using Postgres as master in the same way.

So how did it actually turn out?

First, we made sure we were in a good state. Minimizing the number of current inconsistencies by running verification scripts. We then flipped the switch and looked at logs and graphs. And then there was silence. We didn’t really see anything exciting or surprising happening. In fact, it was one of the most boring deploys we’ve done. The only thing left to do was to manually trigger the repair of a few remaining inconsistencies.

Bumps along the road

During the Cassandra migration we bumped into a lot of obstacles. We’ve probably forgot half of them, but we would like to explain some of the bigger ones we had to handle.

Paxos broken?

Cassandra introduced a fancy feature called LWT “Lightweight Transactions” or “conditional inserts” in version 1.2. It basically allows you to guarantee key uniqueness. In the Spotify realm it is important that one username belongs to one user and one user only. We don’t want allow collisions/races if two users concurrently creates an account and decides to have the same username.

So to avoid collisions we decided to use LWT, that under the hood is using a famous distributed consensus algorithm called Paxos. The Cassandra implementation requires a quorum of the replicas to be available and involves four round trips, i.e a quite expensive operation.

We did a benchmark and came to the conclusion that using Paxos when we create new accounts wouldn’t be too expensive. When we tried this live we noticed that a lot of the account creations failed and Paxos claimed that the username/account already existed.

We filed an upstream ticket CASSANDRA-9086 and waited for replies.

The response was that this was an intended behaviour, and we were able to handle this in our service code.

Paxos requires all nodes for CAS

This was a funny bug. As stated earlier, the Paxos algorithm requires a quorum of nodes to be able to reach consensus. If that fails, the operations fails and you get back some additional information. E.g. you get back the number of nodes that were able to take part in the Paxos round.

We noticed that some account creations failed and that the error messages stated that all replicas were needed, not quorum number of replicas. Turned out to be a bug (CASSANDRA-8640) in the Cassandra version we were using. So upgrading the cluster simply solved the issue.

Java driver (revert of JAVA-425, in 2.10.0)

We use an open source Cassandra client built by Datastax, the company that employs most Cassandra committers. After a couple of weeks, and some additional load added, we noticed that the number of connections from our service to the Cassandra cluster dropped, and new connections were not re-established.

This bug turned out to affect live traffic. We use the asynchronous java driver API, not using a separate thread pool. When no new connections were established, the API started blocking. Consequently we hit our concurrency limit and alerts were triggered off hours. Fortunately manual restarts kept the service going while we tracked down the root cause.

Turned out to be a new regression (introduced in JAVA-425) in the client version we were using. Upgrading to latest version, which had reverted JAVA-425, fixed our issue.

Dump dependencies

When moving to Cassandra we also needed to start making daily dumps of the Cassandra database. This is something that is/was being done from the Postgres database, as a dependency for a lot of big data batch jobs (used for personalisation, business analytics, etc.). Dumping from Cassandra introduced subtle differences from the Postgres dump, and not all batch jobs were able to handle the differences. This is something that is still not sorted 100%, but are actively being worked at.

Wrapup

Software projects are being known for taking longer time than expected, and this was no exception. But we did manage to make the switch almost without any excitement. Normally when things just work, we sometimes can’t stop to think that something must be horribly wrong. This was not the case this time, as we hit a lot of the problems in the darkloading phase almost without affecting end users. We also ran verification scripts, tailed logs, and looked at graphs so many times. So by the time we made the switch, we were confident that no boulders would come rolling our way.

This blog post was presented to you as a cooperation from the Login Team at Spotify:

Johanna Albinsson (hannis@spotify.com)
Nick Barkas (snb@spotify.com)
Kristofer Karlsson (krka@spotify.com)
Magnus Melander (magnusm@spotify.com)
Roger Schildmeijer (roger@spotify.com)
Marcus Vesterlund (mzeo@spotify.com)
Alan Wright (alan@spotify.com)

RAMLfications – Python package to parse RAML

A few of us at Spotify are infatuated with RAML – a RESTful API Modeling Language described as “a simple and succinct way of describing practically-RESTful APIs”, extremely similar goal of Swagger.

I’m pleased to announce the initial release of RAMLfications, a Python package that parses RAML and validates it based on the specification into Python objects. Continue reading

Understanding the Spotify Web API

Six months ago, when we launched our Web API, we provided twelve endpoints through which developers could retrieve Spotify catalog data. Today the API has 40 distinct endpoints and more are being added all the time. In this post, I’d like to take you on a brief tour of the API and show you some of the programs that have already been developed with it. Continue reading

Diversify – Creating a Hackathon with 50/50 Female and Male Participants

There have been many efforts during the past few years to raise awareness of gender equality in the IT industry. But it’s been slow progress – and we don’t like slow! So we decided to do things differently. Instead of just hoping to achieve gender equality, we made it a requirement for our hackathon “Diversify”.
1 Continue reading

Personalization at Spotify using Cassandra

 

By Matt Brown and Kinshuk Mishra

At Spotify we have have over 60 million active users who have access to a vast music catalog of over 30 million songs. Our users have a choice to follow thousands of artists and hundreds of their friends and create their own music graph. On our service they also discover new and existing content by experiencing a variety of music promotions (album releases, artist promos), which get served over our ad platform. These options have empowered our users and made them really engaged. Over time they have created over 1.5 billion playlists and just last year they streamed over 7 billion hours worth of music. Continue reading