Rock the night

It was the middle of the night, 10th of May 2011, and we were at step twentysomething of the rollout plan, the step that basically meant no turning back. It was the night we had decided to do the switchover from the old to the new login system. It was scary, potentially causing downtime of login and account creation. But it was necessary.

Back in the old days we had a service called “users”. It had many responsibilities: it was used for account creation, it was required for logging in to Spotify, and it saved some client settings server side. It even enforced the restriction that you can only offline three different devices at the same time. It served us well up to the point where it became a serious scalability bottleneck. We had grown so fast that the service simply couldn’t handle the load, regularly leading to login failures. And no login means no Spotify.

Something had to be done, and the user2 service was created. Instead of the single database of the old system, it used a cluster of database replicas. And now we were going to deploy it. The challenge: switching from users to user2, migrating all the account data, while minimizing downtime and disruption to the dozen or so other services that depended on the service to function.

Not only were there thousands of lines of new code that would get traffic, there was a new database as well. Once that started to take writes, it would be difficult to move back to the old database if there were problems.

Lots of preparation led up to this night. A migration plan of no less than 56 steps had been created. The basic idea: migrate most of the data to the new database, shut down writes to the old database, migrate the data that had changed since the last migration, point traffic to the new user2 service, and enable writes to the new database. All while juggling the other services to minimize their downtime.

We had pre-migrated data from the old database to the new, using the fact that all rows in the old database had a “last updated” timestamp (so we could efficiently migrate a smaller delta up to the point where writes were disabled). We had prepared sanity checks. We turned off recurring payments in product-user, which could in theory have been a problem. We shut down parts of the website. There were interesting firewall rules involved all over the place, disabling some write functionality.
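The timestamp-based delta migration can be sketched roughly like this (a minimal illustration only: the table layout, column names, and `migrate_delta` helper are hypothetical, and the real migration certainly wasn’t a single SQL query):

```python
import sqlite3


def migrate_delta(old_db, new_db, since):
    """Copy rows changed since the last migration checkpoint.

    Relies on every row in the old database carrying a
    'last_updated' timestamp, so each pass only has to move
    the delta modified after the previous pass.
    """
    rows = old_db.execute(
        "SELECT username, data, last_updated FROM users"
        " WHERE last_updated > ?",
        (since,),
    ).fetchall()
    # Upsert into the new database: a row migrated in an earlier
    # pass gets overwritten by its newer version.
    new_db.executemany(
        "INSERT OR REPLACE INTO users (username, data, last_updated)"
        " VALUES (?, ?, ?)",
        rows,
    )
    new_db.commit()
    return len(rows)  # number of rows moved in this pass
```

The point of the scheme is that the final pass, run after writes to the old database are disabled, only has to cover the few hours since the bulk pre-migration, rather than the whole dataset.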

Now we were doing the step in the migration plan that was basically the point of no (or at least, inconvenient) return: making the new system take not only reads but writes as well. This meant temporarily shutting down parts of the service, causing brief but user-visible outages of some functionality. It also meant that once the new database took writes, we would be in a very difficult situation if we had to revert.

What happened next was roughly as follows: we made the old users database read only, then we migrated the rest of the data to the new database. This is when things got interesting. We had already migrated all data modified up to a point a couple of hours earlier. The calculations said migrating the remaining delta would be very fast, minimizing downtime.

However, the old database was so severely overloaded that migrating the delta would take four hours. So we had to temporarily disable some functionality in the old users service, while still allowing some logins to function. Having done that, the migration went quickly, and we could proceed.

With all data migrated, we changed DNS to send traffic to the new user2 service, which then started to take logins. Win! With logins working, it was then a matter of sanity checking and enabling functionality for all the other services requiring data from the user2 service.

All our careful planning paid off. The deploy went well, with only a partial service outage during the middle of the night (during the migration phase). The only other hiccup was that the service, being so well connected, hit the default file descriptor limit of 1024 a few hours after the deploy. That was a simple thing to fix, and nowadays our servers default to a much higher limit.
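For the curious, the limit in question is the per-process cap on open file descriptors, which a busy network service burns through one per connection. A process can inspect its own limit, and raise the soft limit up to the hard cap, like this (a generic sketch, not Spotify’s actual fix, which was done at the OS configuration level):

```python
import resource

# Each process has a soft limit (enforced) and a hard limit (the
# ceiling the soft limit may be raised to without privileges).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit as far as the hard cap allows; raising the
# hard cap itself needs root (e.g. via /etc/security/limits.conf).
resource.setrlimit(resource.RLIMit_NOFILE if False else resource.RLIMIT_NOFILE,
                   (min(4096, hard), hard))
```

With a 1024-descriptor default, a service talking to a dozen other services plus its database replicas can exhaust the limit surprisingly quickly once real traffic arrives.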

In the end, the deploy was considered a success. Which brings up an interesting point. The measure of success for this 19-hour workday, involving several engineers, was that so few people noticed the work we did. Of course, after the deploy of the new system there were fewer outages, leading to our users being less angry at us, which is of course nice.