Achieving fault-tolerance with intelligent daemons

Powering the Spotify service is a backend of dozens of different, specialized service implementations. For example, we have a playlist system that is responsible for storing and retrieving users' playlists, a user service responsible for login, and an ad system with advertisements and the surrounding business logic.

Spotify’s backend infrastructure and service reliability engineers spend a great deal of time improving and tuning the fault tolerance of these systems. To give our users the best experience, we want to build backend service models that scale in line with user growth and that minimize downtime and failures.

In this blog post I will discuss how the Spotify backend handles dependencies in the presence of different kinds of failures. Specifically, I will discuss how one backend service can achieve reliability even when the services it depends on have performance problems or are completely down.

At scale, failures happen

The Spotify backend consists of thousands of servers. At this scale hardware problems become a common occurrence, and all systems must be designed with this in mind. Sometimes the service that has the data you need will not respond, or will respond slowly, because of a hardware failure or a similar issue. In addition, the Spotify server park is spread out over several data centers around the globe. Sometimes the network connectivity between data centers may be temporarily slow or down entirely. This is especially true for requests that have to traverse the Atlantic.

As our backend servers are specialized components, it’s common that an incoming service request may in turn generate one or more service requests to another service or part of the system. For example, a request to get which ads to display to a user (a request to the ad system) may imply a request to the login system, and possibly other systems as well.

Naturally, we want to avoid cascading failures: a hiccup of one system should not take down or overload another system. In addition, failures should be handled as gracefully as possible, and preferably not affect the performance of the calling service.

Service degradations and effects

Fundamentally, there are three different ways in which a service can be degraded:

  • A service can fail to respond at all. This is sometimes known as an omission failure. The typical example would be a failure to set up a TCP connection because the host is down or not accessible due to firewall restrictions.
  • A service can be slow to respond. A typical example would be communicating with a server that is overloaded.
  • A service can be corrupt: it sends back incorrect or corrupt data. We will ignore this failure scenario in this article.

It is also worth differentiating temporary glitches from longer service degradations:

  • A service can be in a degraded state for a short period of time. These transient failures typically happen due to a network glitch (typically resulting in omission failures) or a temporary spike in traffic (typically causing the service to become slow).
  • A service can be in a degraded state for a longer period of time. This could be the result of a hardware failure or more significant network issue.

How does this affect other services? Consider the relationship where service A depends on service B, meaning that when A gets a request it in turn makes a request to service B. We can say that A has a strong dependency on B if the response from B is required to complete the request, whereas with a weak dependency A can still complete its request when the request to B fails.

  • On omission failures from B, where B is a strong dependency, A fails its own request. If B is a weak dependency, A can respond with a partial or incomplete response (perhaps with fallback values filled in).
  • If B becomes slow, requests from A will also become slow, up to the timeout. A may also hit concurrency limitations (this is a very complex topic, and will not be covered extensively in this article).
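
The difference between a strong and a weak dependency can be sketched in a few lines of Python. This is only an illustration under assumptions: `DependencyError`, `call_strong`, `call_weak` and the shape of the ad response are hypothetical names made up for this example, not part of any Spotify service.

```python
class DependencyError(Exception):
    """Raised when a call to a dependency fails or times out."""


def call_strong(fetch):
    # Strong dependency: the response is required, so a failure in B
    # propagates and becomes a failure of A's own request.
    return fetch()


def call_weak(fetch, fallback):
    # Weak dependency: A can still answer, using a fallback value in
    # place of the part of the response that B would have supplied.
    try:
        return fetch()
    except DependencyError:
        return fallback


def failing_fetch():
    # Stand-in for a request to B that ends in an omission failure.
    raise DependencyError("B did not respond")


# Hypothetical ad response in which the user details are optional.
ads_response = {
    "ads": ["ad-1", "ad-2"],
    "user": call_weak(failing_fetch, fallback=None),
}
```

With a weak dependency the caller still gets a usable, if incomplete, response. The same failure passed through `call_strong` would surface as an error to whoever called A.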

Handling failures

On a very low level, handling network failures requires proper software engineering practices such as handling exceptions and error codes from the networking libraries and cleaning up resources properly. This is still surprisingly hard to get right; at Spotify we have had several problems with common libraries leaking memory or file descriptors.

Especially for weak dependencies, it can make sense to fast-fail requests. If A sends 100 requests to B, and all of these requests fail or are slow, it is a reasonable assumption that the 101st request will fail or be slow. Spotify uses the common “circuit breaker” pattern to solve this. The outcome of each request to a server modifies the caller’s notion of the server’s health status. If the health of a particular service, as understood by the client, falls below a certain predefined threshold, the circuit breaker will trigger. In this case most requests will return with an error directly on the client side, without ever reaching the troubled server. Only a small fraction of the requests will trickle through to detect if the service is getting healthier.
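
A minimal circuit breaker along these lines could look like the sketch below. It is not Spotify's implementation. The sliding window, the `min_health` threshold and the "let 1 in `probe_every` requests through" rule are all assumptions chosen to keep the example small and deterministic.

```python
from collections import deque


class CircuitOpenError(Exception):
    """Raised on the client side when a request is fast-failed."""


class CircuitBreaker:
    def __init__(self, window=100, min_health=0.5, probe_every=10):
        # Outcomes of the most recent requests (True means success).
        self.outcomes = deque(maxlen=window)
        self.min_health = min_health
        # While tripped, let 1 in `probe_every` requests through.
        self.probe_every = probe_every
        self.seen_while_tripped = 0

    def health(self):
        # Until the window is full there is too little data to judge,
        # so the service is assumed to be healthy.
        if len(self.outcomes) < self.outcomes.maxlen:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def call(self, request):
        if self.health() < self.min_health:
            self.seen_while_tripped += 1
            if self.seen_while_tripped % self.probe_every != 0:
                # Fast-fail without ever contacting the server.
                raise CircuitOpenError("dependency looks unhealthy")
        try:
            result = request()
        except Exception:
            self.outcomes.append(False)
            raise
        self.outcomes.append(True)
        return result
```

Each outcome updates the caller's view of the server's health. Once that view drops below the threshold, most calls raise `CircuitOpenError` locally, and the occasional probe request is what lets the health value recover when the server does.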

When fast-failing is in use and the health monitor has determined that the service is reliable again, it is important to reintroduce service requests slowly and in a controlled manner instead of allowing all requests through at the same time, which could overload the service again.
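
One way to reintroduce traffic gradually is to admit a growing share of requests over a fixed ramp-up period. The sketch below assumes a linear ramp. `RampUp`, `ramp_seconds` and `start_fraction` are hypothetical names for this example, and the post does not say which ramp-up strategy Spotify uses.

```python
class RampUp:
    """Admit a growing share of requests after a service recovers."""

    def __init__(self, ramp_seconds=30.0, start_fraction=0.25):
        self.ramp_seconds = ramp_seconds
        self.start_fraction = start_fraction
        self.credit = 0.0

    def allowed_fraction(self, seconds_since_recovery):
        # Grow linearly from start_fraction to 1.0 over ramp_seconds.
        progress = seconds_since_recovery / self.ramp_seconds
        progress = min(max(progress, 0.0), 1.0)
        return self.start_fraction + (1.0 - self.start_fraction) * progress

    def admit(self, seconds_since_recovery):
        # Accumulate fractional credit per request and admit one request
        # each time a whole unit is available. This spreads admitted
        # requests evenly without relying on randomness.
        self.credit += self.allowed_fraction(seconds_since_recovery)
        if self.credit >= 1.0:
            self.credit -= 1.0
            return True
        return False
```

Requests that are not admitted would be fast-failed exactly as if the circuit breaker were still tripped. Once the ramp-up period has passed, `allowed_fraction` returns 1.0 and every request is admitted.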

Daemonizing the intelligent client

Companies or individuals building large distributed systems eventually build shared infrastructure for network communication. It’s common to have a networking library supporting features such as dependency management, health monitoring, circuit breaking and metrics gathering. These “intelligent” libraries are supposed to solve these difficult problems, so they don’t have to be re-invented every time a new service is built.

At Spotify we have had good success with another approach, and that is to put all these features (circuit breaking, health monitoring etc.) in a separate daemon. Instead of service A sending a request to B (possibly to a different data center), it will send an IPC request to a daemon running on localhost, which in turn handles the communication with B.

At Spotify we call this daemon Arrow, a naming that springs from the fact that service topologies are often drawn as a bunch of boxes and arrows. It works like this:

  • Arrow is a daemon running on the same machine as the service.
  • Instead of service A having a client for B, it has a (very thin) client for communication with Arrow.
  • Arrow will relay the request to service B.
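
To give a feel for how thin such a client can be, here is a sketch of one request/response round trip to a local daemon. The post does not describe Arrow's actual protocol, so everything here is an assumption made for illustration: the newline-delimited JSON framing, the `service` and `payload` fields, and the function name `send_via_daemon`.

```python
import json
import socket


def send_via_daemon(sock, service, payload):
    """Send one request to the local daemon and return its reply.

    `sock` is an already connected stream socket, for example a Unix
    domain socket to a daemon on the same machine. The framing used
    here (one JSON document per line) is hypothetical.
    """
    request = {"service": service, "payload": payload}
    sock.sendall(json.dumps(request).encode("utf-8") + b"\n")

    # Read until the daemon's newline-terminated reply is complete.
    buffer = b""
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("daemon closed the connection")
        buffer += chunk
    return json.loads(buffer)
```

Note what the client does not contain: no connection pool, no health tracking and no circuit breaker. In this design all of that would live in the daemon, and the service only needs to know how to talk to localhost.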

It has the following features:

  • Arrow has several backends, such as HTTP.
  • It has a connection pool.
  • Arrow has health monitoring to keep track of the health of the individual servers that make up B.
  • Arrow has a circuit breaker for fast failing requests.
  • Arrow collects lots of metrics about requests, such as request times, request rates, request failures, number of incoming connections, health of nodes etc.

Having Arrow as a separate process instead of a client library has some benefits.

  • For services consisting of multiple worker processes, having one Arrow means a shared view of the world (such as the health of a service). Additionally, the connection pool can be shared, resulting in less resource utilization. With workers running individual clients, each worker would have its own view of the world.
  • The process can be debugged and inspected (netstat, lsof, strace, gdb etc.) with less noise than when inspecting the main service running the client.
  • The process can be restarted, reloaded or killed with less impact on the main service. This has proven useful for deployments, upgrades and also for shutting down traffic between services.
  • Configuration is simplified as Operations has one configuration file to care about for dependencies and networking (you could of course have a configuration file for your client library, but that concept is a bit unusual).
  • From a network-engineering standpoint, it is good practice to reduce the number of different programs that can cause network packets to be sent between data centers.
  • Aesthetically, having one program that does one thing and does it well instead of monolithic processes is very Unix, which we like at Spotify.

Of course, the separate process comes with some disadvantages as well. There is some overhead from doing the IPC instead of sending a request immediately. Thankfully, this has not been a problem for us to date.

Symmetry: Preventing failures from the other side

At Spotify we run Arrow when we want to reduce the effects of a dependency being degraded. However, the other side of the coin is preventing one service from overloading another, which can cause failures and slowness.

To achieve this, Spotify has a load-balancing configuration in front of certain services for rate limiting traffic and blocking any service that is generating too many requests. For example, most of our HTTP servers have an Nginx in front of them, acting as a load balancer and (sometimes) rate limiter.

This additional component has many of the same benefits as running Arrow as a separate process, and there is symmetry to it as well:

  • The load balancer is for controlling incoming connections.
  • Arrow is for controlling outgoing connections.

A specific benefit of the load balancer that is not listed above is that for services that have multiple workers, it reduces the number of ports the server is listening on, which simplifies service discovery.
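
The idea of rate limiting each calling service can be illustrated with a small token bucket. This is a swapped-in technique chosen for the example. It is not taken from Spotify's setup and says nothing about how Nginx is configured or implemented. `TokenBucket`, `PerCallerLimiter` and their parameters are hypothetical names.

```python
class TokenBucket:
    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = None

    def allow(self, now):
        # Refill according to the time elapsed since the previous call,
        # never exceeding the bucket's capacity.
        if self.last is not None:
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerCallerLimiter:
    """Keeps one bucket per calling service."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.burst = burst
        self.buckets = {}

    def allow(self, caller, now):
        if caller not in self.buckets:
            self.buckets[caller] = TokenBucket(self.rate, self.burst)
        return self.buckets[caller].allow(now)
```

Because each caller has its own bucket, one service that generates too many requests is throttled without affecting the others. The current time is passed in as `now` (in seconds) so the behaviour is easy to test. A real limiter would typically read a monotonic clock instead.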

Conclusion

Intuitively, adding extra components or “cogs” should be avoided as it can make your service more fragile and complex. However, while each individual server grows in complexity, the complexity of the cluster of services as a whole is reduced. On the service level, having Arrow as a process instead of a library reduces the code complexity of the service program.

We have had great success with Arrow in production. We have an implementation of Arrow running on the login servers for communicating with the user-data servers. The login service has a weak dependency on user-data, and will quickly start to fast-fail requests if user-data gets slow or stops responding.

If you find this topic interesting, the Infrastructure team is currently hiring.