Snakebite: a pure Python HDFS client

May 7, 2013 Published by Wouter de Bie

As we all know, Hadoop is great and here at Spotify we are big fans of it. We use it to process data for a lot of different purposes like business intelligence, recommendations and reporting. But even though Hadoop is great at crunching data, interacting with it can be hard sometimes. For example, creating complex data pipelines is non-trivial and for that we created luigi.

Another annoyance we had with Hadoop (and HDFS in particular) is that interacting with it is quite slow. For example, when you run `hadoop fs -ls /`, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the NameNode is contacted before the result is displayed. This takes at least a couple of seconds and can become slightly annoying. It gets even worse when you do a lot of existence checks on HDFS, something we do a lot with luigi to see if the output of a job exists.
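To make the cost concrete, here is a minimal sketch of what such an existence check looks like when it goes through the stock command line client. The helper names `hdfs_path_exists` and `missing_outputs` are made up for illustration and are not part of luigi or Hadoop; the only thing taken from Hadoop is `hadoop fs -test -e`, which exits with status 0 when the path exists.

```python
import subprocess


def hdfs_path_exists(path, run=subprocess.call):
    """Return True if `path` exists on HDFS, using the stock Hadoop CLI.

    `run` is injectable so the function can be exercised without a cluster;
    by default every call spawns a new `hadoop` process (and thus a new JVM).
    """
    # `hadoop fs -test -e PATH` exits with status 0 when PATH exists.
    return run(["hadoop", "fs", "-test", "-e", path]) == 0


def missing_outputs(paths, exists=hdfs_path_exists):
    """Return the paths that do not exist yet, one existence check per path."""
    return [p for p in paths if not exists(p)]
```

With a few hundred job outputs to check, the startup cost of a couple of seconds per call adds up to minutes of waiting before any real work starts.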

On the programmatic side, there are a few workarounds available. One is using HttpFS. This allows you to make REST calls over HTTP to retrieve information from HDFS, but it involves having yet another service running. And there is no nice command line interface for it either.
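For reference, a REST-based existence check could look roughly like the sketch below. It assumes an HttpFS gateway that speaks the WebHDFS REST API (`op=GETFILESTATUS` under `/webhdfs/v1`) on its usual default port 14000 with simple `user.name` authentication; the host name, the user and the function names are placeholders, and the code is written for Python 3.

```python
from urllib.error import HTTPError
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def httpfs_status_url(host, path, user, port=14000):
    """Build the GETFILESTATUS URL for an absolute HDFS `path`."""
    query = urlencode({"op": "GETFILESTATUS", "user.name": user})
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), query)


def httpfs_path_exists(host, path, user, port=14000):
    """Return True if the gateway reports a file status for `path`.

    A missing path is reported as HTTP 404; any other error is re-raised.
    """
    try:
        with urlopen(httpfs_status_url(host, path, user, port)):
            return True
    except HTTPError as err:
        if err.code == 404:
            return False
        raise
```

This avoids the JVM startup on the client side, but every check still depends on the extra HttpFS service being up and reachable.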

Another option is to use libhdfs, a C API for Hadoop, but the downside is that it still starts a JVM process. And if you want to use it from a language other than C (in our case Python), you’ll have to write bindings for it.

So, to circumvent the slow interaction with HDFS and to have a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to open source it at http://github.com/spotify/snakebite.

To show that it’s (much) faster, I ran a simple test against our production cluster:

wouter@foo:~$ time for i in {1..10}; do hadoop fs -ls / > /dev/null; done

real	0m14.464s
user	0m21.761s
sys	0m1.148s

wouter@foo:~$ time for i in {1..10}; do snakebite ls / > /dev/null; done

real	0m1.639s
user	0m1.072s
sys	0m0.160s

Snakebite currently contains a Python library (client.py), a command line client (bin/snakebite) and a mini cluster wrapper (minicluster.py). Since we wanted to have real integration tests, we wrote a wrapper around Hadoop’s minicluster that is started before tests are executed, but it might be useful in other scenarios as well.

Snakebite currently supports only actions that involve just the NameNode (like ls, rm, mv, stat, etc.), but there are plans to implement actions that require interaction with the DataNodes as well.
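As a rough idea of how the library side can be used, here is a small sketch. It assumes that `snakebite.client.Client` takes the NameNode host and port, and that `ls` accepts a list of paths and yields one dictionary per entry with a `path` key; check the documentation for the exact signatures in your version. The host, port and helper names are placeholders.

```python
def make_client(host="localhost", port=8020):
    """Create a Snakebite client for the NameNode at host:port.

    The import is deferred so that list_paths() below can also be used
    with any object that offers a compatible ls() method.
    """
    from snakebite.client import Client  # requires the snakebite package
    return Client(host, port)


def list_paths(client, path="/"):
    """Return the paths of all entries directly under `path`."""
    # ls() is assumed to take a list of paths and to yield one dict per entry.
    return [entry["path"] for entry in client.ls([path])]
```

Calling `list_paths(make_client())` would then be the library equivalent of `snakebite ls /`, without starting a JVM.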

The Snakebite repository can be found at http://github.com/spotify/snakebite and documentation at http://spotify.github.io/snakebite/

