Saturday, December 22, 2012

HDFS Source Notes

Source is the ultimate source of truth and for this reason I'm starting to dig into the source code of HDFS to get a deeper understanding of things. At first, the plethora of java classes seemed like a fuzz and it was pretty clear that it can't be taken all in together. I decided to spend a small time getting familiar with the code organization and then focus on specific workflows. The HDFS source that I'm digging into is for CDH4.1. This version has Quorum Journal Manager and that was the main motivation for me to dig into the source code. QJM is a very significant milestone for HDFS in my opinion. QJM eliminates the only Single Point of Failure(SPOF) in HDFS i.e. the namenode, in a manner that doesn't require special machines and set up. HDFS was designed to work with commodity hardware but so far the HighAvailability of NameNode required using an NFS mount on a filer. With QJM this NFS mount is no longer needed and the installation as well as management of HDFS is significantly simpler, plus it can all run on commodity hardware now.

Get back to my sojourns in HDFS code, I've been looking at the native stuff in HDFS today. By native stuff I mean stuff that's not done in Java but C/C++ and sometimes even in assembly. crc32 e.g. uses the corresponding SSE instruction when available. This would mean that crc32 calculations must be blazingly fast.

Another important thing that's done natively is native IO. Most interesting of which are the posix_fadvise calls. When datanode serves blocks it always reads the data sequentially. It tells OS about this using the fadvise calls so that the operating system can optimize for it. Operating System can optimize more by reading ahead more data.

Other native stuff includes various compression codecs such as snappy, zlib and lz4.

I totally agree with the choice of keeping the main source in Java and digging into native code for stuff that is performance critical. I like the idiom, simple things should be simple and complex things should be possible. I would have been even happier if the source was in Scala rather than Java but that's a different story.