Saturday, December 22, 2012

HDFS Source Notes

Source is the ultimate source of truth, and for this reason I'm starting to dig into the source code of HDFS to get a deeper understanding of things. At first, the plethora of Java classes seemed like a blur, and it was pretty clear that it can't all be taken in at once. I decided to spend a little time getting familiar with the code organization and then focus on specific workflows. The HDFS source that I'm digging into is for CDH4.1. This version has the Quorum Journal Manager (QJM), and that was the main motivation for me to dig into the source code. QJM is a very significant milestone for HDFS in my opinion. It addresses the only Single Point of Failure (SPOF) in HDFS, i.e. the NameNode, in a manner that doesn't require special machines or setup. HDFS was designed to work with commodity hardware, but so far High Availability of the NameNode required an NFS mount on a filer. With QJM this NFS mount is no longer needed, so installing and managing HDFS is significantly simpler, and it can all run on commodity hardware now.

Getting back to my sojourns in the HDFS code, I've been looking at the native stuff in HDFS today. By native stuff I mean code that's written not in Java but in C/C++, and sometimes even in assembly. For example, crc32 uses the corresponding SSE instruction when available, which should make crc32 calculations very fast.

Another important thing that's done natively is IO. The most interesting part of it is the posix_fadvise calls. When a datanode serves blocks it always reads the data sequentially. It tells the OS about this using the fadvise calls, so that the operating system can optimize for it, for example by reading ahead more data.

Other native stuff includes various compression codecs such as snappy, zlib and lz4.

I totally agree with the choice of keeping the main source in Java and dropping into native code only for stuff that is performance critical. I like the idiom: simple things should be simple and complex things should be possible. I would have been even happier if the source was in Scala rather than Java, but that's a different story.

Monday, November 26, 2012

How to add gist to blogger

  1. Create a gist
  2. Create the blog post
  3. Switch to HTML view
  4. Embed a script tag whose src attribute is the URL of the gist with a .js extension, like this:
    <script src=""></script>

Create EC2 instance with name

Here's a small bash script to launch an EC2 instance with a particular name. The idea is to launch an instance, read its id from stdout, and then apply the Name tag to the instance. Pretty simple.

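# Assumes $num_instances and $instance_name are set before this point.
# sed -n 2p keeps only the second output line, so only one instance id is captured.
instance_id=$(ec2-run-instances -n "$num_instances" -g default -k keyname -t m1.medium -z us-east-1d ami-3d4ff254 | sed -n 2p | awk '{print $2}')
echo "Created instance with id $instance_id"
ec2addtag "$instance_id" --tag "Name=$instance_name"
echo "Renamed instance $instance_id to $instance_name"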
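
To see what the sed/awk part of the pipeline does without launching anything, here is a minimal sketch that runs it on canned text. The sample output is hypothetical: it assumes a RESERVATION line followed by an INSTANCE line with the id in the second field, which is what the script's `sed -n 2p | awk '{print $2}'` implies. The ids and the number of columns are made up.

```shell
# Hypothetical sample of the tool's output; real ids and columns will differ.
sample_output='RESERVATION r-0example 111122223333 default
INSTANCE i-0abc1234 ami-3d4ff254 pending keyname 0 m1.medium'

# Keep only line 2 (the INSTANCE record), then print its second field.
instance_id=$(printf '%s\n' "$sample_output" | sed -n 2p | awk '{print $2}')
echo "Parsed instance id: $instance_id"
```

If the real output ever puts something else on the second line, the script would tag the wrong thing, so it's worth checking the echoed id before relying on it.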

Monday, November 19, 2012

apt-get commands

The concept behind apt-get is simple. Repositories are websites that provide access to a bunch of software. Each piece of software is called a package and has a distinct name. To download and install a package you need to know of at least one repo (repository) that provides it, so you add repositories to your local apt-get sources list. Once you have the repos set up, installing software is a piece of cake. Not only can you install software with a single command, but apt-get also makes sure it's well set up: it figures out the dependencies of the software and installs them automatically. It manages the dependencies of all the software on your machine, letting you concentrate on what you really want, which is using the software. Sounds neat, doesn't it? It's almost too good to be true. Let's see the commands that the above workflow boils down to.

I've borrowed as it is from the following webpage:

Repo addresses are stored in a list at /etc/apt/sources.list. This is where apt-get looks for repos to find the software you ask it for. Entries look like this:
deb [web address] [distribution name] [main|contrib|non-free]

deb breezy main restricted
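
As a rough illustration of that layout, the sketch below splits one entry into its parts using plain shell. The entry itself is hypothetical: example.com is a placeholder address, not a real repository, and the variable names are made up for this demo.

```shell
# Hypothetical entry; example.com is a placeholder, not a real repository.
entry='deb http://example.com/ubuntu breezy main restricted'

# Split into: type, web address, distribution name, and the remaining components.
read -r entry_type repo_url distribution components <<EOF
$entry
EOF

echo "type:         $entry_type"
echo "web address:  $repo_url"
echo "distribution: $distribution"
echo "components:   $components"
```

Everything after the distribution name lands in the last variable, which matches the idea that an entry can list more than one component.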

If you add a repo, make sure you run the following command to update the local apt database with all the software available on the new repo:
apt-get update

To search for a package in the local database:
apt-cache search baseutils

Now, to the most important command. Here is how you install stuff:

apt-get install baseutils

Sometimes you may need to install a .deb package directly. For those times:
dpkg -i gedit-2.12.1.deb

Another useful command is the one to list packages on the machine (quote the pattern so the shell doesn't expand it):
dpkg -l 'gcc*'

It's also very useful to know which files actually got installed for a particular package. For that, use dpkg -L:
dpkg -L gedit

Saturday, October 20, 2012

Implementing resource access blocks in scala

Recently, I came across a scenario where I needed to provide a service that internally needed to open a socket connection. A socket connection is a resource that needs to be closed when done. I could open and close a connection for every call the user makes to the service, but that is not optimal. I needed to give the user a way of using the same connection across multiple calls. In a language like C++ this would have been done by creating the socket connection in the constructor and closing it in the destructor, but we don't have destructors in the Java world. In this situation, I found it best to go with the resource block solution. The idea is that the user supplies a block to the library, and inside that block the user has access to the service in question. The library promises that the same underlying resource or resources will be shared across all the calls inside the block. This makes it possible for the library to allocate a resource before invoking the block and release it afterwards.

Enough of rants, let's see some code:

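object MyLibrary {

  // Runs the user's block against one shared Service; any failure becomes None.
  def apply[T](cb: ServiceInterface => T): Option[T] = {
    var service: Service = null
    try {
      service = new Service   // may fail before the block is ever called
      Some(cb(service))
    } catch {
      case _: Exception => None
    } finally {
      if (service != null) service.shutdown()   // always release the connection
    }
  }

  trait ServiceInterface {
    def serviceMethod1()
    def serviceMethod2()
  }

  // The constructor is visible only inside MyLibrary, so clients cannot instantiate Service.
  class Service private[MyLibrary] () extends ServiceInterface {
    private val connection = openConnection()

    def serviceMethod1() = {...}
    def serviceMethod2() = {...}

    def shutdown() = closeConnection()
  }
}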

Ok, lots going on here. Before we take apart the above code, let's see how the library can be used:

MyLibrary { serviceInterface =>
  serviceInterface.serviceMethod1()
  serviceInterface.serviceMethod2()
}

We don't want the client to instantiate Service directly, so we make Service's constructor private. We just want the client to know about the service interface, hence we pass a parameter of that type into the block. The block itself is accepted by the apply method on the MyLibrary object, which provides the convenient block invocation syntax, i.e. MyLibrary { ... }

In my case I also wanted to make sure that the client never receives an exception but is informed about success or failure through an Option, hence the return type of apply is Option[T], where T is the return type of the user-supplied block. Returning an Option is an important point. We want the user to be able to write a block that returns a value; otherwise we would force the user to write blocks that work through side effects, which is not a good idea. It would have been easier if we could return the exact return type of the block from the library, but we can't. That's because we may fail in resource initialization, i.e. even before we call the block, and this possibility of failure needs to be reflected in the return type.

Thursday, October 4, 2012

Eclipse High-Res on Retina

Eclipse looks very ugly on new Mac Retina displays by default. It can be made to run in High Res very easily though. Here's what you need to do:

  1. Context-click the Eclipse app icon and select Show Package Contents
  2. Open Contents/Info.plist in any text editor
  3. The file ends with
       </dict>
     </plist>
     Just before that, add this:
       <key>NSHighResolutionCapable</key>
       <true/>
  4. Go back to the folder that has the Eclipse app icon
  5. Copy and paste the Eclipse app folder (icon) to create a duplicate, remove the old one and rename the duplicate with the original's name. This is required for Eclipse to read the changes made to Info.plist
  6. Launch Eclipse and enjoy the shiny High Res display

Wednesday, September 12, 2012

    Screen Surprise

    A few days back, I lost reference to a Spark job because my ssh session to the remote machine got disconnected. My colleague was surprised that I wasn't using screen. I was surprised at his surprise because I had heard of screen exactly once before that day. I had no idea it was that popular. I decided to give it a try today.

    To my surprise, screen was already installed on the remote machine I tried it on, so it must be really popular after all. Having used it for just about 30 minutes, I can see why it is so useful. It looks like a must-have tool. There is enough information about it on the web, so I will just provide a few commands here:

    # screen
    Creates a new screen session

    # screen -ls
    Lists available screen sessions

    # screen -r
    Reattaches to a detached screen session

    Once inside a screen session you can press Ctrl-A d to detach from it; the session keeps running in the background.

    The most important use case for me is that even if my ssh session drops accidentally the screen session keeps running on the machine and I can reconnect to it on my next login.

    Sunday, September 2, 2012

    Easy start with sbt

    I'm loving Scala and I can see myself creating a lot of sbt projects in the future. To avoid writing the boilerplate structure of an sbt project every time I start one, I created this simple bash function to do that for me:

    function create_sbt_project() {
      project_folder="$1"   # the project name doubles as the folder name
      mkdir "$project_folder"
      pushd "$project_folder"
      git init
      echo -e "name := \"$1\"\n\nversion := \"1.0\"\n\nscalaVersion := \"2.9.1\"" > build.sbt
      mkdir -p src/main/resources
      mkdir -p src/main/scala
      mkdir -p src/main/java
      mkdir -p src/test/resources
      mkdir -p src/test/scala
      mkdir -p src/test/java
      echo -e "object Hi {\n  def main(args:Array[String]) = println(\"Hi!\")\n}" > src/main/scala/hw.scala
      echo target/ >> .gitignore
      git add .
      git commit -m "Initial Commit"
      sbt compile run clean
      popd   # return to the directory we started from
    }

    $ create_sbt_project test

    will create a folder with standard sbt project structure that compiles and runs.

    Saturday, September 1, 2012

    Syntastic Scala fix

    Syntastic is an awesome vim plugin. I've been programming in Scala recently and syntastic works with Scala, which is awesome, except for one issue that has been annoying me. Since syntastic uses "scala" on the command line to find errors in the script, it treats the package statement at the start of the file as an error. "scalac" is a better command for finding errors, but when compiling a file in isolation it complains about imports from dependent libraries. A reasonable option seems to be to stop scalac after the parsing step.

    Here's the patch:

    diff --git a/syntax_checkers/scala.vim b/syntax_checkers/scala.vim
    index f6f05af..941054d 100644
    --- a/syntax_checkers/scala.vim
    +++ b/syntax_checkers/scala.vim
    @@ -24,7 +24,7 @@ if !exists("g:syntastic_scala_options")

     function! SyntaxCheckers_scala_GetLocList()
    -    let makeprg = 'scala '. g:syntastic_scala_options .' '.  shellescape(expand('%')) . ' /dev/null'
    +    let makeprg = 'scalac -Ystop-after:parser '. g:syntastic_scala_options .' '.  shellescape(expand('%'))

         let errorformat = '%f\:%l: %trror: %m'

    Basically just set makeprg to 'scalac -Ystop-after:parser '. g:syntastic_scala_options .' '.  shellescape(expand('%'))

    Friday, June 29, 2012

    Tabular format output in bash using column -t

    Often, commands in bash produce many similar lines that contain the same fields with different values. The fields frequently don't line up because the values differ in length. In such cases column -t can print the output as a neatly aligned table.

    pankaj-> mount
    /dev/disk0s2 on / (hfs, local, journaled)
    devfs on /dev (devfs, local, nobrowse)

    pankaj-> mount | column -t

    /dev/disk0s2  on  /     (hfs,    local,  journaled)
    devfs         on  /dev  (devfs,  local,  nobrowse)

    As you can see, piping the output of mount to column with the -t option prints it in a tabular format that is much more readable. By default column -t uses whitespace as the column delimiter, but it can easily be changed to anything else with -s:

    pankaj-> cat /etc/passwd
    nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
    root:*:0:0:System Administrator:/var/root:/bin/sh
    daemon:*:1:1:System Services:/var/root:/usr/bin/false

    pankaj-> cat /etc/passwd | column -t -s:
    nobody  *  -2  -2  Unprivileged User     /var/empty  /usr/bin/false
    root    *  0   0   System Administrator  /var/root   /bin/sh
    daemon  *  1   1   System Services       /var/root   /usr/bin/false

    Here the colon (:) has been used as the column delimiter.

    Friday, June 15, 2012

    Bash: switch folders quickly with $CDPATH

    Most Unix users have a few favorite folders for storing source repositories and other stuff they work on daily. Bash has a feature that makes it very easy to switch to these folders, and it's very simple to use: just add them to the CDPATH environment variable.

    e.g. put this in your .profile 
    export CDPATH=:~/src:~/work

    Different folders are separated by colons. An empty entry, such as the one created by a colon at the start or end, stands for the current directory.

    What you get:
    1. You can switch to any folder under the folders specified in $CDPATH by just providing the folder name, without specifying the full path. e.g.
      to switch to "~/src/graphite" you can just say "cd graphite"
    2. You also get tab completion based on folders under CDPATH folders. This is incredibly useful.
    3. If a folder with the same name is present in multiple locations, the entries in $CDPATH are searched in order. Because the CDPATH above starts with an empty entry, the current directory gets the highest preference, followed by the other folders in the order listed. This makes sure that CDPATH doesn't confuse your existing workflow in any way, as the current directory always gets priority.

    This is one of those tricks you spend five minutes on one day and then enjoy the benefits of for the rest of your life.
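
To try this out without touching your real setup, here is a small sandbox sketch. All the directory names are hypothetical and live under a temporary folder; the only real feature used is CDPATH itself, where the leading empty entry stands for the current directory.

```shell
# Hypothetical sandbox: directory names are made up for this demo.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/src/graphite" "$demo_root/work/notes" "$demo_root/elsewhere"

# Leading empty entry = current directory, then the two "favorite" folders.
CDPATH=":$demo_root/src:$demo_root/work"

cd "$demo_root/elsewhere"
# graphite is not under the current directory, so it is resolved through CDPATH.
# cd prints the resolved path when CDPATH is used, so silence it here.
cd graphite > /dev/null
cdpath_result=$(basename "$PWD")
echo "now in: $cdpath_result"
```

The shell echoes the full path whenever a CDPATH entry was used for the jump, which is a handy confirmation when you type `cd graphite` interactively.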