Tuesday 20 March 2012

HBase Performance Evaluation continued - The Smoking Gun

Update: See also part 1 and part 3.

In my last post I described my initial foray into testing our HBase cluster performance using the PerformanceEvaluation class.  I wasn't happy with our conclusions, which could largely be summed up as "we're not sure what's wrong, but it seems slow".  So in the grand tradition of people with itches that wouldn't go away, I kept scratching.  Everything that follows is based on testing with PerformanceEvaluation (the jar patched as in the last post) using a 300M row table built with PerformanceEvaluation sequentialWrite 300, and tested with PerformanceEvaluation scan 300.  I ran the scan test 3 times, so you should see 3 distinct bursts of activity in the charts.  And to recap our hardware setup - we have 3 regionservers and a separate master.
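For a sense of scale, PerformanceEvaluation writes roughly 1KB per row (its default values are about 1000 bytes, plus key overhead), so a quick back-of-the-envelope check on the table size looks like this (the per-row size is my assumption, not something measured):

```python
# Rough size of the 300M-row PerformanceEvaluation table.
# Assumption: ~1 KB per row (PE's default ~1000-byte values plus key overhead).
rows = 300_000_000          # sequentialWrite 300 => 300M rows
bytes_per_row = 1000        # assumed average payload per row

table_gb = rows * bytes_per_row / 1e9
print(f"~{table_gb:.0f} GB of raw data before HDFS replication")
print(f"~{table_gb * 3:.0f} GB on disk with 3x HDFS replication")
```

At roughly 300GB spread over 3 regionservers, the table is far bigger than anything the block cache could hold, so the scan test genuinely exercises the io path.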

The first unsettling ganglia metrics that kept me digging were ethernet bytes_in and bytes_out.  I'll recreate those here:
Figure 1 - bytes_in (MB/s) with single gigabit ethernet
Figure 2 - bytes_out (MB/s) with single gigabit ethernet
This is a single gigabit ethernet link, and that is borne out by the Max values, which sit near the theoretical maximum of roughly 120 megabytes per second (MB/s); as you can see, they certainly aren't pinned at 120 constantly in the way one would expect if ethernet were our bottleneck. But that bytes_in graph sure looks like it's hitting some kind of limit - steep ramp up, relatively constant transfer rate, then steep ramp down.  So I decided to try channel bonding, to effectively use 2 gigabit links "bonded" together on each server.  It turns out that's not quite as trivial as it may seem, but in the end I got it working by following the Dell/Cloudera configuration suggestion of mode 6 / balance-alb, which is described in more detail in the Linux kernel bonding documentation. The results are shown in the following graphs:
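For reference, the mode 6 / balance-alb setup boils down to configuration along these lines on a RHEL-style box (file paths, interface names, and addresses here are my assumptions, not details from this cluster - check the kernel bonding docs for your distro):

```
# /etc/modprobe.d/bonding.conf -- load the bonding driver in balance-alb mode
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.0.0.11        # assumed address
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Once the bond is up, /proc/net/bonding/bond0 shows the active mode and the state of each slave link.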

Figure 3 - bytes_in (MB/s) with dual (bonded) gigabit ethernet
Figure 4 - bytes_out (MB/s) with dual (bonded) gigabit ethernet
Unfortunately the scale has changed (ganglia ain't perfect), but you can see we're now hovering a bit higher than in the single ethernet case, and at some points swinging significantly above the single ethernet limit (a few gusts up to 150).  So it would seem that even though the single gigabit graphs didn't look like they were hitting the theoretical maximum, they were definitely limited.  Not by much, though - maybe 10-15%?  As they say, the proof is in the pudding, so what was the final throughput for the two tests?

Ethernet | Throughput (rows/s)

Looks like 5%. Certainly not the doubling I was hoping for! Something else must be the limiting factor then, but what?  Well, what does CPU say?
Figure 5 - cpu_idle with dual (bonded) gigabit ethernet
That CPU is saying, "Feed me Seymour!".  If the CPU is hungry, what about the load average?
Figure 6 - one minute load average with dual (bonded) gigabit ethernet
These machines have 8 hyper-threaded cores, so effectively 16 CPUs. The load average looks like it's keeping the CPUs fed - hovering around 20. But hang on - the load average isn't only about CPU; it's about the work the whole machine needs to do, including all the io subsystems. So if it ain't network, and it ain't CPU, then it's gotta be disks. Let's look at how much time the CPU spent waiting on io (which in this case basically means "waiting on disks"):
Figure 7 - percentage cpu spent in wait_io state, with dual (bonded) gigabit ethernet
From a pure gut perspective, that seems kind of high.  But on closer inspection, that graph also looks familiar - it's basically the same as Figure 6 - the load average. So what can we deduce?  When ethernet is removed as a limiting factor the run queue is filled with processes that all cause an increase in wait_io - which means our processes would all finish faster if they didn't have to wait for io.  The limiting factor must, then, be the disks.
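The two numbers I kept eyeballing in ganglia - load average versus CPU count, and the iowait share of CPU time - can be pulled straight from /proc on any of the regionservers. A minimal sketch (Linux-only; note this reads cumulative iowait since boot rather than ganglia's windowed view):

```python
import os

# 1-minute load average vs. logical CPU count: load well above the CPU count
# that is NOT matched by high user/system CPU time points at processes
# blocked on io rather than compute.
load1, load5, load15 = os.getloadavg()
ncpus = os.cpu_count()
print(f"load1={load1:.2f} across {ncpus} logical CPUs")

# iowait share since boot, from the aggregate "cpu" line of /proc/stat.
# Fields after "cpu" are: user nice system idle iowait irq softirq ...
with open("/proc/stat") as f:
    fields = [int(x) for x in f.readline().split()[1:]]
iowait_pct = 100.0 * fields[4] / sum(fields)
print(f"cpu time spent in iowait since boot: {iowait_pct:.1f}%")
```

For a live per-interval view during a scan run, iostat or vmstat gives the same iowait number windowed the way ganglia draws it.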

Recap (TL;DR)
Ethernet looked suspiciously capped, but performance (and ethernet usage) only increased slightly when the cap was lifted. Closer inspection of the resulting load average and CPU usage showed that the limiting factor was in fact the disks.  

Solution?  Replacing the disks might help a little - they're 3-year-old 2.5" SATA at 7.2k rpm, and more modern 10k 2.5" drives would be marginally faster.  The real benefit would probably come from more disks - the current wisdom on the mailing lists and in the documentation is to maximize the ratio of disk spindles to cpu cores, and I think that's borne out by these results.  In the coming weeks we're going to build a new cluster of 6 machines with 12 disks each, and I very much look forward to testing my theories once they're up and running.  Stay tuned!
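The spindles-per-core arithmetic for the planned cluster is quick to sketch (assuming the new boxes have 8 physical cores like the current ones - that core count is my assumption):

```python
# Spindles-per-core ratio, the number the mailing-list wisdom says to maximize.
# Planned boxes: 12 disks each; core count assumed equal to the current 8.
disks_new, cores_new = 12, 8
ratio = disks_new / cores_new
print(f"planned cluster: {ratio:.2f} spindles per physical core")

# Aggregate spindles across the planned 6-machine cluster.
print(f"total spindles: {6 * disks_new}")
```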


  1. I wonder what would happen if you upped the hfile/hbase block size, Oliver? The default is 64KB. If you upped it to 256KB, given you are optimizing for the scan case, maybe you'd be able to move through more data for the same number of seeks/iowait?

    1. @stack - I'm happy to give it a try, but as I understand it HBase never touches the actual disks - that's HDFS's job. So how would changing the HFile block size have any impact on accessing the physical disks?

    2. Since you would load 256KB instead of 64KB at a time, in theory you'd go to disk 4x less often. And HBase doesn't drop the blocks in between rows: since the table was built with a sequential write, the next row you're looking for should always be either in the block you're already in or in the very next one.

    3. Thanks JD. After some discussion with Lars I think I understand the point now - I was under the mistaken impression that HDFS cached its blocks, which are way bigger than HFile's blocks, so in that case a bigger HFile block wouldn't matter. But in theory it does. In practice though, I tried it and actually saw a small decrease in performance (a few percent). Because these are almost never apples-to-apples comparisons (I had to reboot one of the machines, regions moved around, I did a major compaction, etc) I'm not surprised that the difference is negligible. The theory is sound, though, so maybe I'll rebuild the test table into 1MB blocks and see what happens.
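For what it's worth, JD's 4x figure is just the ratio of the block sizes; with PE's roughly 1KB rows the rows-per-block arithmetic looks like this (the 1KB row size is an assumption):

```python
# Fewer, larger HFile block reads for a sequential scan.
# Assumption: ~1 KB per PerformanceEvaluation row.
row_bytes = 1000
for block_kb in (64, 256, 1024):
    rows_per_block = block_kb * 1024 // row_bytes
    print(f"{block_kb:>4} KB block: ~{rows_per_block} rows per block read")

# Going from 64KB to 256KB blocks means 4x fewer block reads for the
# same span of rows -- the "in theory" gain discussed above.
print(f"64KB -> 256KB: {256 // 64}x fewer block reads")
```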

  2. Has it been tested that channel bonding actually resulted in 2 gigabit throughput between computers in this setup? Netcat or iperf could be handy to verify that.
    The switch in the network might still be limiting overall bandwidth to 1 Gbps - it looks like in Figure 3, every time the throughput on one box goes above 120 MB/s there is a corresponding drop.

    1. I didn't use netcat or iperf, but instead set up httpd on each machine and then did a GET with curl to (repeatedly) fetch an 8GB file. With that I verified that I could get 2x gigabit up and down from each machine (and I even tried a 2nd switch just to make sure). But because there are only 3 machines and 2 links per machine, the way balance-alb works means I could only ever get 1 gigabit between any two machines (balance-alb presents a single slave MAC address to each other host on the network). I think that's why the graph has those suspicious-looking inversions.
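The balance-alb behaviour described above can be phrased as quick arithmetic: aggregate bandwidth scales with the number of slave links, but any single host-to-host conversation is pinned to one link (link counts here match this cluster's setup):

```python
# balance-alb hands each peer a single slave MAC, so any one host-to-host
# flow is capped at one physical link even though the aggregate can use both.
link_mbps = 1000                # one gigabit physical link
links = 2                       # bonded pair per server
aggregate = link_mbps * links   # reachable when talking to many peers at once
per_peer = link_mbps            # ceiling for any single regionserver pair
print(f"aggregate: {aggregate} Mb/s, per-peer: {per_peer} Mb/s")
```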