Tuesday, 14 May 2013

Migrating our hadoop cluster from CDH3 to CDH4

We've written a number of times on the initial setup, eventual upgrade and continued tuning of our hadoop cluster. Our latest project has been upgrading from CDH3u3 to CDH4.2.1. Upgrades are almost always disruptive, but we decided it was worth the hassle for a number of reasons:

  • general performance improvements in the entire Hadoop/HBase stack
  • continued support from the community/user list (a non-trivial concern - anybody asking questions on the user groups and mailing list about problems with older clusters are invariably asked to update before people are interested in tackling the problem)
  • multi-threaded compactions (the need for which we concluded in this post)
  • table-based region balancing (rather than just cluster-wide)
We had been managing our cluster primarily using Puppet, with all the knowledge of how the bits worked together firmly within our dev team. In an effort to make everyone's lives easier, reduce our bus factor, and get the server management back into the hands of our ops team, we've moved to CDH Manager to control our CDH installation. That's been going pretty well so far, but, we're getting ahead of ourselves...

The Process

We have 6 slave nodes that have a lot of disk capacity since we spec'd with a goal of lots of spindles which meant we got lots of space "for free". Rather than upgrading in place, we decided to start fresh with new master & zookeeper nodes, and we calculated that we'd have enough space to pull half the slaves into the new cluster without losing any data. We cleaned up all the tmp files and anything we deemed not worth saving from HBase and hdfs, and started the migration:

Reduce the replication factor

We reduced the replication factor to 2 on the 6 slave nodes to reduce the disk use:

hadoop fs -setrep -R 2 /

Decommission the 3 nodes to move

"Decommissioning" is the civilized and safe way to remove nodes from a cluster where there's risk that they contain the only copies of some data in the cluster (they'll block writes but accept reads until all blocks have finished replicating out). To do it add the names of the target machines to an "excludes" file (one per line) that your hdfs config needs to reference, and then refresh hdfs.

The block in hdfs-site.xml:


then run:

bin/hadoop dfsadmin -refreshNodes

and wait for the "under replicated blocks" count on the hdfs admin page to drop to 0 and the decommissioning nodes to move into state "Decommissioned".

Don't forget HBase

The hdfs datanodes are tidied up now but don't forget to cleanly shutdown the HBase regionservers - run:

./bin/graceful_stop.sh HOSTNAME

from within the HBase directory on the host you're shutting down (specifying the real name for HOSTNAME). It will shed its regions and shutdown when tidied up (more details here).

Now you can shutdown the tasktracker and datanode, and then the machine is ready to be wiped.

Build the new cluster

We wiped the 3 decommissioned slave nodes and installed the latest version of CentOS (our linux of choice, version 6.4 at time of writing). We also pulled 3 much lesser machines from our other cluster after decommissioning them in the same way, and installed CentOS 6.4 there, too. The 3 lesser machines would form our zookeeper ensemble and master nodes in the new cluster.

Enter CDH Manager

The folks at Cloudera have made a free version of their CDH Manager app available, and it makes managing a cluster much, much easier. After setting up the 6 machines that would form the basis of our new cluster with just the barebones OS, we were ready to start wielding the manager. We made a small VM to hold the manager app and installed it there. The manager instructions are pretty good, so I won't recreate them here. We had trouble with the key-based install so had to resort to setting identical passwords for root and allowing root ssh access for the duration of the install, but other than that it all went pretty smoothly. We installed in the following configuration (the master machines are the lesser ones described above, and the slaves the more powerful machines).

Machine and Role assignments
Machine Roles
master1 HDFS Primary NameNode, Zookeeper Member, HBase Master (secondary)
master2 HDFS Secondary NameNode, Zookeeper Member, HBase Master (primary)
master3 Hadoop JobTracker, Zookeeper Member, HBase Master (secondary)
slave1 HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave2 HDFS DataNode, Hadoop TaskTracker, HBase Regionserver
slave3 HDFS DataNode, Hadoop TaskTracker, HBase Regionserver

Copy the data

Now we had two running clusters - our old CDH3u3 cluster (with half its machines removed) and the new, empty CDH 4.2.1 cluster. The trick was how to get data from the old cluster into the new, with our primary concern being the data in HBase. The builtin facility for this sort of thing is called CopyTable, and sounds great, except that it doesn't work across major versions of HBase, so that was out. Next we looked at copying the HFiles directly from the old cluster to the new using the HDFS builtin command distcp. Because we could handle shutting down HBase on the old cluster for the duration of the copy this, in theory, should work - newer versions of HBase can read the older versions' HFiles and then write the new versions during compactions (and by shutting down we don't run the risk of missing updates from caches that haven't flushed, etc). And in spite of lots of warnings around the net that it wouldn't work, we tried it anyway. And it didn't work :) We managed to get the -ROOT- table up but it couldn't find .META. and that's where our patience ended. The next, and thankfully successful, attempt was using HBase export, distcp, and HBase import.

On the old cluster we ran:

bin/hadoop jar hbase-0.90.4-cdh3u3.jar export table_name /exports/table_name

for each of our tables, which produced a bunch of sequence files in the old cluster's HDFS. Those we copied over to the new cluster using HDFS's distcp command:

bin/hadoop distcp hftp://old-cluster-namenode:50070/exports/table_name hdfs://master1:8020/imports/table_name

which takes advantage of the builtin http-like interface (hftp) that HDFS provides, which makes the copy process version agnostic.

Finally on the new cluster we can import the copied sequence files into the new HBase:

bin/hadoop jar hbase-0.94.2-cdh4.2.1-security.jar import table_name /imports/table_name

Make sure the table exists before you import, and because the import is a mapreduce job that does Puts, it would also be wise to presplit any large tables at creation time so that you don't crush your new cluster with lots of hot regions and splitting. Also one known issue in this version of HBase is a performance regression from version 0.92 to 0.94 (detailed in HBASE-7868), which you can workaround by adding the following to your table definition:


e.g. create 'test_table', {NAME=>'cf', COMPRESSION=>'SNAPPY', VERSIONS=>1, DATA_BLOCK_ENCODING => 'FAST_DIFF'}

As per that linked issue, you should also enable short-circuit reads from the CDH Manager interface.

And to complete the copying process, run major compactions on all your tables to ensure the best data locality you can for your regionservers.

All systems go

After running checks on the copied data, and updating our software to talk to CDH4, we were happy that our new cluster was behaving as expected. To get back to our normal performance levels we then shutdown the remaining machines in the CDH3u3 cluster, wiped and installed the latest OS, and then told CDH Manager to install on them. A few minutes later we had all our M/R slots back, as well as our regionservers. We ran the HBase balancer to evenly spread out the regions, ran another major compaction on our tables to force data-locality, and we were back in business!


  1. ddi you have to stop hbase master/region on the old cluster during local export ?

    1. No, we didn't have to stop the old cluster - stopping is only needed if you want to use distcp. Doing an export from a running cluster is one of the backup strategies suggested by the HBase guys (http://hbase.apache.org/book/ops.backup.html)