Tuesday 4 March 2014

Lots of columns with Hive and HBase

We're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!

Or so we thought.

Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. The way our downloads work is as an Oozie workflow that executes a Hive query of an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase backed Hive table takes something like 4-5x as long. But to generated that HDFS table we need to start from a Hive table that _is_ backed by HBase.

Here's an example of how to write a Hive table definition for an HBase-backed table:

CREATE EXTERNAL TABLE tiny_hive_example (
  key INT,
  kingdom STRING,
  kingdomkey INT
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")
  "hbase.table.name" = "tiny_hbase_table",
  "hbase.table.default.storage.type" = "binary"

But now that we have something like 600 columns to map to HBase, and that we've chosen to name our HBase columns just like the DwC Terms they represent (e.g. the basis of record term's column name is basisOfRecord) we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000 character limit of Hive. For our Hive metastore we use PostgreSQL and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.

The solution:

alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;

We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (value is turned into a String: the source). Our new super-wide downloads table works, and will be in production soon!


  1. I was reading the Cloudera blog which included a nice architectural picture of how you've been using Hadoop, Hbase, Oozie, etc... Would you please create a new blog post with an update of that image (now three years or more later)?

  2. Do you happen to auto-generate the Hive table schema somehow? I'm looking to auto-generate Hive table schemas based on the actual key:value data in one or more HBase column families.

    (I'm not sure my comment posted, as the form reset and didn't post anything to the comment stream, so forgive what may seem a double-post)

  3. Thanks a lot, I struggled for 3 days , figuring out why I was not able to map my HBase to Hive with 279 columns, Finally your solution worked like a charm

  4. Thanks a lot for this, I struggled for two entire days trying to figure out why my table definition would get corrupt as soon as i create them with 549 columns, this article finally helped me understand what was going one.