Developer Blog: Lots of columns with Hive and HBase

Tuesday, 4 March 2014

Lots of columns with Hive and HBase

We're in the process of rolling out a long awaited feature here at GBIF, namely the indexing of more fields from Darwin Core. Until the launch of our now HBase-backed occurrence store (in the fall of 2013) we couldn't index more than about 30 or so terms from Darwin Core because we were limited by our MySQL schema. Now that we have HBase we can add as many columns as we like!

Or so we thought.

Our occurrence download service gets a lot of use and naturally we want downloaders to have access to all of the newly indexed fields. The way our downloads work is as an Oozie workflow that executes a Hive query of an HDFS table (more details in this Cloudera blog). We use an HDFS table to significantly speed up the scan speed of the query - using an HBase backed Hive table takes something like 4-5x as long. But to generated that HDFS table we need to start from a Hive table that _is_ backed by HBase.

Here's an example of how to write a Hive table definition for an HBase-backed table:

CREATE EXTERNAL TABLE tiny_hive_example (
key INT,
kingdom STRING,
kingdomkey INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#b,o:kingdom#s,o:kingdomKey#b")
TBLPROPERTIES(
"hbase.table.name" = "tiny_hbase_table",
"hbase.table.default.storage.type" = "binary"
);

But now that we have something like 600 columns to map to HBase, and that we've chosen to name our HBase columns just like the DwC Terms they represent (e.g. the basis of record term's column name is basisOfRecord) we have a very long "SERDEPROPERTIES" string in our Hive table definition. How long? Well, way more than the 4000 character limit of Hive. For our Hive metastore we use PostgreSQL and when Hive creates the SERDE_PARAMS table it gives the PARAM_VALUE column a datatype of VARCHAR(4000). Because 4k should be enough for anyone, right? Sigh.

The solution:

alter table "SERDE_PARAMS" alter column "PARAM_VALUE" type text;

We did lots of testing to make sure the existing definitions didn't get nuked by this change, and can confirm that the Hive code is not checking that 4000 value either (value is turned into a String: the source). Our new super-wide downloads table works, and will be in production soon!

4 comments:

Mitch Thomas3 October 2014 at 01:32
I was reading the Cloudera blog which included a nice architectural picture of how you've been using Hadoop, Hbase, Oozie, etc... Would you please create a new blog post with an update of that image (now three years or more later)?
ReplyDelete
Replies
Unknown30 October 2014 at 01:33
Do you happen to auto-generate the Hive table schema somehow? I'm looking to auto-generate Hive table schemas based on the actual key:value data in one or more HBase column families.

(I'm not sure my comment posted, as the form reset and didn't post anything to the comment stream, so forgive what may seem a double-post)
ReplyDelete
Replies
Neha Jirafe17 August 2015 at 10:18
Thanks a lot, I struggled for 3 days , figuring out why I was not able to map my HBase to Hive with 279 columns, Finally your solution worked like a charm

ReplyDelete
Replies
dev4 September 2015 at 18:11
Thanks a lot for this, I struggled for two entire days trying to figure out why my table definition would get corrupt as soon as i create them with 549 columns, this article finally helped me understand what was going one.
ReplyDelete
Replies

Add comment

Tuesday, 4 March 2014

Lots of columns with Hive and HBase

4 comments:

About this blog

Twitter

Contributors

Our favourite blogs

Blog Archive

Tags

Followers

Tuesday, 4 March 2014

Lots of columns with Hive and HBase

4 comments:

About this blog

Twitter

Contributors

Our favourite blogs

Blog Archive

Subscribe To

Tags

Followers