Friday 6 May 2011

Improving Hive join performance using Oozie

In the portal processing we are making use of Apache Hive to provide SQL capabilities and Yahoo!'s Oozie to provide a workflow engine.  In this blog I explain how we are making use of forks to improve the join performance of Hive, by further parallelizing the join beyond what Hive provides natively.
Please note that this was adopted using Hive version 0.5 but in Hive 0.7 there are significant improvements to joins
For the purposes of this explanation, let's consider the following simple example, where a table of verbatim values is being processed into four tables in a star schema:
To generate the leaves of the star, we have three simple queries (making use of a simple UDF to produce the increment IDs):

CREATE TABLE institution_code AS
SELECT rowSequence(), institution_code
FROM verbatim_record
GROUP BY institution_code;

CREATE TABLE collection_code AS
SELECT rowSequence(), collection_code
FROM verbatim_record
GROUP BY collection_code;

CREATE TABLE catalogue_number AS
SELECT rowSequence(), catalogue_number
FROM verbatim_record
GROUP BY catalogue_number;

To build the core of the star the simple approach is to issue the following SQL:

CREATE TABLE parsed_content AS
SELECT AS id, AS institution_code_id, AS collection_code_id, AS catalogue_number_id
FROM verbatim_record v 
  JOIN institution_code ic ON v.institution_code=ic.institution_code
  JOIN collection_code cc ON v.collection_code=cc.collection_code
  JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;

What is important to note is that the JOIN is across 3 different values, and this results in a query plan with three sequential MR jobs, a very large intermediate result set, which is ultimately passed through the final Reduce in the Hive planning.

By using Oozie (see the bottom of this post for pseudo workflow config), we are able to produce three temporary join tables, in a parallel fork, and then do a single join to bring it all back together.

# parallel join 1
SELECT AS id, AS institution_code_id 
FROM verbatim_record v JOIN institution_code ic ON v.institution_code=ic.institution_code;

# parallel join 2
SELECT AS id, AS collection_code_id 
FROM verbatim_record v JOIN collection_code cc ON v.collection_code=cc.collection_code

# parallel join 3
SELECT AS id, AS institution_code_id 
FROM verbatim_record v JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;

CREATE TABLE parsed_content AS
SELECT AS id, t1.institution_code_id
t2.collection_code_id, t3.catalogue_number_id
FROM verbatim_record v 
  JOIN t1 ic ON
  JOIN t2 cc ON
  JOIN t3 cn ON;

Because we have built the join tables in parallel, and join on the foreign key only, Hive compiles to a single MR job, and runs much quicker.

In reality our tables are far more complex, and we use a Map side JOIN for the institution_code since it is small, but for our small cluster and the following table sizes we saw a reduction from several hours to 40 minutes to compute these tables.
  • verbatim_record: 284 million
  • collection_code: 1.5 million
  • catalogue_number: 199 million
  • institution_code: 8 thousand
All of this work can be found here

Pseudo workflow config for this:










No comments:

Post a Comment