Tuesday 10 May 2011

Reworking the HIT, after reworking the Portal processing

If GBIF reworks the Portal processing, what will the knock-on effect be on the Harvesting and Indexing Toolkit (HIT)? This blog post talks a little about the future of the HIT, and very little about the new Portal processing (saved for later posts).

To provide some background, the HIT has three major responsibilities:
  1. harvesting specimen and occurrence data from data publishers,
  2. writing that data in its raw form to the database, and
  3. transforming raw data into its processed form by running quality assurance routines (such as date and terrestrial point validation) and tying it to the backbone "nub" taxonomy.
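
To give a feel for the kind of quality assurance routine meant in step 3, here is a minimal sketch in Python. It is not the HIT's actual code: the record layout (`latitude`, `longitude`, `date_collected`) and the issue labels are hypothetical, and it only checks that coordinates are in range and that the date is not in the future. A real terrestrial point validation would also need a land/sea lookup, which is left out here.

```python
from datetime import date


def validate_record(record, today=None):
    """Return a list of issue labels for one raw record (hypothetical schema)."""
    today = today or date.today()
    issues = []

    # Coordinate check: only verifies that the point is on the globe at all.
    lat, lon = record.get("latitude"), record.get("longitude")
    if lat is None or lon is None:
        issues.append("missing_coordinates")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        issues.append("coordinates_out_of_range")

    # Date check: a collection date cannot lie in the future.
    collected = record.get("date_collected")
    if collected is None:
        issues.append("missing_date")
    elif collected > today:
        issues.append("date_in_future")

    return issues
```

A record that passes both checks returns an empty list, so the processed form can simply carry the list of issues alongside the raw values.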

Once it is complete, the new Portal processing will take over step 3. Data will be extracted from the MySQL database into HBase (using Sqoop), where quality assurance routines can be run much more quickly. Running step 3 outside of the MySQL database also ends the competition between steps 2 and 3, in which step 3 constantly locks the raw data table in order to run its routines. The HIT will then be able to write raw data to the database uninterrupted.

Lately the HIT has been struggling to process large datasets. For example, a dataset with 12 million records, processed 10,000 records at a time, would lock the raw table for 10 minutes while scanning through the more than 280 million raw records in order to generate its record set. No raw data can be written during that time, which brings the massively parallel application to its knees. Perhaps now you can understand why the rework of the Portal processing is so urgently needed.
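
To put rough numbers on that example, here is a back-of-the-envelope calculation in Python. It assumes the 10-minute lock is paid once for every batch of 10,000 records, which is one reading of the figures above and not a measured result; the function name is made up for illustration.

```python
import math


def total_lock_minutes(dataset_records, batch_size, lock_minutes_per_batch):
    """Return (number of batches, total minutes the raw table is locked).

    Assumes every batch triggers one full lock of the raw table.
    """
    batches = math.ceil(dataset_records / batch_size)
    return batches, batches * lock_minutes_per_batch


batches, minutes = total_lock_minutes(12_000_000, 10_000, 10)
print(batches, minutes, minutes / 60)  # 1200 12000 200.0
```

Under that assumption, one 12 million record dataset means 1,200 batches and about 200 hours in which nothing else can write to the raw table.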

For the few adopters of the HIT who will still require the application with its current functionality, please rest assured that the project will maintain a separate, trimmed-down version when the time comes to adapt it. It will always remain an open-source application that anyone in the community can customize for their own needs.
