Monday 18 April 2011

Reworking the Portal processing

The GBIF Data Portal has provided a gateway to discover and access the content shared through the GBIF network for some years, without major change.  As the amount of data has grown, GBIF have scaled vertically (i.e. scaled up) to maintain performance levels; this is becoming unmanageable with the current processing routines because of the number of SQL statements issued against the database.  As GBIF content grows, the indexing infrastructure must change to scale out accordingly.

I have been monitoring and evaluating alternative technologies for some time, and a few months ago GBIF initiated the redevelopment of the processing routines.  This current area of work does not increase the functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:
  • Reduce the latency between a record changing on the publisher side, and being reflected in the index
  • Reduce the amount of (wo)man-hours needed to coax through a successful processing run
  • Improve the quality assurance of the processing
  • Rework all the date and time handling
  • Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record (a small sketch follows this list)
  • Integrate checklists (taxonomic, nomenclatural and thematic) shared through the GBIF ECAT Programme to improve the taxonomic services, and the backbone ("nub") taxonomy.
  • Provide a robust framework for future development
  • Allow the infrastructure to grow predictably with content and demand growth
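
To give a flavour of the vocabulary-based interpretation, here is a minimal sketch in Java of how a verbatim Basis of Record value might be normalised against a small dictionary; the terms and synonyms below are illustrative assumptions only, not the actual GBIF vocabulary.

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    // Illustrative sketch only: interprets free-text "basis of record" values
    // against a small controlled vocabulary. The terms and synonyms below are
    // assumptions for demonstration, not the actual GBIF dictionary.
    public class BasisOfRecordInterpreter {

      public enum BasisOfRecord { OBSERVATION, PRESERVED_SPECIMEN, FOSSIL_SPECIMEN, LIVING_SPECIMEN, UNKNOWN }

      private static final Map<String, BasisOfRecord> VOCABULARY = new HashMap<String, BasisOfRecord>();
      static {
        VOCABULARY.put("observation", BasisOfRecord.OBSERVATION);
        VOCABULARY.put("humanobservation", BasisOfRecord.OBSERVATION);
        VOCABULARY.put("specimen", BasisOfRecord.PRESERVED_SPECIMEN);
        VOCABULARY.put("preservedspecimen", BasisOfRecord.PRESERVED_SPECIMEN);
        VOCABULARY.put("fossil", BasisOfRecord.FOSSIL_SPECIMEN);
        VOCABULARY.put("livingspecimen", BasisOfRecord.LIVING_SPECIMEN);
      }

      // Normalise the verbatim value (lower case, strip non-letters) and look it up
      public static BasisOfRecord interpret(String verbatim) {
        if (verbatim == null) {
          return BasisOfRecord.UNKNOWN;
        }
        String normalised = verbatim.toLowerCase(Locale.ENGLISH).replaceAll("[^a-z]", "");
        BasisOfRecord result = VOCABULARY.get(normalised);
        return result == null ? BasisOfRecord.UNKNOWN : result;
      }

      public static void main(String[] args) {
        System.out.println(interpret("Preserved Specimen")); // PRESERVED_SPECIMEN
        System.out.println(interpret("human observation"));  // OBSERVATION
        System.out.println(interpret("whatever"));           // UNKNOWN
      }
    }
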
Things have progressed significantly since my early investigations, and GBIF are developing using the following technologies:
    • Apache Hadoop: A distributed file system and cluster processing using the MapReduce framework
    • Sqoop: A utility to synchronize data between relational databases and Hadoop 
    • Hive: A data warehouse infrastructure built on top of Hadoop, developed and open-sourced by Facebook.  Hive gives SQL capabilities on Hadoop; a small example follows after this list.  [Full table scans on GBIF occurrence records reduce from hours to minutes]
    • Oozie: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed then open-sourced by Yahoo!
    [GBIF are researching the use of HBase, the Hadoop database, to increase the richness of the indexed content; this will be the subject of future blog posts.  See the project site]
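
    As a flavour of what Hive offers over the indexed occurrence data, the following sketch runs a HiveQL aggregation through the Hive JDBC driver; the host, database, table and column names are placeholders for this example rather than the actual GBIF schema.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        // Illustrative sketch: a HiveQL aggregation over an occurrence table via
        // the Hive JDBC driver. The host, table and column names are placeholder
        // assumptions for this example, not the real GBIF schema.
        public class HiveOccurrenceScan {

          public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // A full-table aggregation that Hive compiles down to MapReduce jobs
            ResultSet res = stmt.executeQuery(
                "SELECT basis_of_record, count(1) FROM occurrence_record GROUP BY basis_of_record");

            while (res.next()) {
              System.out.println(res.getString(1) + "\t" + res.getString(2));
            }
            con.close();
          }
        }
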

    The processing workflow looks like the following:

    The Oozie workflow is still being developed, but the workflow definition can be found here.
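
    For those curious how such a workflow is launched, the sketch below uses the standard Oozie Java client to submit and monitor a workflow job; the Oozie server URL, HDFS application path and cluster addresses are placeholders rather than GBIF's actual deployment.

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;
        import org.apache.oozie.client.WorkflowJob;

        // Illustrative sketch: submitting a workflow (such as the processing
        // workflow described above) through the Oozie Java client. The server
        // URL, HDFS application path and cluster addresses are placeholders.
        public class SubmitProcessingWorkflow {

          public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            // Job configuration: where workflow.xml lives and the parameters it expects
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/workflows/occurrence-processing");
            conf.setProperty("jobTracker", "jobtracker:8021");
            conf.setProperty("nameNode", "hdfs://namenode:8020");

            // Submit and start the workflow, then poll until it is no longer running
            String jobId = client.run(conf);
            while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
              System.out.println("Workflow " + jobId + " still running...");
              Thread.sleep(10 * 1000);
            }
            System.out.println("Workflow finished with status: " + client.getJobInfo(jobId).getStatus());
          }
        }
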

    3 comments:

    1. Hi Tim, thanks for this post!

      Is the HIT currently in use in Copenhagen? If so, I imagine it is at the "first step" of your diagram? If not, do you plan to use it in the future?

      thanks again,

      Nico

    2. Hi Nico.

      The HIT (http://code.google.com/p/gbif-indexingtoolkit/) has been in use by GBIF for some years, and more recently in Australia to help build the Atlas of Living Australia (http://www.ala.org.au/). It is used to harvest and process the occurrence data only, as the names are all in the Darwin Core Archive format and trivial to harvest. It is only the more complex XML protocols (DiGIR, BioCASe and TAPIR) that need special harvesting strategies.
      Be aware that as we develop, so must the HIT. We are targeting the throughput required for global network crawling (i.e. thousands of records per second, consistently), and therefore might make implementation decisions that don't apply to those harvesting at a small scale. The HIT today is useful to help kickstart the development of small-scale harvesting; after all, we use it today. Hope this helps.

    3. Good afternoon,
      I work with Hive on HDFS to optimize GBIF requests for users; this is the goal of my Master's work. Can you help me so that we can work together on this project?
