Monday 18 April 2011

Reworking the Portal processing

The GBIF Data Portal has provided a gateway to discover and access the content shared through the GBIF network for some years, without major change.  As the amount of data has grown, GBIF have scaled vertically (i.e. scaled up) to maintain performance levels; this is becoming unmanageable with the current processing routines because of the number of SQL statements issued against the database.  As GBIF content grows, the indexing infrastructure must change to scale out accordingly.

I have been monitoring and evaluating alternative technologies for some time, and a few months ago GBIF initiated the redevelopment of the processing routines.  This current area of work does not increase the functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:
  • Reduce the latency between a record changing on the publisher side, and being reflected in the index
  • Reduce the amount of (wo)man-hours needed to coax through a successful processing run
  • Improve the quality assurance of the processing
  • Rework all the date and time handling
  • Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record (a small sketch follows this list)
  • Integrate checklists (taxonomic, nomenclatural and thematic) shared through the GBIF ECAT Programme to improve the taxonomic services, and the backbone ("nub") taxonomy.
  • Provide a robust framework for future development
  • Allow the infrastructure to grow predictably with content and demand growth
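
To give a flavour of the vocabulary-based interpretation, here is a minimal sketch in Java of how a verbatim Basis of Record value might be normalised against a small dictionary; the terms and synonyms below are illustrative assumptions only, not the actual GBIF vocabulary.

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    // Illustrative sketch only: interprets free-text "basis of record" values
    // against a small controlled vocabulary. The terms and synonyms below are
    // assumptions for demonstration, not the actual GBIF dictionary.
    public class BasisOfRecordInterpreter {

      public enum BasisOfRecord { OBSERVATION, PRESERVED_SPECIMEN, FOSSIL_SPECIMEN, LIVING_SPECIMEN, UNKNOWN }

      private static final Map<String, BasisOfRecord> VOCABULARY = new HashMap<String, BasisOfRecord>();
      static {
        VOCABULARY.put("observation", BasisOfRecord.OBSERVATION);
        VOCABULARY.put("humanobservation", BasisOfRecord.OBSERVATION);
        VOCABULARY.put("specimen", BasisOfRecord.PRESERVED_SPECIMEN);
        VOCABULARY.put("preservedspecimen", BasisOfRecord.PRESERVED_SPECIMEN);
        VOCABULARY.put("fossil", BasisOfRecord.FOSSIL_SPECIMEN);
        VOCABULARY.put("livingspecimen", BasisOfRecord.LIVING_SPECIMEN);
      }

      // Normalise the verbatim value (lower case, strip non-letters) and look it up
      public static BasisOfRecord interpret(String verbatim) {
        if (verbatim == null) {
          return BasisOfRecord.UNKNOWN;
        }
        String normalised = verbatim.toLowerCase(Locale.ENGLISH).replaceAll("[^a-z]", "");
        BasisOfRecord result = VOCABULARY.get(normalised);
        return result == null ? BasisOfRecord.UNKNOWN : result;
      }

      public static void main(String[] args) {
        System.out.println(interpret("Preserved Specimen")); // PRESERVED_SPECIMEN
        System.out.println(interpret("human observation"));  // OBSERVATION
        System.out.println(interpret("whatever"));           // UNKNOWN
      }
    }
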
Things have progressed significantly since my early investigations, and GBIF are developing using the following technologies:
    • Apache Hadoop: A distributed file system and cluster processing using the MapReduce framework
    • Sqoop: A utility to synchronize data between relational databases and Hadoop 
    • Hive: A data warehouse infrastructure built on top of Hadoop, developed and open-sourced by Facebook.  Hive gives SQL capabilities on Hadoop; a small example follows after this list.  [Full table scans on GBIF occurrence records reduce from hours to minutes]
    • Oozie: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed then open-sourced by Yahoo!
    [GBIF are researching the use of HBase, the Hadoop database, to increase the richness of the indexed content; this will be the subject of future blog posts.  See the project site]
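
    As a flavour of what Hive offers over the indexed occurrence data, the following sketch runs a HiveQL aggregation through the Hive JDBC driver; the host, database, table and column names are placeholders for this example rather than the actual GBIF schema.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        // Illustrative sketch: a HiveQL aggregation over an occurrence table via
        // the Hive JDBC driver. The host, table and column names are placeholder
        // assumptions for this example, not the real GBIF schema.
        public class HiveOccurrenceScan {

          public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // A full-table aggregation that Hive compiles down to MapReduce jobs
            ResultSet res = stmt.executeQuery(
                "SELECT basis_of_record, count(1) FROM occurrence_record GROUP BY basis_of_record");

            while (res.next()) {
              System.out.println(res.getString(1) + "\t" + res.getString(2));
            }
            con.close();
          }
        }
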

    The processing workflow looks like the following:

    The Oozie workflow is still being developed, but the workflow definition can be found here.
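
    For those curious how such a workflow is launched, the sketch below uses the standard Oozie Java client to submit and monitor a workflow job; the Oozie server URL, HDFS application path and cluster addresses are placeholders rather than GBIF's actual deployment.

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;
        import org.apache.oozie.client.WorkflowJob;

        // Illustrative sketch: submitting a workflow (such as the processing
        // workflow described above) through the Oozie Java client. The server
        // URL, HDFS application path and cluster addresses are placeholders.
        public class SubmitProcessingWorkflow {

          public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://localhost:11000/oozie");

            // Job configuration: where workflow.xml lives and the parameters it expects
            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/workflows/occurrence-processing");
            conf.setProperty("jobTracker", "jobtracker:8021");
            conf.setProperty("nameNode", "hdfs://namenode:8020");

            // Submit and start the workflow, then poll until it is no longer running
            String jobId = client.run(conf);
            while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
              System.out.println("Workflow " + jobId + " still running...");
              Thread.sleep(10 * 1000);
            }
            System.out.println("Workflow finished with status: " + client.getJobInfo(jobId).getStatus());
          }
        }
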

    3 comments:

    1. Hi Tim, thanks for this post!

      Is the HIT currently in use in Copenhagen? If so, I imagine it is at the "first step" of your diagram? If not, do you plan to use it in the future?

      thanks again,

      Nico

    2. Hi Nico.

      The HIT (http://code.google.com/p/gbif-indexingtoolkit/) has been in use by GBIF for some years, and more recently in Australia to help build the Atlas of Living Australia (http://www.ala.org.au/). It is used to harvest and process the occurrence data only, as the names are all in the Darwin Core Archive format and trivial to harvest. It is only the more complex XML protocols (DiGIR, BioCASe and TAPIR) that need special harvesting strategies.
      Be aware that as we develop, so must the HIT. We are targeting the throughput required for global network crawling (i.e. thousands of records per second, consistently), and therefore might make implementation decisions that don't apply to those harvesting at a small scale. The HIT today is useful to help kickstart the development of small-scale harvesting; after all, we use it today. Hope this helps.

    3. Good afternoon,
      I work with Hive on HDFS to optimize GBIF requests for users; this is the goal of my Master's work. Can you help me so that we can work together on this project?
