I have been monitoring and evaluating alternative technologies for some time, and a few months ago GBIF initiated the redevelopment of the processing routines. This current area of work does not increase the functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:
- Reduce the latency between a record changing on the publisher side, and being reflected in the index
- Reduce the number of (wo)man-hours needed to coax a processing run through to success
- Improve the quality assurance by inclusion of:
  - Checking that terrestrial point locations fall within the stated country using shapefiles (a sketch of such a check follows this list)
  - Checking coastal waters using Exclusive Economic Zones
- Rework all the date and time handling
- Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record
- Integrate checklists (taxonomic, nomenclatural and thematic) shared through the GBIF ECAT Programme to improve the taxonomic services, and the backbone ("nub") taxonomy.
- Provide a robust framework for future development
- Allow the infrastructure to grow predictably with content and demand growth
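To make the geographic checks above more concrete, here is a minimal sketch of a point-in-country test. It is not the actual GBIF implementation: it assumes the country (or EEZ) polygons have already been loaded from shapefiles into JTS geometries keyed by ISO country code, the class and field names (CountryCheck, countryPolygons) are purely illustrative, and the package names are those of current JTS releases.

```java
import org.locationtech.jts.geom.Coordinate;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;
import org.locationtech.jts.geom.Point;

import java.util.Map;

/**
 * Minimal sketch of a point-in-country quality check. Assumes the country
 * (or EEZ) polygons have already been read from shapefiles into JTS
 * geometries, keyed by ISO country code. Illustrative only.
 */
public class CountryCheck {

  private static final GeometryFactory GF = new GeometryFactory();

  // hypothetical lookup: ISO country code -> country polygon
  private final Map<String, Geometry> countryPolygons;

  public CountryCheck(Map<String, Geometry> countryPolygons) {
    this.countryPolygons = countryPolygons;
  }

  /** Returns true if the coordinate falls within the polygon of the stated country. */
  public boolean matchesStatedCountry(String isoCountryCode, double latitude, double longitude) {
    Geometry country = countryPolygons.get(isoCountryCode);
    if (country == null) {
      return false; // unknown country code: flag the record for review
    }
    Point location = GF.createPoint(new Coordinate(longitude, latitude)); // x = lon, y = lat
    return country.covers(location); // covers() also accepts points on the boundary
  }
}
```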
The key technologies selected for the new processing are:
- Apache Hadoop: A distributed file system, and cluster processing using the MapReduce framework
  - GBIF are using the Cloudera distribution of Hadoop
- Sqoop: A utility to synchronize between relational databases and Hadoop
- Hive: A data warehouse infrastructure built on top of Hadoop, developed and open-sourced by Facebook. Hive gives SQL capabilities on Hadoop. [Full table scans on GBIF occurrence records reduce from hours to minutes; a small query sketch follows this list]
- Oozie: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed then open-sourced by Yahoo!
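As a flavour of what Hive enables, the sketch below runs a full-scan aggregation over an occurrence table through the Hive JDBC driver. The connection URL, driver class, and the occurrence_record table and column names are assumptions for illustration; they are not the GBIF schema and will differ by deployment and Hive version.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Sketch of an ad-hoc analysis over an occurrence store using the Hive JDBC
 * driver. The URL, driver class and table/column names are illustrative only.
 */
public class OccurrenceCountsByCountry {

  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // A full table scan expressed as plain SQL; Hive compiles it to MapReduce jobs
         ResultSet rs = stmt.executeQuery(
             "SELECT country_code, COUNT(*) AS records "
                 + "FROM occurrence_record GROUP BY country_code")) {
      while (rs.next()) {
        System.out.println(rs.getString("country_code") + "\t" + rs.getLong("records"));
      }
    }
  }
}
```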
[GBIF are researching HBase, the Hadoop database, to allow an increase in the richness of the indexed content; this will be the subject of future blog posts. See the project site]
The processing workflow looks like the following (click for full size):
The Oozie workflow is still being developed, but the workflow definition can be found here.
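For those unfamiliar with Oozie, the sketch below shows how a workflow can be submitted and monitored from Java using the standard OozieClient API. The Oozie server URL, HDFS application path and cluster addresses are placeholders, not the GBIF configuration.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

/**
 * Sketch of submitting a workflow to Oozie using the standard OozieClient API.
 * The Oozie URL, application path and cluster addresses are placeholders.
 */
public class SubmitProcessingWorkflow {

  public static void main(String[] args) throws Exception {
    OozieClient client = new OozieClient("http://oozie-server:11000/oozie");

    // Properties passed to the workflow; APP_PATH points at the workflow.xml in HDFS
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/workflows/occurrence-processing");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "jobtracker:8021");

    String jobId = client.run(conf); // submit and start the workflow
    System.out.println("Started workflow " + jobId);

    // Poll until the workflow leaves the RUNNING state
    while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10000);
    }
    System.out.println("Finished with status " + client.getJobInfo(jobId).getStatus());
  }
}
```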
Hi Tim, thanks for this post!
Is the HIT currently in use in Copenhagen? If so, I imagine it is at the "first step" of your diagram? If not, do you plan to use it in the future?
thanks again,
Nico
Hi Nico.
The HIT (http://code.google.com/p/gbif-indexingtoolkit/) has been in use by GBIF for some years, and more recently in Australia to help build the Atlas of Living Australia (http://www.ala.org.au/). It is used to harvest and process the occurrence data only, as the names are all in the Darwin Core Archive format and trivial to harvest. It is only the more complex XML protocols (DiGIR, BioCASe and TAPIR) that need special harvesting strategies.
Be aware that as we develop, so must the HIT. We are targeting the throughput required for global network crawling (e.g. thousands of records per second, sustained), and therefore might make implementation decisions that don't apply to those harvesting at a small scale. The HIT today is useful to help kickstart the development of a small-scale harvester; after all, we use it ourselves today. Hope this helps.
Good afternoon
I work with Hive on HDFS to optimize GBIF queries for users; this is the goal of my Master's work. Could you help me, so that we can work together on this project?