Monday, 30 May 2011

Decoupling components

Recent blog posts have introduced some of the registry and portal processing work under development at GBIF.  Here I'd like to introduce some of the research underway to improve the overall processing workflows by identifying well-defined components and decoupling unnecessary dependencies.  The goal is to improve the robustness, reliability and throughput of the data indexing performed for the portal.

Key to the GBIF portal is the crawling, processing and indexing of the content shared through the GBIF network, which is currently performed by the Harvesting and Indexing Toolkit (HIT).  Today the HIT operates largely as follows:
  1. Synchronise with the registry to discover the technical endpoints
  2. Allow the administrator to schedule the harvesting and processing of an endpoint, as follows:
    1. Initiate a metadata request to discover the datasets at the endpoint
    2. For each resource initiate a request for the inventory of distinct scientific names
    3. Process the names into ranges 
    4. Harvest the records by name range
    5. Process the harvested responses into tab delimited files
    6. Synchronise the tab delimited files with the database "verbatim" tables
    7. Process the "verbatim" tables into interpreted tables
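
Steps 2 to 4 of the harvesting procedure (inventory, ranges, harvest by range) can be illustrated with a small sketch. This is not the HIT's actual code: the function name `names_to_ranges` and the `max_per_range` parameter are hypothetical, and it simply assumes that a range is an inclusive pair of lower and upper names taken from the sorted inventory.

```python
from typing import List, Tuple


def names_to_ranges(names: List[str], max_per_range: int) -> List[Tuple[str, str]]:
    """Group an inventory of scientific names into inclusive (lower, upper) ranges.

    Hypothetical illustration only: duplicates are dropped, names are sorted,
    and every range covers at most `max_per_range` distinct names.
    """
    if max_per_range < 1:
        raise ValueError("max_per_range must be at least 1")
    ordered = sorted(set(names))
    ranges = []
    for start in range(0, len(ordered), max_per_range):
        chunk = ordered[start:start + max_per_range]
        ranges.append((chunk[0], chunk[-1]))
    return ranges


# A made-up inventory response containing one duplicate name.
inventory = ["Puma concolor", "Abies alba", "Quercus robur", "Abies alba", "Bufo bufo"]
example_ranges = names_to_ranges(inventory, max_per_range=2)
```

Each resulting pair would then drive one harvest request in step 4. The sketch also shows why the procedure has several points of failure: the ranges cannot be computed until the inventory request has succeeded.
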
Logically, the HIT can be depicted as follows:

Some of the limitations in this model include:

  1. The tight coupling between the HIT and the target DB means we need to stop the harvesting whenever we perform very expensive processing on the database
  2. Changes to the user interface for the HIT require the harvester to be stopped
  3. The user interface console is driven by the same machine that is crawling, meaning the UI becomes unresponsive periodically.
  4. The tight coupling between the HIT and the target DB precludes the option of storing in multiple datastores (as is currently desired while we investigate enriching the occurrence store)

The HIT can be separated into the following distinct concerns:

  1. An administration console to allow the scheduling, oversight and diagnostics of crawlers
  2. Crawlers that harvest the content 
  3. Synchronisers that interpret and persist the content into the target datastores  

An event-driven architecture would allow this separation and overcome the current limitations.  In this model, components can be deployed independently, and message each other through a queue when significant events occur.  Subscribers to the queue determine what action, if any, to take on a per-message basis.  The architecture under research is shown:

In this depiction, the following sequence of events would occur:

  1. Through the Administration console, the administrator schedules the crawling of a resource.  
  2. The scheduler broadcasts to the queue that the resource is to be crawled rather than spawning a crawler directly.  
  3. When capacity allows, a crawler will act on this event and crawl the resource, storing to the filesystem as it goes.  On each response message, the crawler will broadcast that the response is to be handled.
  4. Synchronisers will act on the new response messages and store them in the occurrence target stores.  In the above depiction, there are actually two target stores, each with a synchroniser that would act on the message indicating there is new data to synchronise.
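
The sequence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `MessageQueue` is a hypothetical in-memory stand-in for a real message broker, the topic names `crawl.requested` and `response.received` are invented for the example, and the "crawl" simply pretends to receive two response pages.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class MessageQueue:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self._pending: Deque[Tuple[str, dict]] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Publishing only records the event; nothing is called directly.
        self._pending.append((topic, message))

    def drain(self) -> None:
        # Deliver pending messages in order until the queue is empty.
        while self._pending:
            topic, message = self._pending.popleft()
            for handler in self._subscribers[topic]:
                handler(message)


queue = MessageQueue()
store_a: List[Tuple[str, int]] = []  # stand-ins for the two target stores
store_b: List[Tuple[str, int]] = []


def crawler(message: dict) -> None:
    # Pretend the crawl returns two response pages and announce each one.
    for page in (1, 2):
        queue.publish("response.received", {"resource": message["resource"], "page": page})


def make_synchroniser(store: List[Tuple[str, int]]) -> Handler:
    def synchronise(message: dict) -> None:
        store.append((message["resource"], message["page"]))
    return synchronise


queue.subscribe("crawl.requested", crawler)
queue.subscribe("response.received", make_synchroniser(store_a))
queue.subscribe("response.received", make_synchroniser(store_b))

# The scheduler broadcasts the request instead of spawning a crawler itself.
queue.publish("crawl.requested", {"resource": "example-dataset"})
queue.drain()
```

Neither the scheduler nor the crawler knows how many synchronisers exist, so adding a further target store would only mean adding another subscriber.
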
This architecture would offer significant improvements over the existing setup.  The crawlers would only ever need to stop when bugs in the crawlers themselves are being fixed.  Different target stores can be researched independently of the crawling codebase.  The user interface for scheduling can be developed and redeployed without interrupting the crawling.

As an aside, during this exercise we are also investigating improvements in the following:
  1. The HIT (today) performs the metadata request, but does NOT update the registry with the datasets that are discovered, only the data portal.  The GBIF registry is "dataset aware" for the datasets served through the Integrated Publishing Toolkit, and ultimately we intend the registry to be able to reconcile the multiple identifiers associated with a dataset.  For example, it should be possible in the future to synchronise with the likes of the Biodiversity Collections Index, which is a dataset-level registry.
  2. The harvesting procedure is rather complex, with many points of failure; it involves inventories of scientific names, processing into ranges of names and a harvest based on the name ranges.  Early tests suggest that a simpler approach of discrete name ranges [Aaa-Aaz, Aba-Abz ... Zza-Zzz] yields better results.
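
As a rough sketch of the discrete name ranges mentioned in the second point, the fixed list can be generated without any inventory request. The function name `discrete_name_ranges` is hypothetical, and the sketch assumes that names start with an ASCII capital followed by an ASCII lowercase letter; how a provider interprets the upper bound of each range is not covered here.

```python
from string import ascii_lowercase, ascii_uppercase
from typing import Iterator, Tuple


def discrete_name_ranges() -> Iterator[Tuple[str, str]]:
    """Yield fixed (lower, upper) name ranges: Aaa-Aaz, Aba-Abz ... Zza-Zzz."""
    for first in ascii_uppercase:
        for second in ascii_lowercase:
            prefix = first + second
            yield (prefix + "a", prefix + "z")


fixed_ranges = list(discrete_name_ranges())
```

This gives 26 x 26 = 676 ranges that are identical for every resource, so the harvest no longer depends on the inventory and range-building steps succeeding first.
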
Watch this space for results of this investigation...
