Key to the GBIF portal is the crawling, processing and indexing of the content shared through the GBIF network, which is currently performed by the Harvesting and Indexing Toolkit (HIT). Today the HIT operates largely as follows:
- Synchronise with the registry to discover the technical endpoints
- Allow the administrator to schedule the harvesting and processing of an endpoint, as follows:
  - Initiate a metadata request to discover the datasets at the endpoint
  - For each resource, initiate a request for the inventory of distinct scientific names
  - Process the names into ranges
  - Harvest the records by name range (a sketch of this loop follows the list)
  - Process the harvested responses into tab-delimited files
  - Synchronise the tab-delimited files with the database "verbatim" tables
  - Process the "verbatim" tables into interpreted tables
Some of the limitations in this model include:
- The tight coupling between the HIT and the target DB means that harvesting has to be stopped whenever very expensive processing is performed on the database
- Changes to the user interface for the HIT require the harvester to be stopped
- The user interface console is served by the same machine that performs the crawling, meaning the UI periodically becomes unresponsive
- The tight coupling between the HIT and the target DB precludes the option of storing the content in multiple datastores (which is currently desired as we investigate enriching the occurrence store)
The HIT can be separated into the following distinct concerns (sketched as interfaces after this list):
- An administration console to allow the scheduling, oversight and diagnostics of crawlers
- Crawlers that harvest the content
- Synchronisers that interpret and persist the content into the target datastores
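As an illustration of that separation, a minimal sketch using hypothetical Java interfaces follows; the names and method signatures are illustrative only, not an existing API.

```java
import java.util.UUID;

/** Backs the administration console: scheduling, oversight and diagnostics. */
interface CrawlScheduler {
  void scheduleCrawl(UUID resourceKey);
  CrawlStatus status(UUID resourceKey);
}

/** Harvests the content of a single resource, handing raw responses onwards. */
interface Crawler {
  void crawl(UUID resourceKey, ResponseHandler handler);
}

/** Interprets raw responses and persists them into one target datastore. */
interface Synchroniser {
  void synchronise(RawResponse response);
}

interface ResponseHandler {
  void onResponse(RawResponse response);
}

/** Minimal value types to keep the sketch self-contained. */
record RawResponse(UUID resourceKey, byte[] body) {}

enum CrawlStatus { SCHEDULED, RUNNING, FINISHED, FAILED }
```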
An event-driven architecture would allow this separation and overcome the current limitations. In this model, components can be deployed independently and message each other through a queue when significant events occur. Subscribers to the queue determine what action, if any, to take on a per-message basis. The architecture under research is shown:
In this depiction, the following sequence of events would occur:
- Through the Administration console, the administrator schedules the crawling of a resource.
- The scheduler broadcasts to the queue that the resource is to be crawled, rather than spawning a crawler directly.
- When capacity allows, a crawler will act on this event and crawl the resource, storing the responses to the filesystem as it goes. For each response, the crawler will broadcast that the response is to be handled.
- Synchronisers will act on the new response messages and store them in the target occurrence stores. In the above depiction there are actually two target stores, each of which would act on the message indicating there is new data to synchronise (see the sketch following this list).
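A minimal sketch of this event flow follows, assuming a RabbitMQ-style broker purely for illustration; the broker choice, queue name and message payload are assumptions, not decisions recorded here. Publisher and subscriber share one process only to keep the sketch self-contained; in the architecture described they would be independently deployed components.

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CrawlEventSketch {

  private static final String CRAWL_QUEUE = "crawl.scheduled"; // hypothetical queue name

  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost"); // assumes a local broker for the sketch

    try (Connection connection = factory.newConnection();
         Channel channel = connection.createChannel()) {
      channel.queueDeclare(CRAWL_QUEUE, true, false, false, null);

      // 1) The scheduler broadcasts that a resource is to be crawled,
      //    rather than spawning a crawler directly.
      String event = "{\"resourceKey\":\"00000000-0000-0000-0000-000000000000\"}";
      channel.basicPublish("", CRAWL_QUEUE, null, event.getBytes(StandardCharsets.UTF_8));

      // 2) A crawler subscribes and acts on the event when capacity allows.
      channel.basicConsume(CRAWL_QUEUE, true, new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String consumerTag, Envelope envelope,
                                   AMQP.BasicProperties properties, byte[] body) throws IOException {
          String message = new String(body, StandardCharsets.UTF_8);
          // ... crawl the resource, store responses to the filesystem, and broadcast
          // "response to be handled" events that the synchronisers consume in the same way.
          System.out.println("Crawling resource for event: " + message);
        }
      });

      Thread.sleep(1000); // keep the connection open long enough for the delivery (sketch only)
    }
  }
}
```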
As an aside, during this exercise we are also investigating improvements in the following:
- The HIT today performs the metadata request but only updates the data portal with the datasets that are discovered, not the registry. The GBIF registry is "dataset aware" for the datasets served through the Integrated Publishing Toolkit, and ultimately we intend the registry to be able to reconcile the multiple identifiers associated with a dataset. For example, it should be possible in the future to synchronise with the likes of the Biodiversity Collections Index, which is a dataset-level registry.
- The harvesting procedure is rather complex, with many points of failure; it involves inventories of scientific names, processing into ranges of names and a harvest based on the name ranges. Early tests suggest that a simpler approach of discrete name ranges [Aaa-Aaz, Aba-Abz ... Zza-Zzz] yields better results (sketched below).
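A minimal sketch of how such a fixed set of discrete ranges could be generated is below; this is illustrative only and does not reflect a decision on the final implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Generates the fixed, discrete name ranges [Aaa-Aaz, Aba-Abz ... Zza-Zzz]. */
public class DiscreteNameRanges {

  public static List<String[]> ranges() {
    List<String[]> ranges = new ArrayList<>();
    for (char first = 'A'; first <= 'Z'; first++) {
      for (char second = 'a'; second <= 'z'; second++) {
        ranges.add(new String[] {"" + first + second + 'a', "" + first + second + 'z'});
      }
    }
    return ranges; // 26 x 26 = 676 ranges, independent of any per-resource name inventory
  }

  public static void main(String[] args) {
    List<String[]> r = ranges();
    System.out.println(r.size() + " ranges, first: " + r.get(0)[0] + "-" + r.get(0)[1]
        + ", last: " + r.get(r.size() - 1)[0] + "-" + r.get(r.size() - 1)[1]);
  }
}
```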