Friday, 27 May 2011

The Phantom Records Menace

For a data administrator, the web test interface of a data publisher can be incredibly useful when comparing the data collected with the Harvesting and Indexing Toolkit (HIT) against what is available from the publisher. In a perfect world, the transfer of records would happen without a glitch, but when we end up with fewer (or more!) records than we asked for, the search/test interfaces can be a real help (for instance, the PyWrapper querying utilities).

Sometimes GBIF will index a resource that, for no apparent reason, turns in fewer records than expected from the line count that the HIT performs automatically. In this particular case there also appear to be several identical records, which the HIT makes us aware of by warning that multiple records share the same “holy triplet”: institution code, collection code and catalogue number.
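To make the triplet check concrete, here is a minimal sketch of how duplicates of this kind could be detected. It is not the HIT's actual code; the sample records and the function name `find_duplicate_triplets` are made up for illustration.

```python
from collections import Counter

# Hypothetical harvested lines, each reduced to its "holy triplet":
# (institution code, collection code, catalogue number).
records = [
    ("INST", "COLL", "0001"),
    ("INST", "COLL", "0002"),
    ("INST", "COLL", "0002"),  # the same record received a second time
    ("INST", "COLL", "0003"),
]


def find_duplicate_triplets(rows):
    """Return every triplet that occurs more than once, with its count."""
    counts = Counter(rows)
    return {triplet: n for triplet, n in counts.items() if n > 1}


duplicates = find_duplicate_triplets(records)
line_count = len(records)         # what a plain line count reports
unique_count = len(set(records))  # what remains once duplicates are dropped

print(duplicates)
print(line_count, unique_count)
```

The gap between `line_count` and `unique_count` is the kind of discrepancy described above: more lines received than distinct records.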

Now what happens when a request goes out for the name range Abies alba Mill. - Achillea millefolium L., followed by a request for Achillea millefolium agg. - Acinos arvensis (Lam.) Dandy? Those of you with good eyesight will have spotted that the requests ask for Achillea millefolium L. before Achillea millefolium agg. This is because this particular instance or configuration of PyWrapper returns a name range sorted by character value, as in ASCII, Latin-1 and UTF-8, which places all upper-case letters before the lower-case ones. Whether this is an artifact of the underlying database system, of PyWrapper itself, or even of a specific version of the wrapper is not yet known, but the scenario exists today and consumers should be aware of it.

The HIT then builds its requests from this name range. If the boundary between two requests happens to fall between Achillea millefolium L. and Achillea millefolium agg., you will receive overlapping responses, that is, two responses that each contain some of the other’s records. This happens because the response is not based on a BINARY select statement: the records are selected and returned in alphabetical order, without giving precedence to upper-case letters. The behavior can be replicated by going to the PyWrapper interface and searching these name ranges.

Fortunately, the HIT removes the redundant records during synchronization. However, the record count is based on the line count taken at the point where the records are received from the access point. This is why the record count in the HIT is inflated and, as you can see, this kind of error can be a bit difficult to spot.
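The mismatch between the two orderings can be reproduced in a few lines of Python. This is only a sketch under assumptions: `str.casefold` stands in for whatever case-insensitive comparison the publisher's database applies, and the helper `in_range_ci` is hypothetical, not part of PyWrapper or the HIT.

```python
names = [
    "Abies alba Mill.",
    "Achillea millefolium L.",
    "Achillea millefolium agg.",
    "Acinos arvensis (Lam.) Dandy",
]

# Sorting by character value puts "L" (76) before "a" (97),
# so "... L." comes before "... agg." in the name range.
binary_sorted = sorted(names)

# A case-insensitive ordering puts "agg." before "L." instead.
alphabetical_sorted = sorted(names, key=str.casefold)


def in_range_ci(name, lower, upper):
    """Case-insensitive range test (assumed stand-in for the non-BINARY select)."""
    return lower.casefold() <= name.casefold() <= upper.casefold()


# The two requests, with boundaries taken from the character-value ordering.
request_1 = [n for n in names
             if in_range_ci(n, "Abies alba Mill.", "Achillea millefolium L.")]
request_2 = [n for n in names
             if in_range_ci(n, "Achillea millefolium agg.",
                            "Acinos arvensis (Lam.) Dandy")]

overlap = set(request_1) & set(request_2)
received = request_1 + request_2

print(binary_sorted)
print(alphabetical_sorted)
print(sorted(overlap))
print(len(received), len(set(received)))
```

In this toy case both Achillea millefolium names come back in each response, so the number of names received is larger than the number of distinct names, which mirrors the inflated count.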
