Sometimes GBIF indexes a resource that, for no apparent reason, yields fewer records than expected from the line count that the HIT performs automatically. In this particular case there also appear to be several identical records, which we are made aware of because the HIT warns us that multiple records share the same “holy triplet”: institution code, collection code and catalogue number.
Now what happens when a request goes out for the name range Abies alba Mill. - Achillea millefolium L., followed by a request for Achillea millefolium agg. - Acinos arvensis (Lam.) Dandy? Those of you with good eyesight will have spotted that the first request ends at Achillea millefolium L. and the second starts at Achillea millefolium agg., so "L." is placed before "agg.". This is because this particular instance or configuration of pywrapper returns a name range sorted by character value, as in ASCII/Latin-1 and UTF-8, which orders all upper-case letters before the lower-case ones. Whether this is an artifact of the underlying database system, of pywrapper itself, or even of a specific version of the wrapper is not yet known, but the scenario exists today and consumers should be aware of it.
The HIT then builds its requests from this name range. If the boundary between two requests happens to fall between “Achillea millefolium L.” and “Achillea millefolium agg.”, you will receive overlapping responses, that is, two responses that each contain some of the other’s records. This happens because the response is not based on a BINARY select statement: the records are selected and returned in plain alphabetical order, without giving precedence to upper-case letters, so both names fall inside both ranges. This behavior can be replicated by going to the pywrapper interface and searching these name ranges.
Fortunately, the HIT removes the redundant records during the synchronizing process. However, the record count is based on the line count at the point where the records are received from the access point. This is why the record count in the HIT is inflated, and as you can see, this kind of error can be a bit difficult to spot.
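
The effect can be reproduced outside the HIT with a minimal sketch. It uses only the four names mentioned above, with one record per name, and it assumes that the access point evaluates each range inclusively and without regard to case. The function names (`binary_key`, `insensitive_key`, `select_between`) are made up for illustration and are not part of the HIT or pywrapper.

```python
# Hypothetical sample data: the four names from the ranges above,
# treated as one record each.
NAMES = [
    "Abies alba Mill.",
    "Achillea millefolium L.",
    "Achillea millefolium agg.",
    "Acinos arvensis (Lam.) Dandy",
]


def binary_key(name):
    """Sort key that compares raw byte values, so 'L' (0x4C) < 'a' (0x61)."""
    return name.encode("utf-8")


def insensitive_key(name):
    """Sort key that ignores case, so 'agg.' sorts before 'L.'."""
    return name.casefold()


def select_between(names, lower, upper):
    """Assumed server behaviour: inclusive, case-insensitive range select."""
    lo, hi = insensitive_key(lower), insensitive_key(upper)
    return [n for n in names if lo <= insensitive_key(n) <= hi]


# The name range as the wrapper returns it: binary order.
binary_order = sorted(NAMES, key=binary_key)

# The two requests described in the text, split between "L." and "agg.".
requests = [
    ("Abies alba Mill.", "Achillea millefolium L."),
    ("Achillea millefolium agg.", "Acinos arvensis (Lam.) Dandy"),
]

responses = [select_between(NAMES, lower, upper) for lower, upper in requests]

# Line count as received, before any duplicates are removed.
received_count = sum(len(r) for r in responses)
# Distinct records left after removing the redundant ones.
unique_count = len({name for r in responses for name in r})
# Records that appear in both responses.
overlap = sorted(set(responses[0]) & set(responses[1]), key=insensitive_key)

print("binary order:", binary_order)
print("received:", received_count, "unique:", unique_count)
print("overlap:", overlap)
```

Under these assumptions both Achillea millefolium names come back in both responses, so six lines are received for only four distinct records. That matches the inflated count described above.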