Monday 10 December 2012

"I noticed that the GBIF data portal has fewer records than it used to – what happened?"


If you are a regular user of the GBIF data portal at http://data.gbif.org, or keep an eye on the numbers given at http://www.gbif.org, you may have noticed that the number of indexed records recently dipped from well over 389 million to a little more than 383 million. Why would that be?

The main reason is that software and processing upgrades have made it easier to spot duplicates and outdated, no longer published versions of records and datasets. Since the previous version of the data index, a major removal of such records has taken place:
  
- Several publishers migrated their datasets from other publishing tools to the Integrated Publishing Toolkit (IPT) and Darwin Core Archives, and in the process identified and removed duplicate records in the published source data. As an added benefit, publishing through Darwin Core Archives allows the indexing process to remove records from the index automatically once they are no longer contained in the source file: because a data transfer is reliably all-or-nothing, any record that is not touched during indexing can safely be deleted (see the sketch after this list). This is harder with the dialogue-driven data transfer protocols (DiGIR, BioCASe and TAPIR), where a transfer might fail partway through for a number of reasons, so deletions require human supervision.
- The now dataset-aware registry and the revised metadata updating workflow make it much easier to spot data resources that are no longer published at source and therefore need to be removed from the data portal as well. Previously, such checks were manual and required regular screening. More often than not, datasets are not really withdrawn, but instead published under a new identifier, combined with other data, or moved to a new location, with the old version lingering in the index until spotted or pointed out. The new registry workflows will significantly speed up the detection and handling of such cases.
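
To make the contrast between the two approaches concrete, here is a minimal sketch, in Python, of the all-or-nothing replacement logic described above. It is not GBIF's actual indexing code: the in-memory index, the file layout and the occurrenceID column name are assumptions made purely for illustration.

```python
# Minimal sketch of the "all-or-nothing" replacement described above.
# Hypothetical index interface and column names, for illustration only.

import csv

def reindex_dataset(archive_occurrence_file, index):
    """Replace a dataset's records in the index with the contents of a
    freshly downloaded Darwin Core Archive occurrence file."""
    # Record IDs currently held in the index for this dataset
    previously_indexed = set(index.keys())

    # Record IDs seen in the new, complete snapshot
    seen = set()
    with open(archive_occurrence_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            record_id = row["occurrenceID"]   # hypothetical column name
            index[record_id] = row            # insert or update the record
            seen.add(record_id)

    # Because the archive is a complete snapshot, anything not touched
    # during this pass is no longer published and can be deleted.
    stale = previously_indexed - seen
    for stale_id in stale:
        del index[stale_id]

    return len(seen), len(stale)
```

With a complete snapshot, any record the new archive does not mention is known to be unpublished and can be dropped automatically; with a dialogue-driven harvest that may stop partway through, a missing record could simply mean the harvest never reached it, which is why deletions there still need human supervision.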

In summary, the current drop in numbers is the result of data cleaning and the removal of duplicates, and it reflects continuing efforts by publishers, nodes and the Secretariat to improve the quality of data accessible through the GBIF network. Such cleaning activities happen regularly, but their effects are usually masked by growing record counts in existing resources and by new datasets entering the global index. This time, the reduction happens to be larger than the additions.