Monday, 29 October 2012

The GBIF Registry is now dataset-aware!

This post continues the series of posts that highlight the latest updates on the GBIF Registry.

To recap, in April 2011 Jose Cuadra wrote "The evolution of the GBIF Registry", a post that provided background on the GBIF Network, explained how Network entities are now stored in a database instead of a UDDI system, and described the Registry's new web application and API.

Then a month later, Jose wrote another post entitled "2011 GBIF Registry Refactoring" that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase.

Now even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement that happened in September 2012: the Registry is now dataset-aware! 

To be dataset-aware means that the Registry now knows about all the datasets that exist behind DiGIR and BioCASE endpoints. Just in case the reader isn't aware, DiGIR and BioCASE are wrapper tools used by organizations in the GBIF Network to publish their datasets. The datasets are exposed via an endpoint URL, and there can potentially be thousands of datasets behind a single endpoint.

Traditionally, the GBIF Registry knew about the endpoint but not about its datasets. It was then the job of GBIF's Harvesting and Indexing Toolkit (HIT) to discover what datasets existed behind the endpoint, harvest all their records, and index those records into the GBIF Data Portal.

Therefore if you ever visited the GBIF Data Portal and viewed the Portal page for the Academy of Natural Sciences, you would find that it has 3 datasets. 

Clicking on each one reveals that they are all exposed via the same DiGIR endpoint (see "Access point URL") - see below:

But if you had visited the GBIF Registry and done the same search for the Academy of Natural Sciences before the Registry was dataset-aware, you would have seen that it has a DiGIR endpoint but found no datasets listed!

Now that the GBIF Registry is dataset-aware, however, the Registry page for the Academy of Natural Sciences shows that the organization owns 3 datasets, and has a (DiGIR) Technical Installation. 


So that's fantastic, now the GBIF Registry knows about 1000s of datasets that only the GBIF Data Portal knew about before. But how was dataset-awareness achieved? 

First, the Registry now does the job of dataset discovery that the HIT used to do. A project called the registry-metadata-sync was created to do this. 

Second, a special set of scripts was written to migrate all the datasets from the GBIF Data Portal index database, into the Registry database. For the first time, all datasets that existed in the GBIF Data Portal now exist in the GBIF Registry, and can be uniquely identified by their GBIF Registry UUID!

Third, the HIT was branched, creating a revised version of the tool that was able to understand the new dataset-aware Registry. The HIT also had to be modified to allow its operators to still trigger dataset discovery by technical installation. Life just got easier for the HIT though, since it could use each dataset's GBIF Registry UUID to uniquely identify each dataset during indexation. 

Indeed, the dataset-aware Registry allocates a UUID to each dataset. This is fundamentally the biggest advantage that the dataset-aware Registry brings. Having succeeded in uniquely identifying each dataset in its Registry, GBIF is now working to assign each dataset a Globally Unique Identifier (GUID) in the form of a Digital Object Identifier (DOI). The DOI for a dataset will be resolvable back to the GBIF Registry, and could be referenced when citing a dataset, thereby enabling better tracking of dataset usage in scientific publications.
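
To make the idea concrete, here is a minimal sketch in Python of what being dataset-aware could look like: a technical installation (one endpoint) holds many datasets, each dataset is given a UUID the first time it is discovered, and later syncs leave that UUID untouched. The class names, the `sync` method, the endpoint URL and the dataset titles are all made up for illustration; this is not the Registry's actual schema, nor the registry-metadata-sync code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical model: a dataset is identified by a UUID assigned once.
    title: str
    key: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class TechnicalInstallation:
    # Hypothetical model: one endpoint URL, potentially many datasets behind it.
    endpoint_url: str
    protocol: str
    datasets: dict = field(default_factory=dict)  # title -> Dataset

    def sync(self, discovered_titles):
        """Register newly discovered datasets; existing ones keep their key."""
        for title in discovered_titles:
            if title not in self.datasets:
                self.datasets[title] = Dataset(title)
        return {title: ds.key for title, ds in self.datasets.items()}


# Illustrative values only (assumed, not taken from the Registry).
installation = TechnicalInstallation("http://example.org/digir", "DiGIR")
titles = ["Dataset A", "Dataset B", "Dataset C"]

first = installation.sync(titles)   # discovery assigns a UUID per dataset
second = installation.sync(titles)  # a repeated sync changes nothing

for title, key in first.items():
    print(title, key)
```

The point of the sketch is only that the identifier belongs to the dataset rather than to the endpoint, so a harvester can refer to a dataset by its UUID regardless of how many datasets share the same access point.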

GBIF is really excited about being able to provide publishers with a DOI for each of their datasets. Keep an eye on our Registry in the coming months for their grand appearance.


  1. @kylebraak DOIs for datasets are nice, but not terribly useful as it's the wrong level of granularity to do anything interesting. Now, DOIs for occurrences, that would be transformative.

  2. DOIs for individual occurrence records could be a real challenge for the DOI infrastructure. Crossref currently has 58 million DOIs registered - less than 15% of the ~390 million GBIF records. Another idea floating around is a feature to archive "virtual" occurrence datasets and issue a DOI so they can be cited in publications which make use of these records. That sounds like a good compromise to me.

  3. Wasn't aware you guys were getting serious about DOIs. Canadensys is in agreements with DataCite Canada to do the same. Let's talk.

  4. @madoering DOIs are bigger than CrossRef, there are other registry agencies such as DataCite mentioned by @dpsSpiders, and DOIs are built on Handles, which are widely used by digital archives. I guess I'm arguing that just because CrossRef currently has 57M DOIs there's no reason to believe that there's a limit to the number of DOIs that can be minted and resolved.

    Secondly, issuing DOIs for datasets is a poor compromise that assumes data sets are important. Elsewhere I've argued that they aren't. Aggregations of data are, in my opinion, typically short-lived (i.e., not much beyond the lifetime of the paper that published the data). If the underlying data are available, people will mix and rematch the data, not the aggregations. And if you have identifiers for aggregations and not individual items, you miss out on linking stuff together, e.g. GBIF and GenBank, GBIF and BHL. From my perspective there is huge value in being able to make these links (as well as the side effect of being able to catch the numerous duplicates that GBIF has).

    I think it's time to take occurrence-level identifiers seriously. There might be ways to prioritise this based on current patterns of citation (e.g., what specimens have been cited in the taxonomic literature or in databases such as GBIF and iBOL). Dataset-level identifiers have their uses, but I feel this is simply putting off the task we really need to tackle.

    1. I agree fully that stable occurrence ids are needed so that data can be interlinked properly and this is very much understood here at GBIF.
      A recent check showed that roughly 93% of all our indexed occurrences already had stable ids, with the remaining 7% being either unstable or genuinely new records coming from the last indexing.

      Obviously we must rely on some stable aspect in the sources to know which record is still the same as before, and we fail to do that in some cases. We should investigate those cases more, and I sense it will become mostly a social thing, ensuring that occurrence publishers follow some best practices. And if they don't, I think we should flag those records/datasets as being unstable and indicate clearly that these records should not be used for linking/citing. At least we are not aware of any simple natural identifier for occurrences - if you know one please tell us!

    2. Can you define "stable"? Are we talking over the lifetime of GBIF, or between harvesting intervals, or something else?

      My experience has been that some ids simply vanish (e.g., ). I understand that GBIF is highly dependent on data providers not mucking around with their own records, but this message doesn't seem to be getting through to data providers (partly because they've yet to see any value from having stable identifiers).

      It would be useful to have a measure of occurrence id stability along the lines you suggest, especially for someone like me who invests effort in linking museum codes to occurrence ids, only to have those links evaporate on the whim of a data provider.

    3. Yes, I was thinking about stable between harvesting intervals. I guess if we flag datasets or individual records as unstable this would eventually also lead to better practices by the publishers.

  5. @rdmpage "Secondly, issuing DOIs for datasets is a poor compromise that assumes data sets are important." For occurrence datasets, perhaps. For checklist datasets (the other flavour of DwC-A), they make absolute sense.

    1. Sure, and I'm not really arguing against DOIs for datasets, just that this doesn't solve what for me is the real problem, identifiers for occurrences.

  6. With these discussions, we should probably recognize that not all occurrences really warrant a DOI. Specimens in a NHM are likely to be used and cited, whereas an observation in a garden bird count is less likely to be. A mint-on-demand service might be worthwhile, one that also archives a copy of the record to guard against subsequent changes.

    For the scaling of DataCite, I _believe_ they are built on BDB and have not yet outgrown a single machine, and their minting frequency supports replication over the internet; things that GBIF don't now. I think GBIF have a lot to offer with lessons learnt on the pain of growing indexes beyond 1 machine that might be relevant to that community.

    One benefit of DOI for datasets is really to help people who have used the GBIF services to get aggregate datasets, and provide consistent citation guidelines (data are cited by dataset granularity when used for large scale analysis).