Monday 29 October 2012

The GBIF Registry is now dataset-aware!


This post continues the series of posts that highlight the latest updates on the GBIF Registry.

To recap, in April 2011 Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided background on the GBIF Network, explained how Network entities are now stored in a database instead of a UDDI system, and introduced the Registry's new web application and API.

Then a month later, Jose wrote another post entitled 2011 GBIF Registry Refactoring that was more technical in nature and detailed a new set of technologies chosen to improve the underlying codebase.

Now even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement that happened in September 2012: the Registry is now dataset-aware! 

Being dataset-aware means that the Registry now knows about all the datasets that exist behind DiGIR and BioCASE endpoints. In case the reader isn't familiar with them, DiGIR and BioCASE are wrapper tools used by organizations in the GBIF Network to publish their datasets. The datasets are exposed via an endpoint URL, and there can be thousands of datasets behind a single endpoint.
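This one-endpoint-to-many-datasets relationship is the crux of the problem. As a minimal sketch (the class and collection names here are illustrative, not GBIF code), the shape the Registry has to model looks something like this:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    title: str

@dataclass
class Endpoint:
    """A single DiGIR/BioCASE access point URL exposing many datasets."""
    url: str
    datasets: list = field(default_factory=list)

# One wrapper endpoint, three datasets behind it -- hypothetical titles.
endpoint = Endpoint(url="http://example.org/digir/DiGIR.php")
for title in ("Fish Collection", "Malacology Collection", "Herpetology Collection"):
    endpoint.datasets.append(Dataset(title=title))

print(len(endpoint.datasets))  # 3
```

A registry that only records `Endpoint.url` knows nothing about the list hanging off it, which is exactly the gap described below.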

Traditionally, the GBIF Registry knew about the endpoint but not about its datasets. It was then the job of GBIF's Harvesting and Indexing Toolkit (HIT) to discover what datasets existed behind the endpoint, harvest all their records, and index those records into the GBIF Data Portal.

Therefore, if you ever visited the GBIF Data Portal and viewed the Portal page for the Academy of Natural Sciences, you would have found that it has 3 datasets.

Clicking on each one reveals that they are all exposed via the same DiGIR endpoint (see "Access point URL" below):

But if you had visited the GBIF Registry and done the same search for the Academy of Natural Sciences before the Registry became dataset-aware, you would have seen that it has a DiGIR endpoint, but not found that it has any datasets!

Now that the GBIF Registry is dataset-aware, however, the Registry page for the Academy of Natural Sciences shows that the organization owns 3 datasets and has a (DiGIR) Technical Installation.

So that's fantastic: the GBIF Registry now knows about thousands of datasets that previously only the GBIF Data Portal knew about. But how was dataset-awareness achieved?

First, the Registry now does the job of dataset discovery that the HIT used to do, via a new project called registry-metadata-sync.
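registry-metadata-sync itself is a separate project, but the essence of discovery can be sketched: ask the wrapper endpoint for its metadata and parse out the resources (datasets) it declares. The XML below is a heavily simplified stand-in of my own; real DiGIR metadata responses are namespaced and far richer.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a DiGIR metadata response; the real protocol
# wraps namespaced <resource> elements inside a larger response document.
sample_response = """
<response>
  <content>
    <resource><code>ansp_fish</code><name>Fish Collection</name></resource>
    <resource><code>ansp_mala</code><name>Malacology Collection</name></resource>
    <resource><code>ansp_herp</code><name>Herpetology Collection</name></resource>
  </content>
</response>
"""

def discover_datasets(metadata_xml: str) -> list:
    """Return (code, name) for every resource the endpoint declares."""
    root = ET.fromstring(metadata_xml)
    return [(r.findtext("code"), r.findtext("name"))
            for r in root.iter("resource")]

for code, name in discover_datasets(sample_response):
    print(code, name)
```

One metadata request against one endpoint URL yields the full list of datasets behind it, which is what the Registry previously never saw.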

Second, a special set of scripts was written to migrate all the datasets from the GBIF Data Portal index database into the Registry database. For the first time, all datasets that existed in the GBIF Data Portal now exist in the GBIF Registry, and each can be uniquely identified by its GBIF Registry UUID!
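The migration scripts themselves aren't shown in this post, but the essential step, giving every migrated dataset its own Registry UUID, can be sketched like this (the row contents are invented for illustration):

```python
import uuid

# Hypothetical rows pulled from the Data Portal index database.
portal_datasets = [
    {"provider": "Academy of Natural Sciences", "title": "Fish Collection"},
    {"provider": "Academy of Natural Sciences", "title": "Malacology Collection"},
]

# Each dataset receives a random (version 4) UUID as its Registry key.
registry = {}
for row in portal_datasets:
    key = str(uuid.uuid4())
    registry[key] = row

# Every dataset now has exactly one unique key.
assert len(registry) == len(portal_datasets)
```

From this point on, any tool in the network can refer to a dataset by that one key instead of by an endpoint URL plus some locally scoped name.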

Third, the HIT was branched, creating a revised version of the tool that understands the new dataset-aware Registry. The HIT also had to be modified so its operators could still trigger dataset discovery by technical installation. Life got easier for the HIT, though, since it can now use each dataset's GBIF Registry UUID to uniquely identify that dataset during indexing.

Indeed, the dataset-aware Registry allocates a UUID to each dataset, and this is fundamentally the biggest advantage it brings. Having succeeded in uniquely identifying each dataset in its Registry, GBIF is now working to assign each dataset a Globally Unique Identifier (GUID) in the form of a Digital Object Identifier (DOI). The DOI for a dataset will resolve back to the GBIF Registry and can be referenced when citing a dataset, thereby enabling better tracking of dataset usage in scientific publications.
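As a hedged illustration of what that resolution chain could look like (the DOI, UUID, and URL pattern below are all made up; no dataset DOIs existed at the time of writing), a dataset DOI would map back to the dataset's Registry entry via its UUID:

```python
# Hypothetical mapping: a dataset DOI resolves back to its Registry entry.
doi_to_uuid = {
    "10.9999/example-dataset": "123e4567-e89b-12d3-a456-426614174000",
}

def resolve(doi: str) -> str:
    """Return the (hypothetical) Registry URL a dataset DOI would resolve to."""
    dataset_uuid = doi_to_uuid[doi]
    return f"http://gbrds.gbif.org/browse/agent?uuid={dataset_uuid}"

print(resolve("10.9999/example-dataset"))
```

A citation in a paper would then carry the DOI, and anyone following it would land on the authoritative Registry record for that dataset.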

GBIF is really excited about being able to provide publishers with a DOI for each of their datasets. Keep an eye on our Registry in the coming months for their grand appearance.

10 comments:

  1. @kylebraak DOIs for datasets are nice, but not terribly useful, as it's the wrong level of granularity to do anything interesting. Now, DOIs for occurrences: that would be transformative.

    ReplyDelete
  2. DOIs for individual occurrence records could be a real challenge for the DOI infrastructure. Crossref currently has 58 million DOIs registered, less than 15% of the ~390 million GBIF records. Another idea floating around is a feature to archive "virtual" occurrence datasets and issue a DOI so they can be cited in publications which make use of these records. That sounds like a good compromise to me.

    ReplyDelete
  3. Wasn't aware you guys were getting serious about DOIs. Canadensys has an agreement with DataCite Canada to do the same. Let's talk. david.shorthouse@umontreal.ca

    ReplyDelete
  4. @madoering DOIs are bigger than CrossRef; there are other registration agencies, such as DataCite mentioned by @dpsSpiders, and DOIs are built on Handles, which are widely used by digital archives. I guess I'm arguing that just because CrossRef currently has 57M DOIs, there's no reason to believe there's a limit to the number of DOIs that can be minted and resolved.

    Secondly, issuing DOIs for datasets is a poor compromise that assumes data sets are important. Elsewhere I've argued that they aren't: http://iphylo.blogspot.co.uk/2011/04/data-matters-but-do-data-sets.html. Aggregations of data are, in my opinion, typically short-lived (i.e., not much beyond the lifetime of the paper that published the data). If the underlying data are available, people will mix and match the data, not the aggregations. And if you have identifiers for aggregations and not individual items, you miss out on linking stuff together, e.g. GBIF and GenBank (http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html), GBIF and BHL (http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html). From my perspective there is huge value in being able to make these links (as well as the side effect of being able to catch the numerous duplicates that GBIF has: http://iphylo.blogspot.co.uk/2012/02/how-many-specimens-does-gbif-really.html).

    I think it's time to take occurrence-level identifiers seriously. There might be ways to prioritise this based on current patterns of citation (e.g., what specimens have been cited in the taxonomic literature or in databases such as GBIF and iBOL). Dataset-level identifiers have their uses, but I feel this is simply putting off the task we really need to tackle.

    ReplyDelete
    Replies
    1. I agree fully that stable occurrence ids are needed so that data can be interlinked properly, and this is very much understood here at GBIF.
      A recent check showed that roughly 93% of all our indexed occurrences already had stable ids, with the remaining 7% being either unstable or genuinely new records from the last indexing.

      Obviously we must rely on some stable aspect in the sources to know which record is still the same as before, and we fail to do that in some cases. We should investigate those cases more, and I sense it will become mostly a social matter: ensuring that occurrence publishers follow some best practices. And if they don't, I think we should flag those records/datasets as unstable and indicate clearly that they should not be used for linking/citing. At least, we are not aware of any simple natural identifier for occurrences; if you know one, please tell us!

      Delete
    2. Can you define "stable"? Are we talking over the lifetime of GBIF, or between harvesting intervals, or something else?

      My experience has been that some ids simply vanish (e.g., http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html ). I understand that GBIF is highly dependent on data providers not mucking around with their own records, but this message doesn't seem to be getting through to data providers (partly because they've yet to see any value from having stable identifiers).

      It would be useful to have a measure of occurrence id stability along the lines you suggest, especially for someone like me who invests effort in linking museum codes to occurrence ids, only to have those links evaporate on the whim of a data provider.

      Delete
    3. Yes, I was thinking about stability between harvesting intervals. I guess if we flag datasets or individual records as unstable, this would eventually also lead to better practices by the publishers.

      Delete
  5. @rdmpage "Secondly, issuing DOIs for datasets is a poor compromise that assumes data sets are important." For occurrence datasets, perhaps. For checklist datasets (the other flavour of DwC-A), they make absolute sense.

    ReplyDelete
    Replies
    1. Sure, and I'm not really arguing against DOIs for datasets, just that this doesn't solve what for me is the real problem, identifiers for occurrences.

      Delete
  6. With these discussions, we should probably recognize that not all occurrences really warrant a DOI. Specimens in an NHM are likely to be used and cited, whereas an observation in a garden bird count is less likely to be. A mint-on-demand service might be worthwhile, also archiving a copy of the record to guard against subsequent changes.

    For the scaling of DataCite, I _believe_ they are built on BDB and have not yet outgrown a single machine, and their minting frequency supports replication over the internet; things that GBIF doesn't have now. I think GBIF has a lot to offer with lessons learnt on the pain of growing indexes beyond one machine that might be relevant to that community.

    One benefit of DOIs for datasets is really to help people who have used the GBIF services to get aggregate datasets, and to provide consistent citation guidelines (data are cited at dataset granularity when used for large-scale analysis).

    ReplyDelete