Monday 29 October 2012

The GBIF Registry is now dataset-aware!


This post continues the series of posts that highlight the latest updates on the GBIF Registry.

To recap, in April 2011 Jose Cuadra wrote The evolution of the GBIF Registry, a post that provided background on the GBIF Network, explained how Network entities are now stored in a database instead of the old UDDI system, and described the Registry's new web application and API.

Then a month later, Jose wrote another post entitled 2011 GBIF Registry Refactoring that was more technical in nature, detailing a new set of technologies chosen to improve the underlying codebase.

Now even if you have been keeping an eye on the GBIF Registry, you probably missed the most important improvement that happened in September 2012: the Registry is now dataset-aware! 

Being dataset-aware means that the Registry now knows about all the datasets that exist behind DiGIR and BioCASE endpoints. In case the reader isn't aware, DiGIR and BioCASE are wrapper tools used by organizations in the GBIF Network to publish their datasets. The datasets are exposed via an endpoint URL, and there can potentially be thousands of datasets behind a single endpoint.

Traditionally, the GBIF Registry knew about the endpoint but not about its datasets. It was then the job of GBIF's Harvesting and Indexing Toolkit (HIT) to discover what datasets existed behind the endpoint, harvest all their records, and index those records into the GBIF Data Portal.

Therefore, if you ever visited the GBIF Data Portal and viewed the Portal page for the Academy of Natural Sciences, you would find that it has 3 datasets.



Clicking on each one reveals that they are all exposed via the same DiGIR endpoint (see "Access point URL" below):

But if you had visited the GBIF Registry and done the same search for the Academy of Natural Sciences before the Registry became dataset-aware, you would have seen that it has a DiGIR endpoint, but you would not have found any datasets!

Now that the GBIF Registry is dataset-aware, however, the Registry page for the Academy of Natural Sciences shows that the organization owns 3 datasets and has a (DiGIR) Technical Installation.

So that's fantastic: the GBIF Registry now knows about thousands of datasets that only the GBIF Data Portal knew about before. But how was dataset-awareness achieved?

First, the Registry now does the job of dataset discovery that the HIT used to do. A project called registry-metadata-sync was created for this purpose.
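
For the curious, here's a rough sketch (in Python) of what that discovery step boils down to. It assumes a DiGIR endpoint that returns its metadata document - the listing of resources it hosts - when requested without parameters, which most providers do; the endpoint URL is just a placeholder and element names can vary between wrapper versions, so treat this as an illustration rather than the actual registry-metadata-sync code.

import urllib.request
import xml.etree.ElementTree as ET

# Illustrative placeholder; substitute a real DiGIR endpoint URL.
ENDPOINT = "http://example.org/digir/DiGIR.php"

# Most DiGIR providers answer a bare request with a metadata document
# listing every resource (dataset) they expose.
with urllib.request.urlopen(ENDPOINT, timeout=30) as response:
    root = ET.fromstring(response.read())

# Ignore XML namespaces and look for <resource> elements, each of which
# describes one dataset sitting behind the endpoint.
for element in root.iter():
    if element.tag.split('}')[-1] == 'resource':
        names = [child.text for child in element
                 if child.tag.split('}')[-1] in ('code', 'name') and child.text]
        print(' / '.join(names))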

Second, a special set of scripts was written to migrate all the datasets from the GBIF Data Portal index database into the Registry database. For the first time, all datasets that existed in the GBIF Data Portal now exist in the GBIF Registry, and each can be uniquely identified by its GBIF Registry UUID!
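
Conceptually, the migration boiled down to reading each dataset out of the Data Portal index database, minting a fresh UUID for it, and writing it into the Registry database. Here's a simplified sketch of that idea - the table and column names, and the use of SQLite, are hypothetical stand-ins, not the real schemas or scripts:

import sqlite3
import uuid

# Hypothetical stand-ins for the real Data Portal index and Registry databases.
portal = sqlite3.connect("portal_index.db")
registry = sqlite3.connect("registry.db")

# Hypothetical table/column names: each portal dataset gets a freshly
# minted GBIF Registry UUID as it is copied across.
for name, endpoint_url in portal.execute(
        "SELECT name, endpoint_url FROM data_resource"):
    dataset_key = str(uuid.uuid4())
    registry.execute(
        "INSERT INTO dataset (key, title, endpoint_url) VALUES (?, ?, ?)",
        (dataset_key, name, endpoint_url))

registry.commit()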

Third, the HIT was branched, creating a revised version of the tool that understands the new dataset-aware Registry. The HIT also had to be modified so that its operators can still trigger dataset discovery by technical installation. Life just got easier for the HIT, though, since it can now use each dataset's GBIF Registry UUID to uniquely identify it during indexing.

Indeed, the dataset-aware Registry allocates a UUID to each dataset, and this is fundamentally the biggest advantage that it brings. Now that GBIF has succeeded in uniquely identifying each dataset in its Registry, it is working to assign each dataset a Globally Unique Identifier (GUID) in the form of a Digital Object Identifier (DOI). The DOI for a dataset will resolve back to the GBIF Registry and can be referenced when citing the dataset, thereby enabling better tracking of dataset usage in scientific publications.
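
Once a dataset has its DOI, anyone (or any script) will be able to resolve it through the standard DOI resolver and land back on its Registry entry. A minimal sketch, using a purely hypothetical DOI for illustration:

import urllib.request

# Hypothetical DOI used purely for illustration; a real dataset DOI
# would be shown on the dataset's Registry page.
doi = "10.99999/example-dataset"

# The public DOI resolver redirects to wherever the DOI currently points,
# which for a GBIF dataset would be its Registry entry.
with urllib.request.urlopen("https://doi.org/" + doi) as response:
    print("DOI resolves to:", response.geturl())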

GBIF is really excited about being able to provide publishers with a DOI for each of their datasets. Keep an eye on our Registry in the coming months for their grand appearance.

Wednesday 17 October 2012

IPT v2.0.4 released


Today the GBIF Secretariat has announced the release of version 2.0.4 of the Integrated Publishing Toolkit (IPT). For those who can't wait to get their hands on the release, it's available for download on the project website here.

Collaboration on this version was more global than ever before, with volunteers in Latin America, Asia, and Europe contributing translations, and volunteers in Canada and the United States contributing some patches. 

Add to that all the issue activity, and things have been busy. In total, 108 issues were addressed in this version: 38 Defects, 35 Enhancements, 7 Other, 5 Patches, 18 Won't fix, 4 Duplicates, and 1 considered Invalid. These are detailed in the issue tracking system.

So what exactly has changed and why? Here's a quick rundown.

One thing that kept coming up again and again in version 2.0.3 was that users were unwittingly installing the IPT in test mode, thinking that they were running in production. After registering a resource, these users expected to see it show up in the GBIF Registry and ultimately be indexed by GBIF. Frustrated emails were then sent to the GBIF Helpdesk when nothing happened. Sadly, the reply from the GBIF Helpdesk was always filled with the same disappointing news:

"Your resource is actually in the Test Registry therefore it will never be indexed by GBIF. Oh, and you will have to reinstall your IPT using production mode next time and do your resource configuration over again!" 

So to tackle this problem, the setup pages have been improved to make it crystal clear what it means to choose one mode or the other. 



The UI is also branded differently when running in test mode, to make it even more obvious which mode the IPT is running in.

Now, whether or not test mode was chosen accidentally, it can be used to help train administrators to configure an instance and to help train users to publish resources. What was always missing was a way to transfer configured resources from an IPT in test mode to one in production.

I'm happy to say that in 2.0.4 a resource can now be easily transferred between two IPTs, including all its source files and mappings. Users will be happy to know that they never have to waste time reconfiguring the same resource from scratch. How is this done? Well, in short, resource transfer is achieved by uploading an archived IPT resource folder during resource creation - see the user manual for full instructions.
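
Under the hood, the archive is essentially a zipped copy of the resource's folder from the IPT data directory. As a rough sketch, assuming the usual layout of a resource folder under <data directory>/resources/<shortname> (the paths and short name below are hypothetical - check your own installation before relying on this):

import os
import shutil

# Assumed layout of the IPT data directory; adjust to match your installation.
data_dir = "/srv/ipt-data"
shortname = "my-checklist"   # hypothetical resource short name

resource_dir = os.path.join(data_dir, "resources", shortname)

# Zip the whole resource folder (resource.xml, eml.xml, sources, mappings...)
# so it can be uploaded to another IPT during resource creation.
archive = shutil.make_archive(shortname, "zip", root_dir=resource_dir)
print("Created", archive)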

Moving on...

With so many publishers opting for the convenience of publishing via the IPT, the GBIF Helpdesk has been receiving dozens of requests to replace an existing DiGIR, BioCASE, or TAPIR resource in the GBIF Registry with one coming from their IPT. To facilitate resource migration, another new feature was added in 2.0.4 that allows the IPT to update an existing resource in the GBIF Registry during registration. The change is welcomed most of all by the GBIF Helpdesk, which bore the brunt of carrying out resource migrations in the GBIF Registry. See the User Manual for instructions.

Thanks to the Taiwan Biodiversity Information Facility (TaiBIF), the IPT interface is now available in Traditional Chinese. That makes the IPT available in a total of four languages, including French, Spanish, and of course English.




What else? 

Thanks to a patch from Peter Desmet, download metrics for the Archive, EML, and RTF files can now be tracked via Google Analytics. For IPT admins who aren't already tracking analytics, there are simple instructions in the User Manual. Here's a screenshot showing some metrics from http://ipt-rc.gbif.org. For your reference, the "Event Label" is the resource short name in the IPT.

Last but not least, it should be highlighted that the IPT's RSS feed is now updated every time a resource is published. The version number is displayed right beside the resource name, so subscribers can stay on top of the latest changes. Here's a screenshot from my RSS reader pulling from http://ipt.gbif.org/rss.do.
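
For anyone who prefers a script to an RSS reader, here's a minimal sketch that lists the latest items in the feed. It only assumes a standard RSS 2.0 document, with the resource name and published version carried in each item's title as described above:

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://ipt.gbif.org/rss.do"

with urllib.request.urlopen(FEED_URL, timeout=30) as response:
    root = ET.fromstring(response.read())

# Standard RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>.
# Each item's title carries the resource name and published version.
for item in root.findall("./channel/item"):
    title = item.findtext("title", default="(untitled)")
    date = item.findtext("pubDate", default="")
    print(title, "-", date)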



And that about wraps up the most important changes in this version. 


As always, we’d like to give special thanks to the volunteer translators for their time and efforts: 
  • Nicolas Noé (Belgian Biodiversity Platform, Belgium) - French 
  • TaiBIF, Taiwan - Traditional Chinese
  • Laura Roldan Gomez, Dairo Escobar, and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Spanish
Plus, another couple of special mentions are owed to Peter Desmet and Laura Russell, who provided an exceptional amount of feedback and suggestions.

On behalf of the GBIF development team, I hope you enjoy using the latest version.