Comments on Developer Blog: The GBIF Registry is now dataset-aware!

With these discussions, we should probably recogni...

2012-11-06T11:29:41.050+01:00

With these discussions, we should probably recognize that not all occurrences really warrant a DOI. Specimens in a NHM are likely to be used and cited, whereas an observation in a garden bird count is less likely to be. A mint-on-demand service might be worthwhile, one that also archives a copy of the record to guard against subsequent changes.

For the scaling of DataCite, I _believe_ they are built on BDB and have not yet outgrown a single machine, and their minting frequency supports replication over the internet; neither of those holds for GBIF now. I think GBIF have a lot to offer with lessons learnt on the pain of growing indexes beyond one machine that might be relevant to that community.

One benefit of DOIs for datasets is really to help people who have used the GBIF services to get aggregate datasets, and to provide consistent citation guidelines (data are cited at dataset granularity when used for large scale analysis).

Yes, I was thinking about stable between harvestin...

2012-11-05T18:02:17.874+01:00

Yes, I was thinking about stable between harvesting intervals. I guess if we flag datasets or individual records as unstable, this would eventually also lead to better practices by the publishers.

Can you define "stable". Are we talking ...

2012-11-05T16:49:50.022+01:00

Can you define "stable"? Are we talking over the lifetime of GBIF, or between harvesting intervals, or something else?

My experience has been that some ids simply vanish (e.g., http://iphylo.blogspot.co.uk/2012/07/dear-gbif-please-stop-changing.html ). I understand that GBIF is highly dependent on data providers not mucking around with their own records, but this message doesn't seem to be getting through to data providers (partly because they've yet to see any value from having stable identifiers).

It would be useful to have a measure of occurrence id stability along the lines you suggest, especially for someone like me who invests effort in linking museum codes to occurrence ids, only to have those links evaporate on the whim of a data provider.

Sure, and I'm not really arguing against DOIs ...

2012-11-05T16:41:28.352+01:00

Sure, and I'm not really arguing against DOIs for datasets, just that this doesn't solve what for me is the real problem, identifiers for occurrences.

@rdmpage "Secondly, issuing DOIs for datasets...

2012-11-05T16:27:16.113+01:00

@rdmpage "Secondly, issuing DOIs for datasets is a poor compromise that assumes data sets are important." For occurrence datasets, perhaps. For checklist datasets (the other flavour of DwC-A), they make absolute sense.

I agree fully that stable occurrence ids are neede...

2012-11-05T11:39:20.341+01:00

I agree fully that stable occurrence ids are needed so that data can be interlinked properly and this is very much understood here at GBIF.
A recent check showed that roughly 93% of all our indexed occurrences already had stable ids, with the remaining 7% being either unstable or genuinely new records coming from the last indexing.

Obviously we must rely on some stable aspect in the sources to know which record is still the same as before, and we fail to do that in some cases. We should investigate those cases more, and I sense it will become mostly a social thing: assuring that occurrence publishers follow some best practices. And if they don't, I think we should flag those records/datasets as being unstable and indicate clearly that these records should not be used for linking/citing. At least we are not aware of any simple natural identifier for occurrences - if you know one please tell us!
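
A check like the one described above could be sketched roughly as follows. This is a minimal illustration, not GBIF's actual method: it assumes occurrence ids are plain strings collected from two indexing runs, and the function name `id_stability` and the sample ids are hypothetical. Note that a plain set comparison cannot tell a genuinely new record from an old record that was given a new id; both show up as "new".

```python
def id_stability(previous_ids, current_ids):
    """Compare occurrence ids from two indexing runs (hypothetical helper)."""
    previous = set(previous_ids)
    current = set(current_ids)
    retained = previous & current  # ids present in both runs
    return {
        "retained": len(retained),
        "vanished": len(previous - current),  # ids that disappeared
        "new": len(current - previous),       # new records OR re-keyed old ones
        # share of the current index whose ids were already known
        "stable_share": len(retained) / len(current) if current else 0.0,
    }


# Made-up sample ids, for illustration only
report = id_stability(
    ["occ-1", "occ-2", "occ-3", "occ-4"],
    ["occ-1", "occ-2", "occ-3", "occ-9"],
)
print(report)
```

Running the same comparison per dataset rather than over the whole index would give the kind of per-publisher stability flag suggested above.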

@madoering DOIs are bigger than CrossRef, there a...

2012-11-05T08:54:34.282+01:00

@madoering DOIs are bigger than CrossRef, there are other registry agencies such as DataCite mentioned by @dpsSpiders, and DOIs are built on Handles, which are widely used by digital archives. I guess I'm arguing that just because CrossRef currently has 57M DOIs there's no reason to believe that there's a limit to the number of DOIs that can be minted and resolved.

Secondly, issuing DOIs for datasets is a poor compromise that assumes data sets are important. Elsewhere I've argued that they aren't http://iphylo.blogspot.co.uk/2011/04/data-matters-but-do-data-sets.html. Aggregations of data are, in my opinion, typically short-lived (i.e., not much beyond the lifetime of the paper that published the data). If the underlying data are available, people will mix and rematch the data, not the aggregations. And if you have identifiers for aggregations and not individual items, you miss out on linking stuff together, e.g. GBIF and GenBank http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-genbank.html, GBIF and BHL http://iphylo.blogspot.co.uk/2012/02/linking-gbif-and-biodiversity-heritage.html. From my perspective there is huge value in being able to make these links (as well as the side effect of being able to catch the numerous duplicates that GBIF has http://iphylo.blogspot.co.uk/2012/02/how-many-specimens-does-gbif-really.html ).

I think it's time to take occurrence-level identifiers seriously. There might be ways to prioritise this based on current patterns of citation (e.g., what specimens have been cited in the taxonomic literature or in databases such as GBIF and iBOL). Dataset-level identifiers have their uses, but I feel this is simply putting off the task we really need to tackle.

Wasn't aware you guys were getting serious abo...

2012-11-03T17:29:42.563+01:00

Wasn't aware you guys were getting serious about DOIs. Canadensys has an agreement with DataCite Canada to do the same. Let's talk. david.shorthouse@umontreal.ca

DOIs for individual occurrence record could be a r...

2012-11-03T15:01:37.434+01:00

DOIs for individual occurrence records could be a real challenge for the DOI infrastructure. CrossRef currently has 58 million DOIs registered - less than 15% of the ~390 million GBIF records. Another idea floating around is a feature to archive "virtual" occurrence datasets and issue a DOI so they can be cited in publications which make use of these records. That sounds like a good compromise to me.

@kylebraak DOIs for datasets are nice, but not ter...

2012-11-03T10:47:54.750+01:00

@kylebraak DOIs for datasets are nice, but not terribly useful, as it's the wrong level of granularity to do anything interesting. Now, DOIs for occurrences, that would be transformative.