Tuesday 4 December 2018

Goodbye developer blog, hello data-blog!





GBIF has a new blog!



What is it?

A place for GBIF staff and guest bloggers to contribute:
  • Statistics 
  • Graphs 
  • Tutorials 
  • Ideas 
  • Opinions 

Who can contribute?

If you would like to contribute you can contact jwaller@gbif.org. Guest blogs are very welcome.

How can I write a post?

There is a short tutorial on the blog's GitHub repository.

What about the developer blog?

The developer blog will remain up as an archive, but there are no plans to actively post new content here.


Friday 27 July 2018

How popular is your favorite species?






How to use

Use the box to the left to type in the species you are interested in.
Make sure to use a scientific name:
  • Aves instead of birds
  • Plantae instead of plants
  • Anura  instead of frogs

Explanation of tool

This tool plots the downloads through time for species or other taxonomic groups with more than 25 downloads at GBIF. Downloads at GBIF most often occur through the web interface. In a previous post, we saw that most users are downloading data from GBIF via filtering by scientific name (aka Taxon Key). Since the GBIF index currently sits at over 1 billion records (a 400+GB csv), most users will simply filter by their taxonomic group of interest and then generate a download.

How to bookmark a result?

If you would like to bookmark a result or graph to share with others, you can visit the app page directly: app link. On this page, the state of the app is saved in the URL. You can also save a JPG by clicking on the little sandwich (menu) icon in the top right.

What counts as a download?

For the graphs above, I decided that it would be more meaningful to roll up downloads below the queried taxonomic level.
  • If a user downloaded 5 different bird species at once, this would count as 1 download for Aves and 1 download for each of the species downloaded.
  • If a user typed only Aves into the occurrence download interface and no other taxon, this would count as 1 download for Aves and 0 downloads for all bird species.
  • Similarly, if a user typed only the order Passeriformes into the search, this would count as 1 download for Passeriformes and 1 download for Aves (and 1 download for Animalia, etc.) but 0 downloads for all the species, families, and genera within Passeriformes.
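The roll-up rule in the list above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the code behind the tool: the PARENT map is a tiny hand-written stand-in for the GBIF backbone, and lineage and count_downloads are hypothetical helper names.

```python
from collections import Counter

# Hypothetical, hand-written slice of the taxonomy (child -> parent).
PARENT = {
    "Aves": "Animalia",
    "Passeriformes": "Aves",
    "Passer domesticus": "Passeriformes",
    "Turdus merula": "Passeriformes",
}

def lineage(taxon):
    """Return the taxon followed by all of its ancestors."""
    chain = [taxon]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def count_downloads(downloads):
    """Count each download once for every requested taxon and every ancestor."""
    counts = Counter()
    for requested in downloads:
        credited = set()
        for taxon in requested:
            credited.update(lineage(taxon))
        counts.update(credited)  # a set, so each taxon is credited once per download
    return counts

counts = count_downloads([
    ["Passer domesticus", "Turdus merula"],  # two species in one download
    ["Passeriformes"],                       # the order only
])
print(counts["Aves"], counts["Passeriformes"], counts["Passer domesticus"])
```

Collecting the credited taxa in a set per download is what keeps a download of several bird species from counting more than once for Aves.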
It is possible, but not as easy, to get data from GBIF without generating a download. In fact, users can stream data using the GBIF occurrence API without ever generating a download. Currently, users can “download” 200k-long chunks of occurrence data this way. If someone got their data through the API in this way, we would currently not be able to track it. Presumably, the vast majority of users get their data directly through the web interface.
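To make the idea of streaming data in chunks concrete, here is a minimal paging sketch. It assumes the paged response shape used by the GBIF API (a "results" list plus an "endOfRecords" flag) and a per-request page size of 300, which is an assumption. The iter_occurrences and fake_fetch names are hypothetical, and the HTTP call is replaced by a stand-in so the example runs offline.

```python
def iter_occurrences(fetch_page, page_size=300, max_records=200_000):
    """Yield records page by page until the API reports the end or the cap is hit.

    fetch_page(offset, limit) must return a dict with "results" (a list)
    and "endOfRecords" (a bool).
    """
    offset = 0
    while offset < max_records:
        limit = min(page_size, max_records - offset)
        page = fetch_page(offset, limit)
        yield from page["results"]
        if page["endOfRecords"] or not page["results"]:
            break
        offset += limit


# Stand-in for an HTTP call such as
# GET https://api.gbif.org/v1/occurrence/search?taxonKey=...&offset=...&limit=...
def fake_fetch(offset, limit, total=650):
    rows = list(range(offset, min(offset + limit, total)))
    return {"results": rows, "endOfRecords": offset + limit >= total}


records = list(iter_occurrences(fake_fetch))
print(len(records))
```

The max_records cap mirrors the 200k chunk size mentioned above. A real client would swap fake_fetch for a function that performs the request and decodes the JSON.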

For more technical details on this tool, you can visit my personal blog:
http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/




Thursday 28 June 2018

Occurrence Downloads

Occurrences at GBIF are often downloaded through the web interface or through the API (via rgbif etc.). Users can place various filters on the data in order to limit the number of records returned. As the occurrence index is currently a 447 GB csv, most users want to use a filter.

Total monthly downloads

Here I plot the total monthly downloads for various popular filters. For the past few years, GBIF has been averaging around 10k downloads per month.

Two peaks in total downloads stand out:
  • Mar 2014
  • Sep 2016
The Sep 2016 peak seems to be explained by high DATASET_KEY downloads. Both the Mar 2014 and Sep 2016 peaks are well explained by the top users. Top users in this graph are all the downloads generated by the 3 most active users on GBIF. These users generate downloads in the thousands, most likely automated downloads generated internally.

One interesting detail is that while No Filter Used is not used very often, it accounts for more than 500 billion occurrence records downloaded.

Finally, if we look at the number of unique users (un-select everything else to see in isolation), we see that the number of individuals making downloads on GBIF has been increasing steadily with some perhaps interesting cyclical patterns. The graph below is interactive. You can see different data views by clicking on the names. 


Popular filters explained

There are many ways that a user can filter data. The types and combinations of filters are almost limitless. Below I describe some of the most common filters:

1. TAXON_KEY

This is one of the most common filters users place on the GBIF occurrence index. Users can choose one or many taxon names to filter the data, and they can choose any taxon rank they want (species, genus, family, kingdom, etc.).

2. COUNTRY

Here users can return records only from a certain country. This is the country the user searched for and not where the user is searching from.

3. HAS_GEOSPATIAL_ISSUE

Here users can specify that they want occurrence records without some interpreted error.

4. HAS_COORDINATE

Here users can say that they want occurrence records that have coordinates.

5. No Filter

Finally, a surprising number of users never apply any filter and instead request to download the entire occurrence index. In the overwhelming majority of cases, we have to assume these users have done this by mistake.
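As a rough illustration of how several of these filters combine, the sketch below builds a download request body in which every filter is an "equals" predicate joined with "and". The overall shape follows my understanding of the GBIF download API, but the exact field names ("creator", "notificationAddresses", "format") should be treated as assumptions. The account, address and filter values are placeholders, and equals and build_request are hypothetical helpers.

```python
import json

def equals(key, value):
    """One 'key equals value' filter."""
    return {"type": "equals", "key": key, "value": value}

def build_request(creator, email, filters):
    """Combine filters with AND into a download request body."""
    return {
        "creator": creator,
        "notificationAddresses": [email],  # assumed field name
        "format": "SIMPLE_CSV",
        "predicate": {"type": "and", "predicates": filters},
    }

request = build_request(
    "example_user",       # hypothetical account name
    "user@example.org",   # hypothetical address
    [
        equals("TAXON_KEY", "212"),             # illustrative key, assumed to be Aves
        equals("COUNTRY", "DK"),                # ISO country code
        equals("HAS_COORDINATE", "true"),
        equals("HAS_GEOSPATIAL_ISSUE", "false"),
    ],
)
print(json.dumps(request, indent=2))
```

Leaving the filter list empty would correspond to the No Filter case described above, which is exactly the request most users probably do not intend to make.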

You can read more about downloads at GBIF here:
http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/

Thursday 22 June 2017

GBIF Name Parser

The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:
  • extract canonical, code relevant name parts
    • populate only the ParsedName class of the GBIF API
    • ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials
  • deal with a wide variety of names that the ParsedName class can represent
    • cultivar names
    • bacterial strains & candidate names
    • virus names
    • named hybrids
    • taxon concept references, sensu lato/stricto or aggregates
    • legacy ranks
  • extract notes often found in names:
    • nomenclatural remarks
    • determination notes like aff. 
    • partially determined species, e.g. only down to the genus: Abies spec.
  • in case author parsing is impossible, fall back to parsing just the canonical name without authors
  • allow slightly imperfect names not strictly well formed according to the rules
  • classify names according to our NameType enumeration
These goals differ slightly from those of gnparser, which explains some of the behavior described in the recent paper from Dmitry Mozzherin 2017. As that paper explains, the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas, though. Hybrid formulas (e.g. Quercus robur x Q. macrocarpa), as opposed to named hybrids (e.g. Quercus x turneri), are a variable combination of names and thus very different from the Linnean names represented by a ParsedName. Hybrid formulas are incompatible with name matching, backbone building and many other tasks, so we decided to treat them just like unparsable virus or OTU names that do not follow the neat structure of Linnean names: we simply keep the entire string as it was, classify it with a NameType and do not parse it further.

GBIF exposes the name parser through the GBIF JSON API. Here are some examples for illustration:
Authorships are not (yet) parsed into a list of individual authors. This has been done internally already and it is something we are likely to expose in the future. Currently the authorship is parsed into four pieces, the authorship and year for the combination and basionym.
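The four authorship pieces can be put back together into a conventional authorship string. The sketch below assumes the JSON field names "authorship", "bracketAuthorship" and "bracketYear" that appear in the parser output quoted in this post. The "year" key is assumed by analogy with "bracketYear", and format_authorship is a hypothetical helper, not part of the parser.

```python
def format_authorship(parsed):
    """Rebuild an authorship string from the four parsed pieces."""
    def join(author, year):
        return ", ".join(p for p in (author, year) if p)

    basionym = join(parsed.get("bracketAuthorship"), parsed.get("bracketYear"))
    combination = join(parsed.get("authorship"), parsed.get("year"))
    parts = []
    if basionym:
        parts.append(f"({basionym})")  # basionym authorship goes in brackets
    if combination:
        parts.append(combination)
    return " ".join(parts)

print(format_authorship({"bracketAuthorship": "Linnaeus", "bracketYear": "1771"}))
print(format_authorship({"authorship": "Kuntze", "bracketAuthorship": "Maxim."}))
```

The two calls print "(Linnaeus, 1771)" and "(Maxim.) Kuntze" respectively.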

gnparser in GBIF

The GNA name parser is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought due to its different parsing output and the mismatch of Scala and Java collections, but working against the NameParser interface you can finally select your parser of choice.

The authorship semantics for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:
GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",

GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
  "authors": ["Maxim."]
},
"combination_authorship": {
  "authors": ["Kuntze"]
}
Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship” as the author of the combination itself.
The gnparser places the author into the basionym authorship instead, even though it is not surrounded by brackets.
As the parser cannot know whether the name actually is a basionym, i.e. whether a subsequent recombination indeed exists, this was slightly unexpected,
and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:
GBIF:
"authorship": "Maxim.",

GNA:
"basionym_authorship": {
  "authors": ["Maxim."]
}
Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:
GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",

GNA:
"basionym_authorship": {
  "authors": ["Linnaeus"],
  "year": {
    "value": "1771"
  }
}
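The swap described above for Rhododendron schlippenbachii Maxim. could look roughly like the following. This is a Python sketch of the idea only: the real wrapper is JVM code, gna_to_gbif is a hypothetical name, and using the "value" string to detect brackets is an assumption. Multiple authors are simply joined with commas here.

```python
def gna_to_gbif(gna):
    """Map a GNA-style authorship dict onto GBIF-style ParsedName fields."""
    basionym = gna.get("basionym_authorship", {})
    combination = gna.get("combination_authorship", {})
    in_brackets = gna.get("value", "").strip().startswith("(")

    # Authors outside brackets and no combination author: treat them as the
    # authors of the combination itself, as the GBIF parser does.
    if basionym and not combination and not in_brackets:
        basionym, combination = {}, basionym

    def authors(part):
        return ", ".join(part.get("authors", [])) or None

    def year(part):
        return part.get("year", {}).get("value")

    fields = {
        "authorship": authors(combination),
        "year": year(combination),
        "bracketAuthorship": authors(basionym),
        "bracketYear": year(basionym),
    }
    return {k: v for k, v in fields.items() if v is not None}


print(gna_to_gbif({"value": "Maxim.",
                   "basionym_authorship": {"authors": ["Maxim."]}}))
print(gna_to_gbif({"value": "(Linnaeus, 1771)",
                   "basionym_authorship": {"authors": ["Linnaeus"],
                                           "year": {"value": "1771"}}}))
```

Only the unbracketed case is swapped. Bracketed authorships such as (Linnaeus, 1771) keep their basionym semantics in both representations.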
Ex authors are not parsed in GBIF (yet). They do cause a little issue as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.

Uninomials are also treated differently. GBIF uses a single property genusOrAbove for the genus part of a binomial, a standalone genus or a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.

Performance

We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggest the GBIF parser is up to twice as fast when running in a Java environment. Wrapping gnparser to produce the ParsedName result class adds some overhead, but even if we just parse the names and do not convert the Scala result into a ParsedName, gnparser takes about 75% more time:
  
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:

  GBIF: 1331ms
  GNA : 2596ms
  GNA-: 2323ms # without wrapper

This contradicts the results presented in the gnparser paper, but might be related to the selection of names or running the parser in different environments.
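For reference, the "up to twice as fast" and "75% more time" figures follow directly from the timings listed above. The short calculation below uses only those numbers:

```python
names = 1380
timings_ms = {"GBIF": 1331, "GNA": 2596, "GNA-": 2323}  # totals from the test run above

baseline = timings_ms["GBIF"]
for parser, total in timings_ms.items():
    per_name = total / names   # average milliseconds per name
    ratio = total / baseline   # relative to the GBIF parser
    print(f"{parser:5s} {per_name:.2f} ms/name  x{ratio:.2f}")
```

The wrapped gnparser comes out at roughly 1.95 times the GBIF parser's total time and the unwrapped one at roughly 1.75 times.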

Future

We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.

Monday 27 February 2017

GBIF Backbone - February 2017 Update

We are happy to announce that a new GBIF Backbone just went live, available also as an improved Darwin Core Archive for download. Here are some facts highlighting the important changes.

New source datasets

Apart from continuously updated sources like the Catalogue of Life or WoRMS, here are the new datasets we used as sources to build the backbone.




The 43 sources used in this backbone build

Code changes




All other fixed issues in the source code that generates the backbone can be found in our Jira epic and GitHub milestone.

Backbone impact

The new backbone has a total of 5,887,500 names of which it treats 2,818,534 species names as accepted (up from 5,307,978 and 2,525,274 respectively).
More backbone metrics are available through our portal and in more detail through our API.


  • 105,296 deleted names, many of them previous erroneous duplicates
  • 685,853 new names
    • Animalia: 164 families; 6,616 genera; 257,196 species; 87,660 infraspecific
    • Archaea: 2 families; 6 genera; 48 species
    • Bacteria: 27 families; 225 genera; 2,470 species; 615 infraspecific
    • Chromista: 2 phyla; 13 classes; 58 orders; 54 families; 767 genera; 12,124 species; 2,953 infraspecific
    • Fungi: 2 families; 269 genera; 8,703 species; 2,993 infraspecific
    • Plantae: 3 families; 795 genera; 63,617 species; 33,282 infraspecific
    • Protozoa: 4 families; 65 genera; 1,412 species; 280 infraspecific
    • Viruses: 8 families; 1,227 genera; 8,488 species
    • Unknown: 4 families; 2,708 genera; 13,076 species; 2,237 infraspecific

A very large and detailed log of the backbone build is also available.

The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:


All contributors to the backbone arranged by number of names the source serves as the primary reference:


Occurrence impact

With a new backbone we have reprocessed all of our 712 million occurrences.

The distribution of the major taxonomic groups exceeding 3%, i.e. having a minimum of 36,800 species, is shown in this last diagram:


The 1,226,520 accepted species in GBIF occurrences (140 fewer than before) represent 44% of all accepted backbone species.
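Both the 44% share and the 3% cut-off used for the diagram can be checked from the figures quoted in this post. The calculation assumes that the 3% refers to the accepted species found in occurrences, which is what the rounded minimum suggests.

```python
accepted_backbone_species = 2_818_534
species_with_occurrences = 1_226_520

share = species_with_occurrences / accepted_backbone_species
threshold = 0.03 * species_with_occurrences  # cut-off used for the diagram

print(f"share of backbone species with occurrences: {share:.1%}")
print(f"3% threshold: {threshold:,.0f} species")
```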

Wednesday 25 January 2017

Sampling-event standard takes flight on the wings of butterflies


Data collected from systematic monitoring schemes is highly valuable. That's because harvesting species data from a given set of sites repeatedly over time using a well-defined sampling effort opens the door to key ecological analyses including phenology, population trends, changes in community structure and other metrics related to a range of Essential Biodiversity Variables (EBVs).

A couple of years ago there was no faithful way to universally standardize data from systematic monitoring schemes. This meant that researchers using this kind of data would need to spend a lot of time deciphering it first. Their job would get even more complicated when trying to integrate data from various heterogeneous sources, each storing their data in different formats, units, etc.

Today, the situation looks much better thanks to a massive collaboration between GBIF, EU BON partners and the wider biodiversity community whose aim was to enable sharing of "sampling-event datasets".  

Indeed, one of the most successful outcomes from this collaboration has been the development of a standardized format for systematic butterfly monitoring schemes.

The format has been developed in close collaboration with the EU BON partners Israel Pe'er (GlueCAD - Biodiversity IT) and his son, Dr. Guy Pe'er (UFZ), who works with systematic monitoring data. The format can be adapted to many other types of systematic monitoring, for many taxonomic groups, as it ensures the following important conditions for researchers are met:
  • all visits to a given site are known, including those with no sightings, as this allows for analyses of species phenology, etc.
  • the range of species being recorded during sampling is explicit, as this allows for true absence to be determined.
  • the location hierarchies can be specified (e.g. the location is a fixed transect or subsection of a transect), as this allows users to group observations by location.
  • enough detailed information about the sampling effort and sampling area (e.g. units of measurement) is captured, as this allows users to calculate density or convert between units of abundance.
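As a small illustration of why these conditions matter, the sketch below computes a density per visit, including a visit with no sightings. The records are invented, the keys follow Darwin Core term names (eventID, locationID, sampleSizeValue, sampleSizeUnit, individualCount), and density_per_visit is a hypothetical helper. It does not reflect how BMS-IL data is actually processed.

```python
from collections import defaultdict

# Hypothetical visits to one fixed transect; the second visit had no sightings.
events = [
    {"eventID": "T1-2016-04-01", "locationID": "T1", "sampleSizeValue": 300, "sampleSizeUnit": "metre"},
    {"eventID": "T1-2016-04-15", "locationID": "T1", "sampleSizeValue": 300, "sampleSizeUnit": "metre"},
]
# Hypothetical sightings linked to visits through eventID.
occurrences = [
    {"eventID": "T1-2016-04-01", "scientificName": "Pieris rapae", "individualCount": 6},
    {"eventID": "T1-2016-04-01", "scientificName": "Vanessa cardui", "individualCount": 3},
]

def density_per_visit(events, occurrences, species):
    """Individuals per unit of transect for every visit, zero if not seen."""
    seen = defaultdict(int)
    for occ in occurrences:
        if occ["scientificName"] == species:
            seen[occ["eventID"]] += occ["individualCount"]
    return {e["eventID"]: seen[e["eventID"]] / e["sampleSizeValue"] for e in events}

print(density_per_visit(events, occurrences, "Pieris rapae"))
```

The zero for the second visit is only a true absence if the species is within the range of species the scheme records, which is why that range has to be explicit.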
The Israeli Butterfly Systematic Monitoring Scheme (BMS-IL) dataset has already been published openly using this format. I'd like to invite everyone to explore this exemplar dataset from either the EU BON IPT or via GBIF.org.

In the future, I hope that GEO BON's Guidelines for Standardized Global Butterfly Monitoring will incorporate a new recommendation that all monitoring programs use this standardized format for sharing their data. Without a doubt this will make researchers' jobs easier when integrating data from several butterfly monitoring programs for their analyses. It will also enable integrating the data with standardized sampling-event data from other disciplines as well.

Ideally, making the data openly available in a standardized format also leads to new collaboration. So far, BMS-IL data has been used to assess trends in the abundance and phenology of Israel's butterflies for the benefit of conservation or climate change research for example. I would like to encourage you to reach out to Israel and Guy Pe'er if you have any novel ideas on how to reuse their newly standardized data in order to help unlock its full potential.

Thursday 12 January 2017

IPT v2.3.3 - Your repository for standardized biodiversity data


GBIF is pleased to announce the release of IPT v2.3.3, now available for download from the IPT website.

This version looks and feels the same as 2.3.2 but is much more robust and secure. I'd like to recommend that all existing IPT installations be upgraded as soon as possible following the instructions listed in the release notes.

Additionally, a couple of new strategic features have been added to the tool to enhance its potential. A description of these new features follows below.

Improved dataset homepage


Compared with general-purpose repositories such as Dryad or Figshare, the IPT ensures that uploaded biodiversity data gets disseminated in a standardized format (Darwin Core Archive - DwC-A), facilitating wider reuse and enabling the data to be indexed by aggregators such as GBIF.org.

Interoperability comes at a small cost though, as depositors choosing to use the IPT must overcome a learning curve in understanding how to map their data to the Darwin Core standard.

To make this easier for depositors, a new set of Darwin Core Excel templates has recently been released. These new templates provide a simpler solution for capturing, formatting and uploading data to the IPT.

Similarly, users of the standardized data need to understand how to unpack a DwC-A and make sense of the data inside.  

Data Records section - RLS Global Reef Fish Dataset
doi:10.15468/qjgwba
To make this process easier for users, a new Data Records section has been added to the dataset homepage that provides an explanation of what the DwC-A format is with a graphic illustration showing the number of records in each file contained within it.
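For readers who want to look inside an archive themselves, the following sketch counts the records in each data file of a DwC-A, similar in spirit to the record counts shown in the Data Records section. It relies on the archive being a zip file whose meta.xml lists a core file and optional extension files. It assumes one record per line, and count_records is a hypothetical helper. The tiny archive built at the end is made up so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

def count_records(archive):
    """Return {file name: number of data rows} for the files listed in meta.xml."""
    counts = {}
    with zipfile.ZipFile(archive) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        for table in meta:  # <core> and <extension> elements
            location = table.find(f"{DWC_TEXT_NS}files/{DWC_TEXT_NS}location").text
            header_lines = int(table.get("ignoreHeaderLines", "0"))
            with zf.open(location) as data:
                rows = sum(1 for _ in data)  # assumes one record per line
            counts[location] = rows - header_lines
    return counts


# Build a tiny in-memory archive to try it out (contents are made up).
META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Event" ignoreHeaderLines="1">
    <files><location>event.txt</location></files>
  </core>
  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.txt</location></files>
  </extension>
</archive>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("meta.xml", META)
    zf.writestr("event.txt", "eventID\nE1\nE2\n")
    zf.writestr("occurrence.txt", "occurrenceID\teventID\nO1\tE1\nO2\tE1\nO3\tE2\n")

print(count_records(buffer))
```

Passing the path of a downloaded archive instead of the in-memory buffer should work the same way, as long as the data files do not contain line breaks inside quoted fields.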

Overall this advancement will strengthen the IPT as a data repository, which is already capable of assigning DOIs to datasets to make them discoverable and citable. 

Translation into Russian 


Map of IPT installations in Russia - January 2017
The IPT is installed in 52 countries around the world, but its use is heavily underrepresented across Russian-speaking countries. Therefore, to extend the IPT's reach in these areas, the user interface has been fully translated into Russian by a team of volunteer translators, with the largest contribution made by Ivan Chadin from the Komi Science Centre of the Ural Branch of the Russian Academy of Sciences.

Map of data published by Russia - January 2017
At the time of writing there were already 18 datasets from Russia published by 5 IPTs installed across Pushchino, Moscow, St Petersburg and the Komi Republic. It will be exciting to watch this number grow over time in part thanks to this enormous volunteer contribution.



Acknowledgements


Once again I'd like to recognize all the volunteer translators that contributed their time and expertise to making this new version available in seven different languages:
  • Sophie Pamerlon (GBIF France) - Updating French translation
  • Yukiko Yamazaki (GBIF Japan (JBIF)) - Updating Japanese translation
  • Daniel Lins (Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp) - Updating Portuguese translation
  • Néstor Beltrán (Colombian Biodiversity Information System (SiB Colombia)) - Updating Spanish translation
  • Ivan Chadin (Institute of Biology of Komi Scientific Centre of the Ural Branch of the Russian Academy of Sciences), Max Shashkov (Institute of Physicochemical and Biological Problems in Soil Science, Russian Academy of Science) and Artyom Leostrin (Komarov Botanical Institute of the Russian Academy of Sciences (Saint-Petersburg)) - Adding Russian translation 
I'd also like to recognize a few volunteers that helped make significant improvements to the IPT codebase:
  • Bruno P. Kinoshita (National Institute of Water and Atmospheric Research (NIWA)) - Fixed issue #1241, ensuring the IPT can be installed on a server behind a proxy
  • Pieter Provoost (UNESCO) - Fixed issue #1248, improving the IPT's RSS feed
  • Tadj Youssouf (Security researcher, fb.com/oc3f.dz) - Helped address a cross site scripting issue
Although the core development of the IPT happens at the GBIF Secretariat, the coding, documentation, and internationalization are a community effort and everyone is welcome to join in.

I look forward to seeing the IPT's community of volunteers and users continue to grow and hope you can unlock the full potential of this publishing tool and repository.