Tuesday, 4 December 2018

Goodbye developer blog, hello data-blog!





GBIF has a new blog!



What is it?

A place for GBIF staff and guest bloggers to contribute:
  • Statistics 
  • Graphs 
  • Tutorials 
  • Ideas 
  • Opinions 

Who can contribute?

If you would like to contribute, contact jwaller@gbif.org. Guest blogs are very welcome.

How can I write a post?

There is a short tutorial on the blog's GitHub.

What about the developer blog?

The developer blog will remain up as an archive, but there are no plans to actively post new content here.


Friday, 27 July 2018

How popular is your favorite species?






How to use

Use the box to the left to type in the species you are interested in.
Make sure to use a scientific name:
  • Aves instead of birds
  • Plantae instead of plants
  • Anura instead of frogs

Explanation of tool

This tool plots the downloads through time for species or other taxonomic groups with more than 25 downloads at GBIF. Downloads at GBIF most often occur through the web interface. In a previous post, we saw that most users download data from GBIF by filtering on scientific name (aka Taxon Key). Since the GBIF index currently sits at over 1 billion records (a 400+ GB CSV), most users will simply filter by their taxonomic group of interest and then generate a download.

How to bookmark a result?

If you would like to bookmark a result or graph to share with others, you can visit the app page directly: app link. On this page the state of the app is saved inside the URL. You can also save a JPG by clicking on the little sandwich icon in the top right.

What counts as a download?

For the graphs above, I decided that it would be more meaningful to roll up downloads below the queried taxonomic level.
  • If a user downloaded 5 different bird species at once, this would count as 1 download for Aves and 1 download for each of the species downloaded.
  • If a user only typed Aves into the occurrence download interface and no other species, this would count as 1 download for Aves and 0 downloads for all bird species.
  • Similarly, if a user only typed the order Passeriformes into the search, this would count as 1 download for Passeriformes and 1 download for Aves (and 1 download for Animalia etc.) but 0 downloads for all the species, families, and genera within Passeriformes.
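The counting rule in the three cases above can be sketched in a few lines of Python. This is only an illustration, not the code behind the tool: the `PARENT` table is a hypothetical toy taxonomy (family and genus levels are left out for brevity), and `lineage` and `count_downloads` are made-up helper names.

```python
from collections import Counter

# Hypothetical, simplified taxonomy: child -> parent (None marks the root).
# Family and genus levels are omitted to keep the example short.
PARENT = {
    "Animalia": None,
    "Aves": "Animalia",
    "Passeriformes": "Aves",
    "Passer domesticus": "Passeriformes",
    "Turdus merula": "Passeriformes",
}


def lineage(taxon):
    """Return the taxon followed by all of its ancestors."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = PARENT[taxon]
    return chain


def count_downloads(downloads):
    """Credit each download once to every queried taxon and each ancestor."""
    counts = Counter()
    for queried_taxa in downloads:
        credited = set()  # a set, so one download never counts twice for Aves
        for taxon in queried_taxa:
            credited.update(lineage(taxon))
        counts.update(credited)
    return counts


# Three downloads: two species at once, Aves only, Passeriformes only.
example = count_downloads([
    ["Passer domesticus", "Turdus merula"],
    ["Aves"],
    ["Passeriformes"],
])
```

Here `example["Aves"]` is 3 and `example["Passeriformes"]` is 2, while each species is credited only for the one download that named it. Credit flows up the tree, never down.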
It is possible, but not as easy, to get data from GBIF without generating a download: users can stream data through the GBIF occurrence API instead. Currently users can “download” 200k-long chunks of occurrence data this way. If someone got their data through the API like this, we would currently not be able to track it. Presumably, the vast majority of users get their data directly through the web interface.
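To give an idea of what streaming through the API looks like, here is a minimal sketch that only builds the request URLs and makes no network calls. The endpoint, the `taxonKey`, `limit` and `offset` parameter names, and the page size of 300 are assumptions based on the public occurrence search API, and reading the 200k figure as an upper bound on paging is an interpretation of the text above. `page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.gbif.org/v1/occurrence/search"  # assumed endpoint
PAGE_SIZE = 300        # assumed maximum number of records per request
MAX_RECORDS = 200_000  # the 200k figure from the text, read as a cap


def page_urls(taxon_key, total_wanted):
    """Yield one search URL per page, never going past the cap."""
    total = min(total_wanted, MAX_RECORDS)
    for offset in range(0, total, PAGE_SIZE):
        limit = min(PAGE_SIZE, total - offset)
        query = urlencode(
            {"taxonKey": taxon_key, "limit": limit, "offset": offset}
        )
        yield f"{BASE_URL}?{query}"


# 212 is assumed to be the taxon key for Aves.
urls = list(page_urls(212, 700))
```

Fetching each URL in turn would return the records page by page, and none of those requests would show up in the download counts plotted above.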

For more technical details on this tool, you can visit my personal blog:
http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/




Thursday, 28 June 2018

Occurrence Downloads

Occurrences at GBIF are often downloaded through the web interface or through the API (via rgbif etc.). Users can place various filters on the data in order to limit the number of records returned. As the occurrence index is currently a 447 GB CSV, most users want to use a filter.

Total monthly downloads

Here I plot the total monthly downloads for various popular filters. For the past few years, GBIF has been averaging around 10k downloads per month.

Two peaks in total downloads stand out:
  • Mar 2014
  • Sep 2016
The Sep 2016 peak seems to be explained by high DATASET_KEY downloads. Both the Mar 2014 and Sep 2016 peaks are well explained by the top users. "Top users" in this graph means all the downloads generated by the 3 most active users on GBIF. These users generate downloads in the 1000s, most likely automated downloads generated internally.

One interesting detail is that while No Filter Used is not used very often, it accounts for more than 500 billion occurrence records downloaded.

Finally, if we look at the number of unique users (un-select everything else to see in isolation), we see that the number of individuals making downloads on GBIF has been increasing steadily with some perhaps interesting cyclical patterns. The graph below is interactive. You can see different data views by clicking on the names. 


Popular filters explained

There are many ways that a user can filter data. The types and combinations of filters are almost limitless. Below I describe some of the most common filters:

1. TAXON_KEY

This is one of the most common filters users place on the GBIF occurrence index. Users can choose one or many taxon names to filter the data, at any taxon rank they want (species, genus, family, kingdom etc.).

2. COUNTRY

Here users can return records only from a certain country. This is the country the user searched for, not the country the user is searching from.

3. HAS_GEOSPATIAL_ISSUE

Here users can specify that they want occurrence records without a flagged geospatial issue, that is, without an error detected when the record's location was interpreted.

4. HAS_COORDINATE

Here users can say that they want occurrence records that have coordinates.

5. No Filter

Finally, a surprising number of users never set any filter and instead request to download the entire occurrence index. In the overwhelming majority of cases, we have to assume these users have done this by mistake.
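For readers who script their downloads, the filters above can be combined into a single request. The sketch below only builds the request body and sends nothing. The JSON layout (`creator`, `format`, and a nested `predicate` with `type`, `key` and `value`) is an assumption based on the GBIF download API, and `equals` and `build_request` are hypothetical helpers, so check the API documentation before relying on it.

```python
import json


def equals(key, value):
    """One 'key equals value' filter in the assumed predicate layout."""
    return {"type": "equals", "key": key, "value": value}


def build_request(creator, taxon_key, country):
    """Combine four of the popular filters with a logical AND."""
    return {
        "creator": creator,      # GBIF user name (assumed field name)
        "format": "SIMPLE_CSV",  # assumed format name
        "predicate": {
            "type": "and",
            "predicates": [
                equals("TAXON_KEY", str(taxon_key)),
                equals("COUNTRY", country),
                equals("HAS_COORDINATE", "true"),
                equals("HAS_GEOSPATIAL_ISSUE", "false"),
            ],
        },
    }


# Placeholder user name; 212 is assumed to be the taxon key for Aves.
body = build_request("some_user", 212, "DK")
payload = json.dumps(body, indent=2)
```

Leaving the `predicate` out entirely would correspond to the No Filter case described above.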

You can read more about downloads at GBIF here:
http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/