Thursday 20 October 2011

GBIF Portal: Geographic interpretations

The new portal processing is about to go into production, and during testing I gathered some metrics on the revised geographic interpretation.  The issue is simple to state: many records have coordinates that contradict the country the record claims to be in.  Some illustrations of this were previously shared by Oliver.

The challenge is twofold.  Firstly, we see many variations in the country name, each of which needs to be interpreted.  Some examples for Argentina are given below (there are hundreds of variations per country):

  • Argent.
  • Argentina
  • Argentiana
  • N Argentina
  • N. Argentina
  • ARGENTINA
  • ARGENTINIA
  • ARGENTINNIA
  • "ARGENTINIA"
  • ""ARGENTINIA""
  • etc.

We have abstracted the parsing code into a separate Java library, which uses basic algorithms and dictionary files to help interpret these values.  This library might be useful for other tools requiring similar interpretation, or for data cleaning efforts, and it will be maintained over time since it will be used in several GBIF tools.
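
As a rough illustration of the kind of cleaning plus dictionary lookup described above, here is a minimal Java sketch.  It is not the GBIF library itself: the class name `CountryNameSketch`, the tiny in-memory dictionary, and the rule for dropping compass prefixes such as "N." are all assumptions made for this example.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of country name interpretation; not the GBIF parsing library. */
final class CountryNameSketch {

    // Illustrative dictionary: cleaned variant -> ISO 3166-1 alpha-2 code.
    // A real dictionary file would hold hundreds of variants per country.
    private static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("ARGENTINA", "AR");
        DICTIONARY.put("ARGENT", "AR");
        DICTIONARY.put("ARGENTIANA", "AR");
        DICTIONARY.put("ARGENTINIA", "AR");
        DICTIONARY.put("ARGENTINNIA", "AR");
    }

    /** Reduces a raw value to upper-case letters separated by single spaces. */
    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        String s = Normalizer.normalize(raw, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "")        // strip combining accents
                .toUpperCase(Locale.ROOT)
                .replaceAll("[^A-Z ]", " ")      // drop quotes, dots, digits
                .trim()
                .replaceAll("\\s+", " ");        // collapse repeated spaces
        // Assumption: a leading compass qualifier ("N ARGENTINA") can be ignored.
        // This would need exceptions for names where the qualifier matters.
        return s.replaceFirst("^(N|S|E|W|NE|NW|SE|SW) ", "");
    }

    /** Returns the country code for a raw value, or empty if it is not in the dictionary. */
    static Optional<String> interpret(String raw) {
        return Optional.ofNullable(DICTIONARY.get(clean(raw)));
    }
}
```

With this sketch, `CountryNameSketch.interpret("\"\"ARGENTINIA\"\"")` and `CountryNameSketch.interpret("N. Argentina")` both resolve to `AR`, while an unknown value yields an empty `Optional` so that it can be reported rather than guessed.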

The second challenge is to determine whether the point falls within the stated country.  There is always room for improvement in this area, such as understanding changes over time, but given the huge volume of outliers in the raw data, a check like this is required.  Our implementation is a very basic reverse georeferencing RESTful web service that takes a latitude and longitude and returns the proposed country along with some basic information such as the title.  Operating the service requires PostGIS and a Java server such as Apache Tomcat.  Currently we make use of freely available terrestrial shapefiles and marine exclusive economic zones (EEZs).  It would be trivial to extend the service with more shapefiles for other uses, and we expect this to happen over time.  Currently the GBIF service is for internal processing only, but it is expected to be released for public use in the coming months.
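
To make the idea of the check concrete, here is a minimal point-in-polygon sketch in plain Java using ray casting.  This is a stand-in for illustration only: the actual service relies on PostGIS and real shapefiles, and the class name `CountryCheckSketch` and the crude rectangle used in place of Argentina's border are assumptions, not real boundary data.

```java
/** Hypothetical sketch of a "does the point fall in the country" check. */
final class CountryCheckSketch {

    // Assumption: a crude rectangle roughly covering Argentina, for demonstration.
    // A real check would use polygons loaded from shapefiles.
    static final double[] DEMO_LONS = {-73.6, -53.6, -53.6, -73.6};
    static final double[] DEMO_LATS = {-55.1, -55.1, -21.8, -21.8};

    /**
     * Ray casting: counts how many polygon edges a ray running east from the
     * point crosses. An odd count means the point is inside the polygon.
     */
    static boolean contains(double[] lons, double[] lats, double lon, double lat) {
        boolean inside = false;
        for (int i = 0, j = lons.length - 1; i < lons.length; j = i++) {
            // Does the edge from vertex j to vertex i straddle the point's latitude?
            boolean straddles = (lats[i] > lat) != (lats[j] > lat);
            if (straddles) {
                // Longitude at which the edge crosses that latitude.
                double lonAtLat = lons[j]
                        + (lat - lats[j]) * (lons[i] - lons[j]) / (lats[i] - lats[j]);
                if (lon < lonAtLat) {
                    inside = !inside;
                }
            }
        }
        return inside;
    }
}
```

For example, `CountryCheckSketch.contains(DEMO_LONS, DEMO_LATS, -58.4, -34.6)` is true for a point near Buenos Aires, while the same coordinates with the latitude sign flipped (34.6) fall outside, which is the kind of contradiction the check is meant to flag.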

Improving the country name interpretation and using a more accurate geospatial verification service than before will help improve data reporting at the national level in the GBIF portal, as indicated here.

Country        Processing     # Records   # Georeferenced
Argentina      Previously       665,284           284,012
               Now              680,344           303,889
United States  Previously    79,432,986        68,900,415
               Now           81,483,086        70,588,182


