The challenge here is twofold. First, we see many variations of each country name, all of which need to be interpreted. Some examples for Argentina are given below (there are hundreds of variations per country):
- Argent.
- Argentina
- Argentiana
- N Argentina
- N. Argentina
- ARGENTINA
- ARGENTINIA
- ARGENTINNIA
- "ARGENTINIA"
- ""ARGENTINIA""
- etc.
We have abstracted the parsing code into a separate Java library which makes use of basic algorithms and dictionary files to help interpret the results. This library might be useful for other tools requiring similar interpretation, or data cleaning efforts, and will be maintained over time as it will be in use in several GBIF tools.
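To make the idea concrete, here is a minimal sketch of dictionary-based country name interpretation. It is not the GBIF library's code: the class name `CountryNameSketch`, the tiny in-memory dictionary, the compass-qualifier rule, and the edit-distance threshold of 2 are all assumptions chosen for illustration. A real dictionary would be loaded from files and would be far larger.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class CountryNameSketch {
    // Hypothetical dictionary: cleaned name or known variant -> ISO 3166-1 alpha-2 code.
    private static final Map<String, String> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("ARGENTINA", "AR");
        DICTIONARY.put("ARGENT", "AR"); // abbreviation listed explicitly
        DICTIONARY.put("UNITED STATES", "US");
    }

    /** Upper-cases, replaces quotes and punctuation with spaces, collapses whitespace. */
    static String clean(String raw) {
        return raw.toUpperCase(Locale.ROOT)
                .replaceAll("[^\\p{L} ]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    /** Returns the ISO code for a raw country string, or empty if it cannot be interpreted. */
    static Optional<String> interpret(String raw) {
        if (raw == null) return Optional.empty();
        String name = clean(raw);
        if (DICTIONARY.containsKey(name)) return Optional.of(DICTIONARY.get(name));
        // Retry without a leading compass qualifier such as "N" or "NORTH".
        String core = name.replaceFirst("^(N|S|E|W|NORTH|SOUTH|EAST|WEST) ", "");
        if (DICTIONARY.containsKey(core)) return Optional.of(DICTIONARY.get(core));
        return fuzzy(core);
    }

    /** Fuzzy fallback: an unambiguous entry within edit distance 2, for names of 8+ letters only. */
    private static Optional<String> fuzzy(String name) {
        if (name.length() < 8) return Optional.empty(); // short names are too easy to confuse
        String best = null;
        for (Map.Entry<String, String> e : DICTIONARY.entrySet()) {
            if (levenshtein(name, e.getKey()) <= 2) {
                if (best != null && !best.equals(e.getValue())) return Optional.empty(); // ambiguous
                best = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    /** Classic two-row Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this sketch, `CountryNameSketch.interpret("N. Argentina")` and `CountryNameSketch.interpret("ARGENTINNIA")` both resolve to `AR`, while an unknown name such as `Brazil` yields an empty result. The length guard on the fuzzy step is there because short names are easily confused (for example, Niger and Nigeria differ by only two letters).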
The second challenge is determining whether the point falls within the stated country. There is always room for improvement in this area, such as accounting for boundary changes over time, but the raw data contains so many outliers that a check like this is required. Our implementation is a very basic reverse georeferencing RESTful web service that takes a latitude and longitude and returns the proposed country along with some basic information such as the title. Operating the service requires PostGIS and a Java server such as Apache Tomcat. We currently use freely available terrestrial shapefiles and marine exclusive economic zones (EEZ). Expanding the service to use more shapefiles for other purposes would be trivial, and we expect this to happen over time. The GBIF service is currently internal and used only for processing, but we expect to release it for public use in the coming months.
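The underlying check can be illustrated with a small point-in-polygon test. This is a pure-Java stand-in, not the service's implementation (which relies on PostGIS and real shapefiles). The class name `PointInCountrySketch`, the method names, and the crude rectangle used for Argentina are assumptions for illustration only; the rectangle is not a real boundary. The test is planar and ignores holes, multi-part polygons, and the antimeridian.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PointInCountrySketch {
    // Hypothetical boundary data: ISO code -> ring of {latitude, longitude} vertices.
    private static final Map<String, double[][]> BOUNDARIES = new LinkedHashMap<>();
    static {
        // Crude illustrative box around Argentina, NOT a real border.
        BOUNDARIES.put("AR", new double[][] {
            {-55.0, -73.0}, {-55.0, -53.0}, {-22.0, -53.0}, {-22.0, -73.0}
        });
    }

    /** Ray casting: toggles "inside" each time a ray running east from the point crosses an edge. */
    static boolean contains(double[][] ring, double lat, double lon) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double latI = ring[i][0], lonI = ring[i][1];
            double latJ = ring[j][0], lonJ = ring[j][1];
            // The edge matters only if its endpoints lie on opposite sides of the point's latitude.
            boolean straddles = (latI > lat) != (latJ > lat);
            if (straddles && lon < (lonJ - lonI) * (lat - latI) / (latJ - latI) + lonI) {
                inside = !inside;
            }
        }
        return inside;
    }

    /** True if the coordinate falls inside the boundary stored for the given country code. */
    static boolean coordinateMatchesCountry(String isoCode, double lat, double lon) {
        double[][] ring = BOUNDARIES.get(isoCode);
        return ring != null && contains(ring, lat, lon);
    }
}
```

For example, `PointInCountrySketch.coordinateMatchesCountry("AR", -34.6, -58.4)` (roughly Buenos Aires) returns `true`, while a point at latitude 40.0, longitude -100.0 returns `false` and would be flagged as an outlier for a record claiming Argentina.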
Better country name interpretation and a more accurate geospatial verification service than before will help improve national-level data reporting in the GBIF portal, as indicated here.
Country | Processing | # Records | # Georeferenced
---|---|---|---
Argentina | Previously | 665,284 | 284,012
Argentina | Now | 680,344 | 303,889
United States | Previously | 79,432,986 | 68,900,415
United States | Now | 81,483,086 | 70,588,182