Monday 16 May 2011

Here be dragons - mapping occurrence data

One of the most compelling ways of viewing GBIF data is on a map.  While name lists and detailed text are useful if you know what you're looking for, a map can give you the overview you need to start honing your search.  I've always liked playing with maps in web applications, and recently I had the chance to add functionality to our new Hadoop/Hive processing that answers the question "what species occurrence records exist in country x?".

Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors - typically typos, and often one or both of lat and long reversed.  Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA.  Note the common mistakes, which result in glaring errors: reversing the longitude produces the near-perfect mirror over China; reversing the latitude produces a faint image over the Pacific off the coast of Chile; reversing both produces an even fainter image off Australia; and setting 0 for lat or long produces tell-tale straight lines along the Prime Meridian and the equator.
Map 1: Verbatim (unprocessed) occurrence data coordinates for the USA
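
To make those error patterns concrete, here is a minimal Java sketch of the kind of sanity check that could flag them for a record claiming to be in the USA. The rough bounding box and all names are illustrative assumptions on my part, not the portal's actual logic.

public class CoordinateSanityCheck {

  // Very rough USA bounding box (continental US plus Alaska), for illustration only.
  private static boolean looksLikeUsa(double lat, double lng) {
    return lat > 24 && lat < 72 && lng > -180 && lng < -65;
  }

  // Flags the common verbatim-coordinate problems described above.
  public static String flag(double lat, double lng) {
    if (lat == 0 || lng == 0) return "ZERO_COORDINATE";       // lines on the equator / Prime Meridian
    if (looksLikeUsa(lat, lng)) return "OK";
    if (looksLikeUsa(lat, -lng)) return "LONGITUDE_REVERSED"; // the mirror over China
    if (looksLikeUsa(-lat, lng)) return "LATITUDE_REVERSED";  // the faint image off Chile
    if (looksLikeUsa(-lat, -lng)) return "BOTH_REVERSED";     // the fainter image off Australia
    return "OUT_OF_RANGE";
  }
}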
One of the goals of the GBIF Secretariat is to help publishers improve their data, and identifying and reporting back these types of problems is one way of doing that.  Of course the current GBIF data portal attempts to filter these records before displaying them.  The current system verifies that given coordinates fall within the country they claim by overlaying a 1-degree grid on the world map and identifying each of those grid cells as belonging to one or more countries.  This overlay is curated by hand, which makes it error prone and time consuming to maintain.
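
In rough terms, that grid lookup amounts to something like the sketch below; the data structure and names are my own illustration of the approach, not the portal's actual code.

import java.util.Map;
import java.util.Set;

public class GridCountryLookup {

  // Hand-curated table mapping each 1x1 degree cell, keyed by the "lat,lng" of its
  // south-west corner, to the set of countries it touches, e.g. "38,-100" -> {"US"}.
  private final Map<String, Set<String>> cellToCountries;

  public GridCountryLookup(Map<String, Set<String>> cellToCountries) {
    this.cellToCountries = cellToCountries;
  }

  // True if the coordinate falls in a cell curated as belonging to the claimed country.
  public boolean isPlausiblyIn(double lat, double lng, String isoCountryCode) {
    String cell = (int) Math.floor(lat) + "," + (int) Math.floor(lng);
    Set<String> countries = cellToCountries.get(cell);
    return countries != null && countries.contains(isoCountryCode);
  }
}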

The results of doing a lookup against the overlay are shown in Map 2, where a number of bugs in the processing are still visible: parts of the mirror over China remain; none of the coastal waters that are legally US territory (i.e. the Exclusive Economic Zone of 200 nautical miles offshore) are shown; the Aleutian Islands off the coast of Alaska are missing; and some spots around the world are allowed through, including 0,0 and a few seemingly random points.
Map 2: Results of current data portal processing for occurrences in the USA
My work, then, was to build new processing into our Hive/Hadoop workflow that addresses these problems and produces a map that is as close to error free as possible.  The starting point is a webservice that can answer the question "In what country (including coastal waters) does this lat/long pair fall?".  This is clearly a GIS problem - in GIS-speak it is a reverse geocode - and something that PostGIS is well equipped to provide.  Because country definitions and borders change semi-regularly, it seemed wisest to use a trusted source of country boundaries (shapefiles) that we could replace whenever needed.  Similarly we needed the boundaries of Exclusive Economic Zones to cover coastal waters.  The political boundaries come from Natural Earth, and the EEZ boundaries shapefile comes from the VLIZ Maritime Boundaries Geodatabase.

While not an especially difficult query to formulate, a word to the wise: if you're doing this kind of reverse geocode lookup, remember to build your query by first scoping the distance check to the bounding box of the enclosing polygon, like so:
where the_geom && ST_GeomFromText(#{point}, 4326)
  and ST_Distance(the_geom, ST_GeomFromText(#{point}, 4326)) < 0.001
This buys an order of magnitude improvement in query response time!
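
Putting it together, the lookup the webservice runs ends up along the lines of the JDBC sketch below. The table and column names (political_boundaries, iso_country_code, the_geom) are assumptions based on the description above rather than the real schema, and only the political-boundaries half of the lookup is shown.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReverseGeocoder {

  // The && bounding-box test lets PostGIS use its spatial index to prune candidate
  // polygons before the more expensive distance test runs.
  private static final String SQL =
      "select iso_country_code from political_boundaries "
      + "where the_geom && ST_GeomFromText(?, 4326) "
      + "and ST_Distance(the_geom, ST_GeomFromText(?, 4326)) < 0.001";

  // Returns the first matching country code, or null if the point matches no country.
  public static String countryOf(Connection conn, double lat, double lng) throws SQLException {
    String point = "POINT(" + lng + " " + lat + ")"; // WKT is x y, i.e. lng lat
    PreparedStatement ps = conn.prepareStatement(SQL);
    try {
      ps.setString(1, point);
      ps.setString(2, point);
      ResultSet rs = ps.executeQuery();
      return rs.next() ? rs.getString(1) : null;
    } finally {
      ps.close();
    }
  }
}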

With a thin webservice wrapper built with Jersey, we have the GIS pieces in place.  We opted for a webservice approach so that we can ultimately expose this quality-control utility externally.  Since we process in Hadoop, this webservice came under huge stress - we were effectively DDoS'ing ourselves.  I mentioned a similar approach in my last entry, where we alleviated the problem by load balancing across multiple machines.  And in case anyone is wondering why we didn't just use Google's reverse-geocoding webservice, the answer is twofold: first, it would violate their terms of use, and second, even if we were allowed, they impose a rate limit on how many queries you can send over time, which would have brought our workflow to its knees.
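
For reference, a thin Jersey (JAX-RS) wrapper around that lookup can be as small as the sketch below; the path, parameter names and the CountryLookup abstraction are hypothetical, not the actual occurrence-spatial API.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

// Hypothetical resource answering e.g. GET /reverse-geocode?lat=38.9&lng=-77.0 with "US".
@Path("/reverse-geocode")
public class ReverseGeocodeResource {

  // Hypothetical abstraction over the PostGIS-backed lookup sketched earlier.
  public interface CountryLookup {
    String countryOf(double lat, double lng);
  }

  private final CountryLookup lookup;

  public ReverseGeocodeResource(CountryLookup lookup) {
    this.lookup = lookup;
  }

  @GET
  @Produces(MediaType.TEXT_PLAIN)
  public String get(@QueryParam("lat") double lat, @QueryParam("lng") double lng) {
    String country = lookup.countryOf(lat, lng);
    return country == null ? "" : country;
  }
}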

The last piece of the puzzle is calling the webservice from a Hive UDF and adding that to our workflow, which is reasonably straightforward.  The result of the new processing is shown in Map 3, where the problems of Map 2 are all addressed.
Map 3: Results of new processing workflow for occurrences in the USA
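
As an illustration, the UDF end of that call can be as simple as the sketch below; the class name, service URL and example table are hypothetical, and the real calls go through the load-balanced webservice described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical Hive UDF that asks the reverse geocode webservice which country a
// lat/lng pair falls in, so HiveQL can do e.g.
//   select country_of(latitude, longitude) from raw_occurrence_record;
public class CountryOfUdf extends UDF {

  public Text evaluate(Double lat, Double lng) throws Exception {
    if (lat == null || lng == null) {
      return null;
    }
    URL url = new URL("http://geocode.example.org/reverse-geocode?lat=" + lat + "&lng=" + lng);
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
      String country = in.readLine();
      return country == null || country.isEmpty() ? null : new Text(country.trim());
    } finally {
      in.close();
    }
  }
}

The jar then just needs to be added and the function registered in Hive (ADD JAR / CREATE TEMPORARY FUNCTION) before the workflow can call it.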
These maps and the cleanup processing behind them will replace the existing maps and processing in our data portal later this year, hopefully in as little as a few months.

You can find the source of the reverse-geocode webservice at the Google Code site for the occurrence-spatial project.  Similarly you can browse the source of the Hadoop/Hive workflow and the Hive UDFs.

8 comments:

  1. This is so amazing! Great work, guys!

  2. Thanks, the improvement in map 3 is rather impressive!

    I have a few "workflow" questions though:

    1) Are the records edited after this has been done (for example, removing coordinate pairs that are obviously wrong, such as 0,0, and marking the record as not georeferenced), or is this only applied to the "map view"?
    2) Do you plan in the future to report the results of these quality checks to the data providers? Maybe I'm a little naive, but I think an ideal solution would be to report these so the data provider can correct them at the source.
    3) Do you plan other tools/initiatives to encourage as much data cleaning as possible to be done "at the source"? It seems to have some advantages: avoiding possible update conflicts when re-indexing, lessening the load on GBIF servers when indexing the same data again, and making providers more aware of data quality issues (I imagine you can sometimes have strange values that are NOT errors, and the data curator is the most competent person to judge that)...

    Thanks again for your work at the secretariat and the willingness to communicate about what you do; it's really great and useful for people like me :)

    Nicolas

  3. Hi Nicolas,

    1) We don't edit records here - only mark them with an issue flag. Maps only show records with no known issues.

    2) We already do so in some manner, through the event log for the data owner. We plan to improve that though.

    3) Definitely, cleaning at the source is the best option. I have not looked much at those tools myself yet, but plan to. CRIA, I know, have developed great tools for that.

    Cheers,
    Tim

  4. Thanks Tim, that gives me a better understanding!

  5. A real improvement - well done.

    Agree with niconoe on 2. Feedback of errors to source data providers is key to closing the georeferencing error loop. I'm sure there are many cases where GBIF data is being downloaded and cleaned for use (perhaps multiple times), but the errors are not getting back to the provider... so the GBIF data doesn't get any cleaner. It would be great to have a better mechanism for the feedback.

  6. Thanks Steven. We do indeed plan on improving this and aim to be able to demonstrate annotation brokerage towards the end of 2011 to help provide better feedback.

  7. Nice workflow! But we need to make better maps!!! These look outdated! :D I volunteer to help you with that :)

  8. Do you still have available the records that failed validation in the two steps? I'd like to have a look at those sets of "outliers".
