Monday 16 May 2011

Here be dragons - mapping occurrence data

One of the most compelling ways of viewing GBIF data is on a map.  While name lists and detailed text are useful if you know what you're looking for, a map can give you the overview you need to start honing your search.  I've always liked playing with maps in web applications, and recently I had the chance to add functionality to our new Hadoop/Hive processing that answers the question "what species occurrence records exist in country x?".

Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors - typically typos, and often one or both of lat and long reversed.  Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA.  Note the common mistakes, which result in glaring errors: reversing the longitude produces the near-perfect mirror over China; reversing the latitude produces a faint image over the Pacific off the coast of Chile; reversing both produces an even fainter image off Australia; and setting 0 for lat or long produces tell-tale straight lines along the Prime Meridian and the equator.
Map 1: Verbatim (unprocessed) occurrence data coordinates for the USA
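
To make those error patterns concrete, here is a minimal Java sketch of the kind of sanity check that could flag them for a record claiming to be in the USA. The rough bounding box and all names are illustrative assumptions on my part, not the portal's actual logic.

public class CoordinateSanityCheck {

  // Very rough USA bounding box (continental US plus Alaska), for illustration only.
  private static boolean looksLikeUsa(double lat, double lng) {
    return lat > 24 && lat < 72 && lng > -180 && lng < -65;
  }

  // Flags the common verbatim-coordinate problems described above.
  public static String flag(double lat, double lng) {
    if (lat == 0 || lng == 0) return "ZERO_COORDINATE";       // lines on the equator / Prime Meridian
    if (looksLikeUsa(lat, lng)) return "OK";
    if (looksLikeUsa(lat, -lng)) return "LONGITUDE_REVERSED"; // the mirror over China
    if (looksLikeUsa(-lat, lng)) return "LATITUDE_REVERSED";  // the faint image off Chile
    if (looksLikeUsa(-lat, -lng)) return "BOTH_REVERSED";     // the fainter image off Australia
    return "OUT_OF_RANGE";
  }
}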
One of the goals of the GBIF Secretariat is to help publishers improve their data, and identifying and reporting back these types of problems is one way of doing that.  Of course the current GBIF data portal attempts to filter these records before displaying them.  The current system verifies that given coordinates fall within the country they claim by overlaying a 1-degree grid on the world map and identifying each of those grid cells as belonging to one or more countries.  This overlay is curated by hand, which makes it error prone and time consuming to maintain.
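
In rough terms, that grid lookup amounts to something like the sketch below; the data structure and names are my own illustration of the approach, not the portal's actual code.

import java.util.Map;
import java.util.Set;

public class GridCountryLookup {

  // Hand-curated table mapping each 1x1 degree cell, keyed by the "lat,lng" of its
  // south-west corner, to the set of countries it touches, e.g. "38,-100" -> {"US"}.
  private final Map<String, Set<String>> cellToCountries;

  public GridCountryLookup(Map<String, Set<String>> cellToCountries) {
    this.cellToCountries = cellToCountries;
  }

  // True if the coordinate falls in a cell curated as belonging to the claimed country.
  public boolean isPlausiblyIn(double lat, double lng, String isoCountryCode) {
    String cell = (int) Math.floor(lat) + "," + (int) Math.floor(lng);
    Set<String> countries = cellToCountries.get(cell);
    return countries != null && countries.contains(isoCountryCode);
  }
}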

The results of doing a lookup against the overlay are shown in Map 2, where a number of bugs in the processing are still visible: parts of the mirror over China remain; none of the coastal waters that are legally US territory (i.e. the Exclusive Economic Zone of 200 nautical miles offshore) are shown; the Aleutian Islands off the coast of Alaska are missing; and some spots around the world are allowed through, including 0,0 and a few seemingly random points.
Map 2: Results of current data portal processing for occurrences in the USA
My work, then, was to build new processing into our Hive/Hadoop workflow that addresses these problems and produces a map that is as close to error free as possible.  The starting point is a webservice that can answer the question "In what country (including coastal waters) does this lat/long pair fall?".  This is clearly a GIS problem - in GIS-speak it is a reverse geocode - and something that PostGIS is well equipped to provide.  Because country definitions and borders change semi-regularly, it seemed wisest to use a trusted source of country boundaries (shapefiles) that we could replace whenever needed.  Similarly we needed the boundaries of Exclusive Economic Zones to cover coastal waters.  The political boundaries come from Natural Earth, and the EEZ boundaries shapefile comes from the VLIZ Maritime Boundaries Geodatabase.

While not an especially difficult query to formulate, a word to the wise: if you're doing this kind of reverse geocode lookup, remember to build your query by first scoping the distance check to the bounding box of the enclosing polygon, like so:
where the_geom && ST_GeomFromText(#{point}, 4326)
  and ST_Distance(the_geom, ST_GeomFromText(#{point}, 4326)) < 0.001
This buys an order of magnitude improvement in query response time!
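
Putting it together, the lookup the webservice runs ends up along the lines of the JDBC sketch below. The table and column names (political_boundaries, iso_country_code, the_geom) are assumptions based on the description above rather than the real schema, and only the political-boundaries half of the lookup is shown.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReverseGeocoder {

  // The && bounding-box test lets PostGIS use its spatial index to prune candidate
  // polygons before the more expensive distance test runs.
  private static final String SQL =
      "select iso_country_code from political_boundaries "
      + "where the_geom && ST_GeomFromText(?, 4326) "
      + "and ST_Distance(the_geom, ST_GeomFromText(?, 4326)) < 0.001";

  // Returns the first matching country code, or null if the point matches no country.
  public static String countryOf(Connection conn, double lat, double lng) throws SQLException {
    String point = "POINT(" + lng + " " + lat + ")"; // WKT is x y, i.e. lng lat
    PreparedStatement ps = conn.prepareStatement(SQL);
    try {
      ps.setString(1, point);
      ps.setString(2, point);
      ResultSet rs = ps.executeQuery();
      return rs.next() ? rs.getString(1) : null;
    } finally {
      ps.close();
    }
  }
}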

With a thin webservice wrapper built with Jersey, we have the GIS pieces in place.  We opted for a webservice approach so that we can ultimately expose this quality-control utility externally.  Since we process in Hadoop, this webservice came under huge stress - we were effectively DDoS'ing ourselves.  I mentioned a similar approach in my last entry, where we alleviated the problem by load balancing across multiple machines.  And in case anyone is wondering why we didn't just use Google's reverse-geocoding webservice, the answer is twofold: first, it would violate their terms of use, and second, even if we were allowed, they impose a rate limit on how many queries you can send over time, which would have brought our workflow to its knees.
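
For reference, a thin Jersey (JAX-RS) wrapper around that lookup can be as small as the sketch below; the path, parameter names and the CountryLookup abstraction are hypothetical, not the actual occurrence-spatial API.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

// Hypothetical resource answering e.g. GET /reverse-geocode?lat=38.9&lng=-77.0 with "US".
@Path("/reverse-geocode")
public class ReverseGeocodeResource {

  // Hypothetical abstraction over the PostGIS-backed lookup sketched earlier.
  public interface CountryLookup {
    String countryOf(double lat, double lng);
  }

  private final CountryLookup lookup;

  public ReverseGeocodeResource(CountryLookup lookup) {
    this.lookup = lookup;
  }

  @GET
  @Produces(MediaType.TEXT_PLAIN)
  public String get(@QueryParam("lat") double lat, @QueryParam("lng") double lng) {
    String country = lookup.countryOf(lat, lng);
    return country == null ? "" : country;
  }
}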

The last piece of the puzzle is calling the webservice from a Hive UDF and adding that to our workflow, which is reasonably straightforward.  The result of the new processing is shown in Map 3, where the problems of Map 2 are all addressed.
Map 3: Results of new processing workflow for occurrences in the USA
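
As an illustration, the UDF end of that call can be as simple as the sketch below; the class name, service URL and example table are hypothetical, and the real calls go through the load-balanced webservice described above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical Hive UDF that asks the reverse geocode webservice which country a
// lat/lng pair falls in, so HiveQL can do e.g.
//   select country_of(latitude, longitude) from raw_occurrence_record;
public class CountryOfUdf extends UDF {

  public Text evaluate(Double lat, Double lng) throws Exception {
    if (lat == null || lng == null) {
      return null;
    }
    URL url = new URL("http://geocode.example.org/reverse-geocode?lat=" + lat + "&lng=" + lng);
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
      String country = in.readLine();
      return country == null || country.isEmpty() ? null : new Text(country.trim());
    } finally {
      in.close();
    }
  }
}

The jar then just needs to be added and the function registered in Hive (ADD JAR / CREATE TEMPORARY FUNCTION) before the workflow can call it.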
These maps and the cleanup processing behind them will replace the existing maps and processing in our data portal later this year, hopefully in as little as a few months.

You can find the source of the reverse-geocode webservice at the Google Code site for the occurrence-spatial project.  Similarly you can browse the source of the Hadoop/Hive workflow and the Hive UDFs.

8 comments:

  1. This is so amazing! Great work, guys!

  2. Thanks, the improvement in map 3 is rather impressive!

    I have a few "workflow" questions though:

    1) Are the records edited after this has been done (for example, removing coordinate pairs that are obviously wrong, such as 0,0, and marking the record as not georeferenced), or is this only applied to the "map view"?
    2) Do you plan in the future to report the results of these quality checks to the data providers? Maybe I'm a little naive, but I think an ideal solution would be to report these so the data provider can correct them at the source.
    3) Do you plan other tools/initiatives to encourage as much data cleaning as possible to be done "at the source"? It seems to have some advantages: avoiding possible update conflicts when re-indexing, lessening the load on GBIF servers when indexing the same data again, and making providers more aware of data quality issues (I imagine you can sometimes have strange values that are NOT errors, and the data curator is the most competent person to judge that)...

    Thanks again for your work at the secretariat and the willingness to communicate about what you do; it's really great and useful for people like me :)

    Nicolas

  3. Hi Nicolas,

    1) We don't edit records here - only mark them with an issue flag. Maps only show records with no known issues.

    2) We already do so in some manner, through the event log for the data owner. We plan to improve that though.

    3) Definitely, cleaning at the source is the best option. I have not looked much at those tools myself yet, but plan to. CRIA, I know, have developed great tools for that.

    Cheers,
    Tim

  4. Thanks Tim, that gives me a better understanding!

  5. A real improvement - well done.

    Agree with niconoe on 2. Feedback of errors to source data providers is key to closing the georeferencing error loop. I'm sure there are many cases where GBIF data is being downloaded and cleaned for use (perhaps multiple times), but the errors are not getting back to the provider... so the GBIF data doesn't get any cleaner. It would be great to have a better mechanism for the feedback.

  6. Thanks Steven. We do indeed plan on improving this and aim to be able to demonstrate annotation brokerage towards the end of 2011 to help provide better feedback.

  7. Nice workflow! But we need to make better maps!!! These look outdated! :D I volunteer to help you with that :)

  8. Do you still have available the records that failed validation in the two steps? I'd like to have a look at those sets of "outliers".
