Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors - typically typos, and often a reversed sign on one or both of latitude and longitude. Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA. The common mistakes produce glaring artefacts: a reversed longitude produces the near-perfect mirror image over China; a reversed latitude produces a faint image in the Pacific off the coast of Chile; reversing both produces an even fainter image off Australia; and a 0 for latitude or longitude produces telltale straight lines along the Prime Meridian and the equator.
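As a purely illustrative aside, the checks these patterns suggest are simple to express. The sketch below is hypothetical Java (the class and method names are invented for illustration, not taken from the portal's code): it flags zero coordinates, and treats a point that misses the claimed country's bounding box while a sign-flipped variant hits it as a likely "reversed" record.

    /**
     * Purely illustrative sketch of the error patterns described above -
     * not the portal's actual code.
     */
    public class CoordinateSanityCheck {

      /** A 0 in either coordinate is the likely source of the lines on the Prime Meridian and equator. */
      public static boolean hasZeroCoordinate(double lat, double lng) {
        return lat == 0d || lng == 0d;
      }

      /** True if the point falls inside a crude bounding box for the claimed country. */
      public static boolean inBoundingBox(double lat, double lng,
          double minLat, double maxLat, double minLng, double maxLng) {
        return lat >= minLat && lat <= maxLat && lng >= minLng && lng <= maxLng;
      }

      /**
       * If the point misses the claimed country's box but a sign-flipped variant hits it,
       * the record probably suffers from the "reversed" latitude/longitude problem.
       */
      public static boolean looksSignFlipped(double lat, double lng,
          double minLat, double maxLat, double minLng, double maxLng) {
        return !inBoundingBox(lat, lng, minLat, maxLat, minLng, maxLng)
            && (inBoundingBox(-lat, lng, minLat, maxLat, minLng, maxLng)
                || inBoundingBox(lat, -lng, minLat, maxLat, minLng, maxLng)
                || inBoundingBox(-lat, -lng, minLat, maxLat, minLng, maxLng));
      }
    }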
Map 1: Verbatim (unprocessed) occurrence data coordinates for the USA
The results of doing a lookup against the overlay are shown in Map 2, where a number of bugs in the processing remain visible: parts of the mirror image over China still show through; none of the coastal waters that are legally US territory (i.e. the Exclusive Economic Zone extending 200 nautical miles offshore) are shown; the Aleutian Islands off the coast of Alaska are missing; and some stray points around the world are allowed through, including 0,0 and a few seemingly at random.
Map 2: Results of current data portal processing for occurrences in the USA
While not an especially difficult query to formulate, a word to the wise: if you're doing this kind of reverse-geocode lookup, remember to build your query so that the expensive distance check is scoped by the bounding-box (&&) test against the enclosing polygon, like so:

    where the_geom && ST_GeomFromText(#{point}, 4326)
      and ST_Distance(the_geom, ST_GeomFromText(#{point}, 4326)) < 0.001

Because the && operator lets PostGIS use the spatial index to discard most candidate geometries before the distance calculation runs, this buys an order of magnitude improvement in query response time!
With a thin webservice wrapper built with Jersey, we have the GIS pieces in place. We opted for a webservice approach so that we can ultimately expose this quality-control utility externally in the future. Because we process in Hadoop, we put huge stress on this webservice - in effect we were DDoS'ing ourselves. I mentioned a similar problem in my last entry, where we alleviated it by load balancing across multiple machines. And in case anyone is wondering why we didn't just use Google's reverse-geocoding webservice, the answer is twofold: first, it would violate their terms of use, and second, even if it were allowed, they enforce a rate limit on how many queries you can send over time, and that would have brought our workflow to its knees.
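For anyone who hasn't used Jersey, such a wrapper really is thin. Below is a minimal sketch of the pattern, with the resource path, parameter names and the nested GeocodeService abstraction assumed for illustration rather than taken from the actual occurrence-spatial code:

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.MediaType;

    /**
     * Illustrative JAX-RS/Jersey resource exposing the reverse-geocode lookup over HTTP.
     * This only sketches the wrapping pattern; the real service lives in the
     * occurrence-spatial project.
     */
    @Path("/reverse-geocode")
    public class ReverseGeocodeResource {

      /** Hypothetical abstraction over the PostGIS lookup query shown earlier. */
      public interface GeocodeService {
        String lookupCountry(double latitude, double longitude);
      }

      private final GeocodeService geocodeService;

      // Wiring of the service implementation (e.g. via an IoC container) is omitted here.
      public ReverseGeocodeResource(GeocodeService geocodeService) {
        this.geocodeService = geocodeService;
      }

      @GET
      @Produces(MediaType.TEXT_PLAIN)
      public String lookup(@QueryParam("lat") double latitude,
                           @QueryParam("lng") double longitude) {
        // Return the country code of the territory containing the point,
        // or an empty string if no polygon matches.
        String isoCountryCode = geocodeService.lookupCountry(latitude, longitude);
        return isoCountryCode == null ? "" : isoCountryCode;
      }
    }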
The last piece of the puzzle is calling the webservice from a Hive UDF and adding that to our workflow, which is reasonably straightforward. The result of the new processing is shown in Map 3, where the problems of Map 2 are all addressed.
Map 3: Results of new processing workflow for occurrences in the USA
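For the curious, here is a minimal sketch of what a Hive UDF wrapping the webservice call might look like. The service URL and the plain-text response format are assumptions for illustration; the real UDFs are linked below.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    /**
     * Illustrative Hive UDF that calls the reverse-geocode webservice for each record.
     * The endpoint and response format are assumptions for this sketch.
     */
    public class ReverseGeocodeUDF extends UDF {

      // Hypothetical endpoint; in practice this would point at the load-balanced service.
      private static final String SERVICE_URL = "http://localhost:8080/reverse-geocode";

      public Text evaluate(Double latitude, Double longitude) {
        if (latitude == null || longitude == null) {
          return null; // nothing to look up for records without coordinates
        }
        try {
          URL url = new URL(SERVICE_URL + "?lat=" + latitude + "&lng=" + longitude);
          BufferedReader reader =
              new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
          try {
            // The service is assumed to return a bare country code on a single line.
            String countryCode = reader.readLine();
            return countryCode == null ? null : new Text(countryCode.trim());
          } finally {
            reader.close();
          }
        } catch (Exception e) {
          // A failed lookup simply means the country cannot be confirmed for this record.
          return null;
        }
      }
    }

Once registered in Hive, a function like this can be applied to the interpreted latitude and longitude columns like any other UDF in the workflow.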
You can find the source of the reverse-geocode webservice at the Google code site for the occurrence-spatial project. Similarly you can browse the source of the Hadoop/Hive workflow and the Hive UDFs.
This is so amazing! Great work, guys!
Thanks, the improvement in Map 3 is rather impressive!
I have a few "workflow" questions though:
1) Are the records edited after this has been done (for example, removing coordinate pairs that are obviously wrong, such as 0,0, and marking the record as not-georeferenced), or is the correction only applied for the "map view"?
2) Do you plan in the future to report the results of these quality checks to the data providers? Maybe I'm a little naive, but I think an ideal solution would be to report these issues so the data provider can correct them at the source.
3) Do you plan other tools/initiatives to encourage as much data cleaning as possible to be done "at the source"? This seems to have some advantages: it avoids possible update conflicts when re-indexing, lessens the load on GBIF servers when indexing the same data again, and makes providers more aware of data-quality issues (I imagine you can sometimes have strange values that are NOT errors, and the data curator is the most competent person to judge that)...
Thanks again for your work at the secretariat and your willingness to communicate about what you do; it's really great and useful for people like me :)
Nicolas
Hi Nicolas,
1) We don't edit records here - we only mark them with an issue flag. The maps only show records with no known issues.
2) We already do this in some manner, through the event log for the data owner. We plan to improve that, though.
3) Definitely, cleaning at the source is the best option. I have not looked much at those tools myself yet, but plan to. CRIA, I know, have developed great tools for that.
Cheers,
Tim
Thanks Tim, that gives me a better understanding!
A real improvement - well done.
ReplyDeleteAgree with niconoe on 2. Feedback of errors to source data providers is key to closing the georef error loop. I'm sure there are many cases where GBIF data is being downloaded and cleaned for use (perhaps multiple times), but the errors are not getting back to the provider...so GBIF doesn't get any cleaner. Would be great to have a better mechanism for the feedback.
Thanks Steven. We do indeed plan on improving this and aim to be able to demonstrate annotation brokerage towards the end of 2011 to help provide better feedback.
Nice workflow! But we need to make better maps!!! These look outdated! :D I volunteer to help you with that :)
Do you still have the records that failed validation after the two steps? I'd like to have a look at those sets of "outliers".