Wednesday, 20 April 2011

Cleanup of occurrence records

Lars here. Like Oliver, I started at GBIF in October 2010 and have no biology background either. My first task was to set up the infrastructure Tim mentioned before, but I've already written about that (at length).

To continue the series of blog posts that Oliver started, and in no particular order, I'll talk about what we are doing to process the incoming data, which is the task I was given after the Hadoop setup was done.

During our rollover we're processing occurrence records. Millions of them, about 270 million at the moment, and we expect this number to grow significantly over the next few months and years. It is only natural that there is bound to be bad data in there, for reasons ranging from simple typos to misconfigured publishing tools and transfer errors.

The more we know about the domain and the data, the more we are able to fix. Any input on how we could do better in this part of our processing is appreciated.

For fields like kingdom, phylum, country name or basis of record we do a simple lookup in a dictionary to look for common mistakes and replace those with the proper versions. Other fields like class, order, family, genus and author have far too many distinct values for us to prepare a dictionary of all the possible errors and their correct forms. That is why we only apply a few safe cleanup procedures here (e.g. removing blacklisted names or invalid characters).
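As a rough illustration, a dictionary lookup of this kind can be as simple as a map from known bad values to their canonical forms. The sketch below is a minimal, made-up Java example; the real dictionaries are much larger and maintained separately, and the entries shown here are invented.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of a dictionary-based value cleanup, assuming a simple
 * in-memory map. The example entries are invented for illustration only.
 */
public class ValueDictionary {

  private final Map<String, String> corrections = new HashMap<String, String>();

  public ValueDictionary() {
    // hypothetical entries: common mistakes mapped to their canonical form
    corrections.put("animala", "Animalia");
    corrections.put("united kingdom of great britain", "United Kingdom");
    corrections.put("observaton", "Observation");
  }

  /**
   * Returns the canonical form if the (trimmed, lower-cased) value is a known
   * mistake, otherwise the trimmed original value unchanged.
   */
  public String clean(String rawValue) {
    if (rawValue == null) {
      return null;
    }
    String key = rawValue.trim().toLowerCase();
    String corrected = corrections.get(key);
    return corrected != null ? corrected : rawValue.trim();
  }
}
```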

Scientific names are additionally parsed by the NameParser from the ECAT project, which does all kinds of fancy magic to try to infer a correct name. Altitudes, depths and coordinates get treatment as well, by looking at common unit markers and errors we've seen in the past.
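To give an idea of what the unit handling can look like, here is a hedged sketch of parsing altitude strings that carry common unit markers. The regular expression and the feet-to-metres conversion are assumptions made for this example, not the exact rules in our processing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of stripping common unit markers from altitude strings.
 * Patterns and conversions here are assumptions for the example.
 */
public class AltitudeCleaner {

  // optional sign, a number, and an optional unit such as "m" or "ft"
  private static final Pattern ALTITUDE = Pattern.compile(
      "\\s*(-?\\d+(?:[.,]\\d+)?)\\s*(m|meters?|ft|feet)?\\s*",
      Pattern.CASE_INSENSITIVE);

  /** Parses a raw altitude string into metres, or returns null if it cannot be interpreted. */
  public static Double parseMetres(String raw) {
    if (raw == null) {
      return null;
    }
    Matcher m = ALTITUDE.matcher(raw);
    if (!m.matches()) {
      return null;
    }
    double value = Double.parseDouble(m.group(1).replace(',', '.'));
    String unit = m.group(2);
    if (unit != null && unit.toLowerCase().startsWith("f")) {
      value = value * 0.3048; // feet to metres
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(parseMetres("1200 m"));  // 1200.0
    System.out.println(parseMetres("350ft"));   // ~106.68
    System.out.println(parseMetres("unknown")); // null
  }
}
```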

And last but not least, we also try to make the most of the dates we get. As everyone who has ever dealt with date strings knows, this can be one of the hardest topics in an internationalized environment. In theory our input data consists of three nicely formatted fields: year, month and day. In reality, though, a lot of dates end up entirely in the year field. We get all kinds of delimiters ("/" and "-" being among the most common), abbreviations ("Mar"), database export fragments ("1978.0" because it was a floating point variable in the database), missing data and more.

Additionally, we obviously have to deal with different date formats. Is "01/02/02" the first of February or the second of January? In most cases we can only guess.
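The small Java snippet below only demonstrates the ambiguity; it is not our actual date-handling code. Both patterns happily parse the same string, just to different dates.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Toy illustration of the "01/02/02" ambiguity. Day-first and month-first
 * patterns both accept the string, so the value alone cannot tell us
 * which interpretation is correct.
 */
public class AmbiguousDateDemo {

  public static void main(String[] args) throws ParseException {
    String raw = "01/02/02";

    SimpleDateFormat dayFirst = new SimpleDateFormat("dd/MM/yy");
    SimpleDateFormat monthFirst = new SimpleDateFormat("MM/dd/yy");
    dayFirst.setLenient(false);
    monthFirst.setLenient(false);

    System.out.println("day first:   " + dayFirst.parse(raw));   // 1 February 2002
    System.out.println("month first: " + monthFirst.parse(raw)); // 2 January 2002
  }
}
```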

Having said that: We've rewritten large parts of the date handling routines and are continuing to improve them as we know that this is an important part of our data. Feedback on how we're doing here is greatly appreciated!

I'm really hoping to have a chance to compile a few statistics about our incoming data quality once we've tested all of this in production.
