Monday, 30 May 2011

Decoupling components

Recent blog posts have introduced some of the registry and portal processing work under development at GBIF. Here I'd like to introduce some of the research underway to improve the overall processing workflows by identifying well-defined components and decoupling unnecessary dependencies. The target is to improve the robustness, reliability and throughput of the data indexing performed for the portal.

Key to the GBIF portal is the crawling, processing and indexing of the content shared through the GBIF network, which is currently performed by the Harvesting and Indexing Toolkit (HIT).  Today the HIT operates largely as follows:
  1. Synchronise with the registry to discover the technical endpoints
  2. Allow the administrator to schedule the harvesting and processing of an endpoint, as follows:
    1. Initiate a metadata request to discover the datasets at the endpoint
    2. For each resource initiate a request for the inventory of distinct scientific names
    3. Process the names into ranges 
    4. Harvest the records by name range
    5. Process the harvested responses into tab delimited files
    6. Synchronise the tab delimited files with the database "verbatim" tables
    7. Process the "verbatim" tables into interpreted tables
Logically the HIT is depicted:
Some of the limitations in this model include:

  1. The tight coupling between the HIT and the target DB means we need to stop harvesting whenever very expensive processing is performed on the database
  2. Changes to the user interface for the HIT require the harvester to be stopped
  3. The user interface console is driven by the same machine that is crawling, meaning the UI becomes unresponsive periodically.
  4. The tight coupling between the HIT and the target DB precludes the option of storing in multiple datastores (something we currently want as we investigate enriching the occurrence store)

The HIT can be separated into the following distinct concerns:

  1. An administration console to allow the scheduling, oversight and diagnostics of crawlers
  2. Crawlers that harvest the content 
  3. Synchronisers that interpret and persist the content into the target datastores  

An event-driven architecture would allow this to happen and overcome the current limitations. In this model, components can be deployed independently and message each other through a queue when significant events occur. Subscribers to the queue determine what action, if any, to take on a per-message basis. The architecture under research is shown:
In this depiction, the following sequence of events would occur:

  1. Through the Administration console, the administrator schedules the crawling of a resource.  
  2. The scheduler broadcasts to the queue that the resource is to be crawled rather than spawning a crawler directly.  
  3. When capacity allows, a crawler will act on this event and crawl the resource, storing to the filesystem as it goes. For each response received, the crawler will broadcast that the response is to be handled (a sketch of such a broadcast follows this list).
  4. Synchronisers will act on the new response messages and store them in the occurrence target stores. In the above depiction there are actually two target stores, each of which would act on the message indicating there is new data to synchronise.
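As an illustration of that broadcast step - no particular queue technology has been chosen yet, so the broker (ActiveMQ over JMS), class names and topic name below are purely hypothetical - a crawler-side publisher might look like this:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

// Broadcasts a "response harvested" event to the queue instead of calling the
// synchronisers directly; any subscriber decides what, if anything, to do with it.
public class CrawlEventPublisher {
  private final ConnectionFactory factory;

  public CrawlEventPublisher(String brokerUrl) {
    this.factory = new ActiveMQConnectionFactory(brokerUrl);
  }

  public void responseHarvested(String resourceKey, String responseFile) throws Exception {
    Connection connection = factory.createConnection();
    try {
      Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      MessageProducer producer = session.createProducer(session.createTopic("crawl.responses"));
      // the message payload simply says which resource produced which response file
      producer.send(session.createTextMessage(resourceKey + "|" + responseFile));
    } finally {
      connection.close();
    }
  }
}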
This architecture would bring significant improvements over the existing setup. The crawlers would only ever stop for bug fixing in the crawlers themselves. Different target stores can be researched independently of the crawling codebase. The user interface for scheduling can be developed and redeployed without interrupting the crawling.


As an aside, during this exercise we are also investigating improvements in the following:
  1. The HIT (today) performs the metadata request, but does NOT update the registry with the datasets that are discovered, only the data portal. The GBIF registry is "dataset aware" for the datasets served through the Integrated Publishing Toolkit, and ultimately we intend the registry to be able to reconcile the multiple identifiers associated with a dataset. For example, it should be possible in the future to synchronise with the likes of the Biodiversity Collections Index, which is a dataset-level registry.
  2. The harvesting procedure is rather complex, with many points of failure; it involves inventories of scientific names, processing into ranges of names and a harvest based on the name ranges. Early tests suggest a simpler approach of discrete name ranges [Aaa-Aaz, Aba-Abz ... Zza-Zzz] yields better results (a small sketch of generating such ranges follows this list).
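Generating those discrete ranges is trivial; a throwaway sketch (illustrative only, not HIT code):

import java.util.ArrayList;
import java.util.List;

// Produces the fixed name ranges Aaa-Aaz, Aba-Abz, ..., Zza-Zzz (676 in total).
public class NameRanges {

  public static List<String[]> ranges() {
    List<String[]> ranges = new ArrayList<String[]>();
    for (char first = 'A'; first <= 'Z'; first++) {
      for (char second = 'a'; second <= 'z'; second++) {
        ranges.add(new String[] {"" + first + second + 'a', "" + first + second + 'z'});
      }
    }
    return ranges;
  }
}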
Watch this space for results of this investigation...

Friday, 27 May 2011

The Phantom Records Menace

For a data administrator, going to the web test interface of a data publisher can be incredibly useful if one needs to compare the data that was collected using the Harvesting and Indexing Toolkit (HIT) with what is available from the publisher. In a perfect world the transfer of records would happen without a glitch, but when we eventually get less (or more!) than we asked for, the search/test interfaces can be a real help (for instance the PyWrapper querying utilities).

Sometimes GBIF will index a resource that, for no apparent reason, returns fewer records than expected from the line count that the HIT performs automatically. In this particular case there appear to be several identical records on top of that – which we are made aware of by the HIT warning us that there are multiple records with the same “holy triplet”: institution code, collection code and catalogue number.

Now what happens when a request goes out for this name range: Abies alba Mill. - Achillea millefolium L., followed by a request for Achillea millefolium agg. - Acinos arvensis (Lam.) Dandy? Those of you with good eyesight will have spotted that the request asks for Achillea millefolium L. before Achillea millefolium agg. This is because this particular instance or configuration of pywrapper returns a name range that is sorted according to the character values you find in UTF-8 and ASCII/Latin-1, which order all upper-case characters before the lower-case ones. Whether this is an artifact of the underlying database system or the pywrapper itself, or even a specific version of the wrapper, is not yet known, but the scenario exists today and consumers should be aware of it. The HIT then builds requests based on this name range, and if the requests by chance divide between “Achillea millefolium L.” and “Achillea millefolium agg.” you will receive overlapping responses - that is, two responses that contain parts of each other’s records - because the response is not based on a BINARY select statement and therefore returns the records alphabetically sorted without giving precedence to upper-case letters. This behavior can be replicated by going to the pywrapper interface and searching these name ranges. Fortunately the HIT removes the redundant records during the synchronizing process. However, the record count is based on the line count at the point where the records are received from the access point. This is why the record count in the HIT is inflated, and as you can see this kind of error can be a bit difficult to spot.
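To make the two orderings concrete, here is a small illustration in Java (not HIT or pywrapper code): a plain character-value sort puts the upper-case epithet first, while a case-insensitive sort does the opposite.

import java.util.Arrays;

public class SortOrderDemo {

  public static void main(String[] args) {
    String[] names = {"Achillea millefolium agg.", "Achillea millefolium L."};

    // Character-value (ASCII/UTF-8) ordering, as returned by this pywrapper instance:
    // 'L' (76) sorts before 'a' (97), so "... L." comes first.
    Arrays.sort(names);
    System.out.println(Arrays.toString(names));
    // -> [Achillea millefolium L., Achillea millefolium agg.]

    // Case-insensitive ordering, as used when the record requests are answered:
    // "... agg." now comes first, so ranges built from the first ordering can overlap.
    Arrays.sort(names, String.CASE_INSENSITIVE_ORDER);
    System.out.println(Arrays.toString(names));
    // -> [Achillea millefolium agg., Achillea millefolium L.]
  }
}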

Monday, 23 May 2011

2011 GBIF Registry Refactoring

For the past couple of months, I have been working closely with another GBIF developer (and also fellow blog writer), Federico Mendez, on development tasks for the GBIF Registry application. This post provides an overview of the work being done on this front.

First, I would like to explain the nuts and bolts of the current Registry application (the one online), and then the additions/modifications it has "suffered" during 2011 (these modifications have not been deployed). As stated in The evolution of the GBIF Registry blog post, in 2010 the Registry entered a new stage in its development by moving to a single DB, an enhanced web service API, and a web user interface. On top of this, an admin-only web interface was created so that we could do internal curation of the data inside the Secretariat.

Hibernate was chosen as the persistence framework, and the Data-Access-Object (DAO) classes were coded with the HQL necessary to provide an interface to the Hibernate persistence mechanism. The Business tier consisted of several Manager classes that relied on the DAOs to get the required data. These Managers were also responsible for populating the Data-Transfer-Objects (DTOs) so that they could be passed to the Presentation tier. This last tier made use of plain Java Server Pages (JSPs), along with jQuery, Ajax and CSS, among others. Then, at the start of 2011, a decision was made to improve the application's underlying implementation in several aspects:

  1. Use of the MyBatis data mapper framework. This involved walking away from Hibernate's Object-Relational Mapping (ORM) approach. Our use of Hibernate involved HQL, which added an extra latency component when converting HQL to SQL, whereas in MyBatis we use direct SQL mapped statements, making DB access quicker. (I will share some benchmarking in my next blog post to justify this remark; a small sketch of the mapper style appears after this list.)

  2. We found that using a DTO pattern was somewhat of an overkill for an application that didn't have such complexity at the model level. We could trim some code complexity by passing the model objects straight to the presentation tier. So we did, and all DTOFactories & DTO objects were gone.

  3. Several codebase improvements were introduced, mainly by Federico, cutting out a huge number of lines and making it easier to add new functionality with less effort (e.g. heavy use of Java generics).

  4. At the web service level, the Struts2 Rest plugin was replaced by the Jersey library. I personally found the Struts2 Rest plugin lacking in documentation (a year ago), so the Registry's use of it was somewhat ad hoc. My next blog post will include more reasoning about this decision.

  5. We now make use of the Guice dependency injection framework. Previously, we were using Spring for this. Also, injections are now made through annotations; with Spring we were using XML-based injection.

  6. The Registry project is now divided into different libraries. In particular: 
    • registry-core: Business & persistence logic
    • registry-web: All related to the web application (Struts2)
    • registry-ws: All the web service stuff
    • There are also some libraries Federico has created to manage the interaction between the Registry and all technical installations (DiGIR, Tapir, BioCase, etc.) of those publishers sharing data with GBIF. These are extremely important libraries as they are the ones that keep the Registry up to date.
(2011 refactoring)
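As a rough illustration of points 1 and 5 (hypothetical class names; not the registry's actual mappers or modules), a MyBatis mapper combined with annotation-driven Guice injection looks something like this:

import java.util.List;
import java.util.Map;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;
import com.google.inject.Inject;

// A mapper declares the SQL directly - no HQL translation step - and the result
// can be handed straight to the presentation tier without a DTO layer.
interface OrganisationMapper {
  @Select("SELECT id, name FROM organisation WHERE name LIKE #{query}")
  List<Map<String, Object>> search(@Param("query") String query);
}

// Dependencies are injected through annotations rather than Spring XML wiring.
class OrganisationService {
  private final OrganisationMapper mapper;

  @Inject
  OrganisationService(OrganisationMapper mapper) {
    this.mapper = mapper;
  }

  public List<Map<String, Object>> search(String query) {
    return mapper.search(query);
  }
}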

I must emphasize again that these changes are not yet deployed; this is an ongoing project, but if you are really interested in seeing the progress being made, please feel free to visit the project's site. Also, these changes won't affect the current web services API or the DB structure; they are merely intended to improve the underlying codebase.

Thursday, 19 May 2011

Indexing biodiversity metadata using Solr: schema, import handlers, custom transformers

This post is the second part of OAI-PMH Harvesting at GBIF. That post explained how different OAI-PMH services are harvested; the subject of this one is to introduce the overall architecture of the index created using the information gathered from those services. Let's start by justifying why we needed a metadata index at GBIF: one of the main requirements was to allow end-users to search datasets. To enable this, the system provides two main search functionalities: full text search and advanced search. For both, the system displays a list of datasets containing the following information: title, provider, description (abstract) and a hyperlink to view the full metadata document in the original format (DIF, EML, etc.) provided by the source; all of that information was collected by the harvester. The results of any search had to be displayed with two specific features, among others: highlighting of the text that matched the search criteria, and grouping/filtering of the results by facets: providers, dates and OAI-PMH services. To provide these search features we couldn't leave the responsibility to the capabilities of a database, so we decided to build an index to support the search requirements. An index is like a single-table database without any support for relational queries; its only purpose is to support search, and it is not the primary source of data. The structure of the index is de-normalized and contains just the data needed to be searched. The index was implemented using Solr, an open source enterprise search platform. It has numerous other features such as search result highlighting, faceted navigation, query spell correction, auto-suggest queries and “more like this” for finding similar documents. The metadata application stores a subset of the available information in the metadata documents as Solr fields, and a special field (fullText) is used to store the whole XML document to enable full text search. The schema fields are:
  • id: full file name is used for this field.
  • title: title of the dataset,
  • provider: provider of the dataset
  • providerExact: same as the previous field, but uses String data type for facets and exact match search
  • description: description or abstract of the dataset
  • beginDate: begin date of the temporal coverage of dataset
  • endDate: end date of the temporal coverage of dataset, when the input format only supports one dataset date, beginDate and endDate will contain the same value
  • westBoundingCoordinate: Geographic west coordinate
  • eastBoundingCoordinate: Geographic east coordinate
  • northBoundingCoordinate: Geographic north coordinate
  • southBoundingCoordinate: Geographic south coordinate
  • fullText: The complete text of the XML metadata document
  • externalUrl: Url containing specific information about the dataset; in the case of the
  • serverId: Id of the source OAI-PMH Service; this information is taken from the file system structure and is used for the facets search.
The XML documents gathered by the harvester are imported into Solr using data import handlers for each input format (EML, DIF, etc.). An example data import handler for Dublin Core XML files looks like this:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="dcFiles" processor="FileListEntityProcessor" rootEntity="false"
            baseDir="/data/metadata/harvested" fileName=".*\.xml" recursive="true">
      <entity name="dcDocument" processor="XPathEntityProcessor" forEach="/dc"
              url="${dcFiles.fileAbsolutePath}"
              transformer="TemplateTransformer,org.gbif.solr.handler.dataimport.ListDateFormatTransformer">
        <field column="id" template="${dcFiles.fileAbsolutePath}"/>
        <field column="title" xpath="/dc/title"/>
        <field column="provider" xpath="/dc/publisher"/>
        <field column="providerExact" xpath="/dc/publisher"/>
        <field column="description" xpath="/dc/description"/>
        <field column="externalUrl" xpath="/dc/identifier"/>
        <field column="beginDate" xpath="/dc/date" listDateTimeFormat="yyyy-MM-dd"
               selectedDatePosition="1" separator="_" lastDay="false"/>
        <field column="endDate" xpath="/dc/date" listDateTimeFormat="yyyy-MM-dd"
               selectedDatePosition="2" separator="_" lastDay="true"/>
        <entity name="fullDocument" processor="PlainTextEntityProcessor"
                url="${dcFiles.fileAbsolutePath}">
          <field column="plainText" name="fullText"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>

The data import handlers are implemented using the following features available in Solr:
  • FileDataSource: allows fetching content from files on disk.
  • FileListEntityProcessor: an entity processor used to enumerate the list of files.
  • XPathEntityProcessor: used to index the XML files; it allows defining XPath expressions to retrieve specific elements.
  • PlainTextEntityProcessor: reads all content from the data source into a single field; this processor is used to import the whole XML file into one field.
  • DateFormatTransformer: parses date/time strings into java.util.Date instances; it is used for the date fields.
  • RegexTransformer: helps in extracting or manipulating values from fields (from the source) using Regular Expressions.
  • TemplateTransformer: used to overwrite or modify any existing Solr field or to create new Solr fields; it is used to create the id field.
  • org.gbif.solr.handler.dataimport.ListDateFormatTransformer: this is a custom transformer to handle non-standard date formats that are common in input dates; it can handle dates with formats like: 12-2010, 09-1988, and (1998)-(2000). It has three important attributes: i) separator that defines the character/string to be used as separator between year and month fields, ii) lastDay to define if the date to be used with a particular year value (e.g., 1998) should be the first or the last day of the year: if the year is being interpreted as a beginDate, then the value is set to yyyy-01-01 and lastDay is set to false; if the year is interpreted as an endDate then the value is set to yyyy-12-31 and the lastDay value is set to true, iii) selectedDatePosition to define which date is being processed when a range of dates is present in the input field; for example:

<field column="beginDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="1" separator="_" lastDay="false" xpath="/dc/date"/>

imports “dc/date” into the begin date using “_” as the separator; selectedDatePosition="1" states that the date to be processed is the first one in the range of dates, and lastDay is thus set to false. The implementation of this custom transformer is available on the Google Code site. The web interface can be visited at this URL; in a later post I'll explain how this user interface was implemented using a simple Ajax framework.
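For anyone curious what such a custom transformer looks like, its skeleton is roughly the following (an outline only; the real implementation lives in the org.gbif.solr.handler.dataimport package linked above):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Rough shape of a custom data-import-handler transformer such as
// ListDateFormatTransformer (illustrative skeleton, not the actual GBIF code).
public class ListDateFormatTransformer extends Transformer {

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object rawDate = row.get("beginDate");
    if (rawDate != null) {
      // parse values such as "12-2010" or "(1998)-(2000)" and write the
      // normalised yyyy-MM-dd value back into the row
      row.put("beginDate", normalise(rawDate.toString()));
    }
    return row;
  }

  private String normalise(String value) {
    // placeholder for the year/month/separator handling described above
    return value;
  }
}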

Tuesday, 17 May 2011

Software quality control at GBIF

We've not only set up Hadoop here at GBIF but also introduced a few other new things. With the growing software development team we've felt the need to put some control measures in place to guarantee the quality of our software and to make the development process more transparent both for us at GBIF and hopefully for other interested parties as well.

GBIF projects have always been open source and hosted at their Google Code sites (e.g. the GBIF Occurrencestore or the IPT), so in theory it was always possible for everyone to check and review every commit. We have now also set up a Jenkins server that does continuous integration for us, which means that every time a change is made to one of our projects it is checked out and a full build is run, including all tests, code quality measurements (I'm going to get back to those later), web site creation (e.g. Javadocs) and publishing of the results to our Maven repository.

This is the first step in our new process. Every commit is checked in this way, and it has greatly improved the stability of our builds. Our Jenkins server is publicly visible at the URL http://hudson.gbif.org (background on the Hudson name in the URL can be found on Wikipedia).

As part of the process Jenkins also calls a code quality server called Sonar. Our Sonar instance is public as well. Take a look at the metrics for the IPT, for example. You'll see a lot of information about our code, good and bad. We're not yet using this information extensively, but we are looking into which metrics are useful to incorporate more closely into our development process. One example is a set of Coding Conventions to make the code consistent and easier to understand for everybody.

Once the build has finished the Sonar stage, the results are pushed to our Maven repository (which is running a Nexus server). That means we now have up-to-date SNAPSHOT builds of all our projects available (to use in our projects and yours).

At the moment we don't have a lot of code contributions to our projects from outside GBIF, but we hope that by making our development process more transparent we can encourage others to take a look as well.

We're always open for suggestions, questions and comments about our code base.

Monday, 16 May 2011

Here be dragons - mapping occurrence data

One of the most compelling ways of viewing GBIF data is on a map.  While name lists and detailed text are useful if you know what you're looking for, a map can give you the overview you need to start honing your search.  I've always liked playing with maps in web applications and recently I had the chance to add the functionality to our new Hadoop/Hive processing that answers the question "what species occurrence records exist in country x?".

Approximately 82% of the GBIF occurrence records have latitude and longitude recorded, but these often contain errors - typically typos, and often one or both of lat and long reversed. Map 1, below, plots all of the verbatim (i.e. completely unprocessed) records that have a latitude and longitude and claim to be in the USA. Note the common mistakes, which result in glaring errors: reversed longitude produces the near-perfect mirror over China; reversed latitude produces a faint image over the Pacific off the coast of Chile; reversing both produces an even fainter image off Australia; and setting 0 for lat or long produces tell-tale straight lines over the Prime Meridian and the equator.
Map 1: Verbatim (unprocessed) occurrence data coordinates for the USA
One of the goals of the GBIF Secretariat is to help publishers improve their data, and identifying and reporting back these types of problems is one way of doing that. Of course the current GBIF data portal attempts to filter these records before displaying them. The current system for verifying that given coordinates fall within the country they claim is to overlay a 1-degree grid on the world map and identify each of those grid points as belonging to one or more countries. This overlay is curated by hand, is therefore error prone, and its maintenance is time consuming.

The results of doing a lookup against the overlay are shown in Map 2, where a number of bugs in the processing remain: parts of the mirror over China are still visible; none of the coastal waters that are legally US territory (i.e. the Exclusive Economic Zone of 200 nautical miles offshore) are shown; the Aleutian Islands off the coast of Alaska are not shown; and some spots around the world are allowed through, including 0,0 and a few seemingly at random.
Map 2: Results of current data portal processing for occurrences in the USA
My work, then, was to build new steps into our Hive/Hadoop processing workflow that address these problems and produce a map that is as close to error free as possible. The starting point is a webservice that can answer the question "In what country (including coastal waters) does this lat/long pair fall?". This is clearly a GIS problem - in GIS-speak a reverse geocode - and something that PostGIS is well equipped to provide. Because country definitions and borders change semi-regularly, it seemed wisest to use a trusted source of country boundaries (shapefiles) that we could replace whenever needed. Similarly, we needed the boundaries of Exclusive Economic Zones to cover coastal waters. The political boundaries come from Natural Earth, and the EEZ boundaries shapefile comes from the VLIZ Maritime Boundaries Geodatabase.

While not an especially difficult query to formulate, a word to the wise: if you're doing this kind of reverse geocode lookup, remember to build your query by scoping the distance query within its enclosing polygon, like so:
where the_geom && ST_GeomFromText(#{point}, 4326) and distance(the_geom, geomfromtext(#{point}, 4326)) < 0.001
This buys an order of magnitude improvement in query response time!

With a thin webservice wrapper from Jersey, we have the GIS pieces built. We opted for a webservice approach to allow us to ultimately expose this quality control utility externally in the future. Since we process in Hadoop, we experienced huge stress on this web service - we were DDOS'ing ourselves. I mentioned a similar approach in my last entry, where we alleviated the problem with load balancing across multiple machines. And in case anyone is wondering why we didn't just use Google's reverse-geocoding webservice, the answer is twofold - first, it violates their terms of use, and second, even if we were allowed, they impose a rate limit on how many queries you can send over time, and that would have brought our workflow to its knees.
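The Jersey wrapper really is thin - roughly the shape below (class, path and method names are illustrative, not the actual occurrence-spatial code):

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;

// Hypothetical shape of the thin Jersey wrapper: the resource simply delegates
// the reverse geocode to a DAO running the PostGIS query shown above.
@Path("/reverse-geocode")
public class ReverseGeocodeResource {

  @GET
  @Produces("text/plain")
  public String countryFor(@QueryParam("lat") double latitude,
                           @QueryParam("lng") double longitude) {
    // e.g. "US" for points inside the USA or its EEZ, empty otherwise
    return lookupCountry(latitude, longitude);
  }

  private String lookupCountry(double latitude, double longitude) {
    // placeholder for the PostGIS-backed lookup
    return "";
  }
}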

The last piece of the puzzle is adding the call to the webservice from a Hive UDF and adding it to our workflow, which is reasonably straightforward. The result of the new processing is shown in Map 3, where the problems of Map 2 are all addressed.
Map 3: Results of new processing workflow for occurrences in the USA
These maps and the mapping cleanup processing will replace the existing maps and processing in our data portal later this year, hopefully in as little as a few months.

You can find the source of the reverse-geocode webservice at the Google code site for the occurrence-spatial project.  Similarly you can browse the source of the Hadoop/Hive workflow and the Hive UDFs.
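For readers who haven't written a Hive UDF before, the webservice call wraps up into something like the following (a sketch with an illustrative endpoint URL and class name; the real UDFs are in the projects linked above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Asks the reverse-geocode webservice which country a coordinate falls in.
public class ReverseGeocodeUDF extends UDF {

  public Text evaluate(Text latitude, Text longitude) {
    if (latitude == null || longitude == null) {
      return null;
    }
    try {
      URL url = new URL("http://example.gbif.org/reverse-geocode?lat=" + latitude + "&lng=" + longitude);
      BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
      try {
        return new Text(in.readLine());
      } finally {
        in.close();
      }
    } catch (Exception e) {
      return null; // treat lookup failures as "unknown country"
    }
  }
}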

Wednesday, 11 May 2011

The GBIF Spreadsheet Processor - an easy option to publish data

Most data publishers in the GBIF Network use software wrappers to make data available on the web. To set up those tools, an institution or an individual usually needs a certain degree of technical capacity, and this more or less raises the threshold for publishing biodiversity data.

Imagine an entomologist who deals with collections and monographs every day; the only things s/he uses on a PC are Word and Excel. S/he has no student to help out, but is keen to share the data before s/he retires. What is s/he going to do?

One of our tools is built to support this kind of scenario - the GBIF Darwin Core Archive Spreadsheet Processor, usually we just call it "the Spreadsheet Processor."

The Spreadsheet Processor is a web application with which one can:

  1. Use templates provided on the web site;
  2. Fill in and upload (or email) the XLS file;
  3. Get a Darwin Core Archive file as the result.

This is a pretty straightforward way to prepare data for publishing, because the learning curve is flat for users who already know how to use Excel and how to upload a file to a web site.

When the spreadsheet template is uploaded to the page, the web app first parses the values in the metadata sheet to generate an eml.xml, and then the occurrence or checklist sheet to generate a meta.xml and a CSV file. These files are then collected and zipped according to the Darwin Core Archive standard - ready to download.
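The packaging step at the end is nothing more exotic than writing a zip file; a minimal sketch (illustrative only, not the processor's actual code):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Zips the generated eml.xml, meta.xml and data file into a Darwin Core Archive.
public class DwcArchiveWriter {

  public static void write(File emlXml, File metaXml, File dataFile, File archive) throws Exception {
    ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(archive));
    try {
      for (File f : new File[] {emlXml, metaXml, dataFile}) {
        zip.putNextEntry(new ZipEntry(f.getName()));
        FileInputStream in = new FileInputStream(f);
        try {
          byte[] buffer = new byte[4096];
          int read;
          while ((read = in.read(buffer)) != -1) {
            zip.write(buffer, 0, read);
          }
        } finally {
          in.close();
        }
        zip.closeEntry();
      }
    } finally {
      zip.close();
    }
  }
}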

With a DwC-A file, the data is in a standardized format and ready to be published. In the example scenario above, this entomologist can either just share the file among colleagues, or send it to the nearest GBIF node that hosts an IPT. Since the IPT can digest a DwC-A file and publish it, the entomologist doesn't need to know how to use the IPT. To update the data, s/he can revise the spreadsheet, create a new DwC-A and send it to the node again.

P.S. This manual explains how to publish and register data in DwC-A format.

Tuesday, 10 May 2011

Reworking the HIT, after reworking the Portal processing

If GBIF reworks the Portal processing, then what would be the knock-on effect on the Harvesting and Indexing Toolkit (HIT)? This blog serves to talk a little about the future of the HIT, and very little about the new Portal processing (saved for later blogs).

To provide some background, the HIT has three major responsibilities:
  1. harvesting specimen and occurrence data from data publishers,
  2. writing that data in its raw form to the database, and 
  3. transforming raw data into its processed form by running quality assurance routines (such as date and terrestrial point validation) and tying it to the backbone "nub" taxonomy.

When it is complete, the new Portal processing is actually going to take over step 3. In the new processing, data will be extracted from the MySQL database into HBase (using Sqoop), where quality assurance routines can be run much more quickly. Running outside of the MySQL database means that there won't be any more competition between steps 2 and 3 - step 3 constantly locking the raw data table in order to run its routines. That will mean the HIT will be able to write raw data to the database uninterrupted.

Lately the HIT has been having some frustrations trying to process large datasets. For example, a dataset with 12 million records, processed 10,000 records at a time, would lock the raw table for 10 minutes while scanning through the more than 280 million raw records in order to generate its record set. No raw data can be written during that time, thereby bringing the massively parallel application to its knees. Perhaps now you can understand why the rework of the Portal processing is so urgently needed.

For the few adopters of the HIT that will still require the application with its current functionality, please rest assured that the project will maintain a separate trimmed-down version when the time comes to adapt it. It will always remain an open-source application that anyone in the community can customize for their own needs.

Friday, 6 May 2011

Improving Hive join performance using Oozie

In the portal processing we are making use of Apache Hive to provide SQL capabilities and Yahoo!'s Oozie to provide a workflow engine.  In this blog I explain how we are making use of forks to improve the join performance of Hive, by further parallelizing the join beyond what Hive provides natively.
Please note that this approach was adopted with Hive version 0.5; Hive 0.7 brings significant improvements to joins.
For the purposes of this explanation, let's consider the following simple example, where a table of verbatim values is being processed into four tables in a star schema:
To generate the leaves of the star, we have three simple queries (making use of a simple UDF to produce the incrementing IDs; a sketch of such a UDF follows these queries):


CREATE TABLE institution_code AS
SELECT rowSequence() AS id, institution_code
FROM verbatim_record
GROUP BY institution_code;

CREATE TABLE collection_code AS
SELECT rowSequence() AS id, collection_code
FROM verbatim_record
GROUP BY collection_code;

CREATE TABLE catalogue_number AS
SELECT rowSequence() AS id, catalogue_number
FROM verbatim_record
GROUP BY catalogue_number;
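The rowSequence() UDF can be as small as the following (a sketch in the spirit of Hive's contrib UDFRowSequence; note the sequence is only unique within a single map or reduce task):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;

// Emits an increasing sequence number, incremented on every call.
public class RowSequenceUDF extends UDF {
  private final LongWritable result = new LongWritable(0);

  public LongWritable evaluate() {
    result.set(result.get() + 1);
    return result;
  }
}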

To build the core of the star the simple approach is to issue the following SQL:

CREATE TABLE parsed_content AS
SELECT v.id AS id, ic.id AS institution_code_id, 
cc.id AS collection_code_id, cn.id AS catalogue_number_id
FROM verbatim_record v 
  JOIN institution_code ic ON v.institution_code=ic.institution_code
  JOIN collection_code cc ON v.collection_code=cc.collection_code
  JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;

What is important to note is that the JOIN is across three different values, and this results in a query plan with three sequential MR jobs and a very large intermediate result set, which is ultimately passed through the final Reduce in the Hive plan.

By using Oozie (see the bottom of this post for pseudo workflow config), we are able to produce three temporary join tables, in a parallel fork, and then do a single join to bring it all back together.

# parallel join 1
CREATE TABLE t1 AS
SELECT v.id AS id, ic.id AS institution_code_id 
FROM verbatim_record v JOIN institution_code ic ON v.institution_code=ic.institution_code;

# parallel join 2
CREATE TABLE t2 AS
SELECT v.id AS id, cc.id AS collection_code_id 
FROM verbatim_record v JOIN collection_code cc ON v.collection_code=cc.collection_code;

# parallel join 3
CREATE TABLE t3 AS
SELECT v.id AS id, cn.id AS catalogue_number_id
FROM verbatim_record v JOIN catalogue_number cn ON v.catalogue_number=cn.catalogue_number;

CREATE TABLE parsed_content AS
SELECT v.id AS id, t1.institution_code_id,
t2.collection_code_id, t3.catalogue_number_id
FROM verbatim_record v 
  JOIN t1 ON v.id=t1.id
  JOIN t2 ON v.id=t2.id
  JOIN t3 ON v.id=t3.id;

Because we have built the join tables in parallel, and the final query joins on the id key only, Hive compiles it to a single MR job and it runs much more quickly.

In reality our tables are far more complex, and we use a map-side JOIN for institution_code since it is small, but for our small cluster and the following table sizes we saw a reduction from several hours to 40 minutes to compute these tables:
  • verbatim_record: 284 million
  • collection_code: 1.5 million
  • catalogue_number: 199 million
  • institution_code: 8 thousand
All of this work can be found here.

Pseudo workflow config for this:

<workflow-app name="hive-star-join" xmlns="uri:oozie:workflow:0.1">

  <start to="fork-joins"/>

  <fork name="fork-joins">
    <path start="join-institution-code"/>
    <path start="join-collection-code"/>
    <path start="join-catalogue-number"/>
  </fork>

  <action name="join-institution-code">
    <!-- Hive action creating table t1 -->
    ...
    <ok to="join-wait"/>
    <error to="fail"/>
  </action>

  <action name="join-collection-code">
    <!-- Hive action creating table t2 -->
    ...
    <ok to="join-wait"/>
    <error to="fail"/>
  </action>

  <action name="join-catalogue-number">
    <!-- Hive action creating table t3 -->
    ...
    <ok to="join-wait"/>
    <error to="fail"/>
  </action>

  <join name="join-wait" to="final-join"/>

  <action name="final-join">
    <!-- Hive action creating parsed_content from t1, t2 and t3 -->
    ...
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Star join failed</message>
  </kill>

  <end name="end"/>

</workflow-app>

Wednesday, 4 May 2011

Line terminating characters breaking Darwin Core Archive

Hi, I am Jan K. Legind, the new data administrator at the GBIF Secretariat, and one of my responsibilities is to ensure that datasets from publishers get indexed so that the data can be made available through the GBIF Portal. I am a historian by training and I worked with archival data collection and testing prior to joining GBIF.

Recently I have been bug hunting a large dataset (a DwC-Archive) that from a casual glance would look OK on the publisher side, but upon hitting the parser several records would be rejected because of the occurrence of line terminating characters in the records themselves (hex value 0A). On top of that, the individual record would be replaced by one empty line due to the illegal line termination, AND another empty line would be added because the tail end of the record appears to the parser as the start of a new record, which of course is not well-formed (thus being replaced with blank line number two). The parser sees a line that has too few columns and drops it. Since the line was bisected, the tail end will also be treated as an individual line with an insufficient number of columns.

Here is an example of a record that would be replaced by an empty line:

The line terminating characters seem to have been escaped, but without achieving the desired result. The secondary effect of this error is that the record count is miscalculated, since the parser merely counts the lines and therefore ends up with a larger number than the publisher expected (remember that the line terminating character breaks the data file by producing two lines with an incorrect number of columns). Incidentally, this can sometimes explain why we harvest MORE than 100% of the target records.
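A quick way to see the effect on a downloaded data file is to compare the raw line count with the number of lines that have the expected number of tab-separated columns; an illustrative check (not the HIT's actual code):

import java.io.BufferedReader;
import java.io.FileReader;

// Compares the naive line count with the number of well-formed records:
// a record broken by an embedded 0x0A yields two lines, each with too few columns.
public class LineCountCheck {

  public static void main(String[] args) throws Exception {
    int expectedColumns = Integer.parseInt(args[1]);
    int lines = 0;
    int wellFormed = 0;
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        lines++;
        if (line.split("\t", -1).length == expectedColumns) {
          wellFormed++;
        }
      }
    } finally {
      reader.close();
    }
    System.out.println(lines + " lines, " + wellFormed + " well-formed records");
  }
}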

By using the Integrated Publishing Toolkit (IPT), illegal characters can be avoided and publishers will benefit from a faster transition to data appearing live in the GBIF portal: http://www.gbif.org/informatics/infrastructure/publishing/

Fortunately I am working in a joint effort with the publisher’s team on ironing out the bumps in this resource so we can get the data published quickly and prevent future errors of this sort.

Monday, 2 May 2011

GBIF Data Portal

The current GBIF Data Portal was designed and implemented in 2005/2006, around the time I first joined the GBIF Secretariat in Copenhagen. As I am not a developer myself, but have been involved with the Data Portal for a long time, I thought I would take the opportunity to give a bit of a summary view of some of the components discussed in other posts here, looking at them more from the perspective of the Data Portal.

The GBIF Data Portal has been in operation more or less in its current form since mid 2007. Since the time it was designed, the Portal's focus has been on providing discovery of and access to primary species occurrence data (specimens in museums, observations in the field, culture strains and others). Since the launch, bug fixes and some minor changes were made, but development stopped due to new priorities. We did receive a lot of input on data content and functionality, though, both from data publishers and data users, and also through a number of reports and analyses.

Towards the end of 2010, a new development phase started, initiating version 2 of the GBIF Data Portal. This was the time to start taking care of all the known shortcomings and improvement requests, e.g. a more robust and reliable backbone taxonomy, improvement of data quality, better attribution of contributors, and others. However, this is not just a matter of adding some data or changing the user interface: a lot of those points first require considerable reworking of internal processing and workflows between the Data Portal and related components, blogged about in other contributions here:
  • quicker indexing and more frequent rollovers (publication cycles) from the non-public indexing database to the public web portal can only be achieved through a complete re-working of the rollover processing workflow.
  • a reliable taxonomic backbone required a review and re-implementation of name parsing routines, integrating lookup services, and following that, a complete regeneration of the taxonomic backbone
  • the demand for better attribution of data owners and service providers can only be met after having moved on to a new registry, better modelling the GBIF network structure, players and interactions. This is especially the case where datasets are aggregated or hosted, and both the owning and the service providing institution need to receive proper credit for their contributions
  • extended and improved metadata are needed to assess suitability of a dataset for specific applications (e.g. modelling), and to allow discovery of collections that are not digitised or not published
In 2011, GBIF Data Portal development focuses on consolidating and integrating these re-worked components, and on including both names (checklist) and metadata sources into the search functionality. The implied changes on the Portal user interface side are quite fundamental. With other known and future requirements on user interface functionality, the time has now come to replace the old Portal code base. At present, we are working with an external team to develop wireframes for key Portal pages, based on functionality requests from GBIFS regarding the integration of the new data areas and following evaluation of a number of sources (task group reports, reviews, participant reports etc). Those wireframes will aid further discussions on functionality starting from July, and also build the basis for implementation in 2011 and after. Once there is a public version available to look at, we will give an update.