Thursday 19 May 2011

Indexing bio-diversity metadata using Solr: schema, import-handlers, custom-transformers

This post is the second part of OAI-PMH Harvesting at GBIF. That post explained how different OAI-PMH services are harvested. This post introduces the overall architecture of the index built from the information gathered from those services.

Let's start by justifying why we needed a metadata index at GBIF. One of our main requirements was to allow end users to search datasets. To enable this, the system provides two main search functionalities: Full Text Search and Advanced Search. For both, the system displays a list of datasets containing the following information: title, provider, description (abstract) and a hyperlink to view the full metadata document in the original format (DIF, EML, etc.) provided by the source; all of that information is collected by the harvester. Search results had to be displayed with, among others, two specific features: highlighting of the text that matched the search criteria, and grouping/filtering of the results by facets: providers, dates and OAI-PMH services.

To provide these search features we could not rely on the capabilities of a database alone, so we decided to build an index designed around the search requirements. An index is like a single-table database without any support for relational queries; its only purpose is to support search, not to be the primary source of data. The structure of the index is denormalized and contains just the data that needs to be searched. The index was implemented using Solr, an open source enterprise search platform. It has numerous other features such as search result highlighting, faceted navigation, query spell correction, auto-suggest queries and “more like this” for finding similar documents.
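To make the highlighting and faceting requirements more concrete, here is a minimal sketch of request-handler defaults in solrconfig.xml. It is not taken from the GBIF configuration: the handler name /search, the choice of the dismax query parser and the field weights are assumptions for illustration, while the field names are the schema fields described next. Date facets are left out to keep the sketch short.

```xml
<!-- Hypothetical solrconfig.xml excerpt; handler name and values are assumptions -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Full text search across the main text fields -->
    <str name="defType">dismax</str>
    <str name="qf">title^2 description fullText</str>

    <!-- Highlight the text that matched the search criteria -->
    <str name="hl">true</str>
    <str name="hl.fl">title description</str>

    <!-- Group/filter results by provider and by OAI-PMH service -->
    <str name="facet">true</str>
    <str name="facet.field">providerExact</str>
    <str name="facet.field">serverId</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>
```

The same parameters can also be sent with each request instead of being fixed as defaults; which approach the metadata application took is not stated here.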
The metadata application stores a subset of the information available in the metadata documents as Solr fields, and a special field (fullText) stores the whole XML document to enable full text search. The schema fields are:
  • id: the full file name is used for this field
  • title: title of the dataset
  • provider: provider of the dataset
  • providerExact: same as the previous field, but uses String data type for facets and exact match search
  • description: description or abstract of the dataset
  • beginDate: begin date of the temporal coverage of dataset
  • endDate: end date of the temporal coverage of dataset; when the input format only supports one dataset date, beginDate and endDate contain the same value
  • westBoundingCoordinate: Geographic west coordinate
  • eastBoundingCoordinate: Geographic east coordinate
  • northBoundingCoordinate: Geographic north coordinate
  • southBoundingCoordinate: Geographic south coordinate
  • fullText: The complete text of the XML metadata document
  • externalUrl: URL containing specific information about the dataset; in the case of the
  • serverId: Id of the source OAI-PMH Service; this information is taken from the file system structure and is used for the facets search.
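As an illustration, the field list above could be declared in schema.xml roughly as follows. This is a sketch under assumptions, not the schema used at GBIF: the field type names (string, text, date, double) are those of the stock Solr example schema, the indexed/stored flags are guesses, and using copyField to fill providerExact is only one possible way to keep it in sync with provider.

```xml
<!-- Hypothetical schema.xml excerpt; the <types> section is omitted and
     all type names and flags are assumptions -->
<schema name="metadata" version="1.4">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="provider" type="text" indexed="true" stored="true"/>
    <!-- Untokenized copy of provider for facets and exact match search -->
    <field name="providerExact" type="string" indexed="true" stored="true"/>
    <field name="description" type="text" indexed="true" stored="true"/>
    <field name="beginDate" type="date" indexed="true" stored="true"/>
    <field name="endDate" type="date" indexed="true" stored="true"/>
    <field name="westBoundingCoordinate" type="double" indexed="true" stored="true"/>
    <field name="eastBoundingCoordinate" type="double" indexed="true" stored="true"/>
    <field name="northBoundingCoordinate" type="double" indexed="true" stored="true"/>
    <field name="southBoundingCoordinate" type="double" indexed="true" stored="true"/>
    <!-- Whole XML metadata document, for full text search -->
    <field name="fullText" type="text" indexed="true" stored="true"/>
    <field name="externalUrl" type="string" indexed="false" stored="true"/>
    <field name="serverId" type="string" indexed="true" stored="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>

  <!-- Assumption: providerExact is populated by copying provider -->
  <copyField source="provider" dest="providerExact"/>
</schema>
```

Keeping a tokenized text field for searching and a separate string field for faceting is the usual reason for a pair like provider/providerExact: facet counts are then computed on whole provider names rather than on individual words.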
The XML documents gathered by the harvester are imported into Solr using data import handlers, one for each input format (EML, DIF, etc.). An example of one of the data import handlers is the following, used to index Dublin Core XML files:
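The listing here is only a minimal sketch of how such a handler can be configured, and not necessarily the configuration used at GBIF. The base directory, the entity names, the XPath expressions other than /dc/date, and the endDate line (in particular selectedDatePosition="2") are assumptions made for illustration; the components and the attributes of the custom date transformer are the ones described further down in this post.

```xml
<!-- Hypothetical data-config.xml for Dublin Core files; paths, entity names
     and most XPaths are illustrative assumptions -->
<dataConfig>
  <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- Outer entity: enumerates the harvested files, creates no documents itself -->
    <entity name="file" processor="FileListEntityProcessor"
            baseDir="/path/to/harvested/dc" fileName=".*\.xml"
            recursive="true" rootEntity="false" dataSource="null">

      <!-- One Solr document per file; fields are selected with XPath -->
      <entity name="record" processor="XPathEntityProcessor"
              dataSource="fileReader"
              url="${file.fileAbsolutePath}" forEach="/dc"
              transformer="TemplateTransformer,org.gbif.solr.handler.dataimport.ListDateFormatTransformer">
        <!-- id: the full file name, built by TemplateTransformer -->
        <field column="id" template="${file.fileAbsolutePath}"/>
        <field column="title" xpath="/dc/title"/>
        <field column="provider" xpath="/dc/publisher"/>
        <field column="description" xpath="/dc/description"/>
        <field column="beginDate" listDateTimeFormat="yyyy-MM-dd"
               selectedDatePosition="1" separator="_" lastDay="false"
               xpath="/dc/date"/>
        <!-- Assumed counterpart for the end of the date range -->
        <field column="endDate" listDateTimeFormat="yyyy-MM-dd"
               selectedDatePosition="2" separator="_" lastDay="true"
               xpath="/dc/date"/>

        <!-- Nested entity: reads the raw XML file into the fullText field -->
        <entity name="raw" processor="PlainTextEntityProcessor"
                dataSource="fileReader" url="${file.fileAbsolutePath}">
          <field column="plainText" name="fullText"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Because the outer entity sets rootEntity="false", each file yields one document from the record entity, and the nested raw entity adds the complete file content to that same document. Fields such as serverId and externalUrl are left out of this sketch.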

The data import handlers are implemented using three kinds of components available in Solr (a data source, entity processors and transformers):
  • FileDataSource: allows fetching content from files on disk.
  • FileListEntityProcessor: an entity processor used to enumerate the list of files.
  • XPathEntityProcessor: used to index the XML files; it allows XPath expressions to be defined to retrieve specific elements.
  • PlainTextEntityProcessor: reads all content from the data source into a single field; this processor is used to import the whole XML file into one field.
  • DateFormatTransformer: parses date/time strings into java.util.Date instances; it is used for the date fields.
  • RegexTransformer: helps in extracting or manipulating values from fields (from the source) using Regular Expressions.
  • TemplateTransformer: used to overwrite or modify any existing Solr field or to create new Solr fields; it is used to create the id field.
  • org.gbif.solr.handler.dataimport.ListDateFormatTransformer: a custom transformer that handles non-standard date formats that are common in the input, such as 12-2010, 09-1988 and (1998)-(2000). It has three important attributes: i) separator, which defines the character/string used as separator between the year and month fields; ii) lastDay, which defines whether a bare year value (e.g., 1998) should be resolved to the first or the last day of the year: if the year is interpreted as a beginDate, lastDay is set to false and the value becomes yyyy-01-01; if it is interpreted as an endDate, lastDay is set to true and the value becomes yyyy-12-31; iii) selectedDatePosition, which defines which date is processed when the input field contains a range of dates. For example:

<field column="beginDate" listDateTimeFormat="yyyy-MM-dd" selectedDatePosition="1" separator="_" lastDay="false" xpath="/dc/date"/>

This imports “dc/date” into beginDate using “_” as separator; selectedDatePosition="1" states that the date to be processed is the first one in the range of dates, and lastDay is therefore set to false. The implementation of this custom transformer is available on the Google Code site. The web interface can be visited at this url; in a future post I'll explain how the user interface was implemented using a simple AJAX framework.
