Friday 26 August 2011

Darwin Core Archives for Species Checklists

GBIF has long had an ambition to support the sharing of annotated species checklists through the network. Realising this ambition has been frustrated by the lack of a data exchange standard of sufficient scope and simplicity to promote the publication of this type of resource. In 2009, the Darwin Core standard was formally ratified by Biodiversity Information Standards (TDWG). The addition of new terms, and a means of expressing these terms in a simplified and extensible text-based format, paved the way for the development of a data exchange profile for species checklists known as the Global Names Architecture (GNA) Profile. Species checklists published in this format can be zipped into single, portable 'archive' files.
Here I introduce two example archives that illustrate the flexible scope of the format. The first represents a very simple species checklist while the second is a more richly documented taxonomic catalogue. The contents of any file can be viewed by clicking on the file icon or filename. A complete list of terms used in sharing checklists can be found here.
Example 1: U.S. National Arboretum Checklist
This checklist represents the simplest possible checklist archive. It consists of a document that describes the checklist and a second file with the checklist data itself. The checklist data consist of two columns. Note that because the column headers match the standard Darwin Core term names, no additional mapping document is needed.
EML.xml The checklist is documented using an Ecological Metadata Language (EML) document.
Checklist.txt The checklist itself is kept in this simple text file.
Example 2: Catalog of Living Whales
This checklist represents an annotated species checklist. In addition to the core species list ('whales.tab') there are several other data files, each a Darwin Core extension that conforms to the GNA Profile. This more complex archive contains a resource map file ('meta.xml') that describes the files in the archive, and an EML metadata document that describes the catalog itself. A common identifier, taxonID, links the data in the extension files to the records in the core species checklist ('whales.tab'); a small sketch of this linkage follows the file list below.
EML.xml The checklist is documented using an Ecological Metadata Language (EML) document. It includes a title, contacts, citation information and more.
whales.tab The checklist itself is kept in this tab-delimited file.
meta.xml The files in the archive are described in this resource map file.
distribution.tab Distribution information conforming to the GNA Distribution extension is stored in this file.
references.tab Bibliographic references are stored in this file and linked to 'whales.tab' via the taxonID.
types.tab Type specimen details are contained in this file.
vernaculars.tab Common name information that conforms to the GNA Vernacular Extension is stored in this file.
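As a rough illustration of how the taxonID linkage works, the following minimal Java sketch reads the core file and the vernacular extension and prints the common names next to each core record. The file names match the example above, but the column positions (taxonID first, then the name) are assumptions for illustration only; in a real archive the column order is declared in 'meta.xml'.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ChecklistJoin {
      public static void main(String[] args) throws IOException {
        // Index vernacular names by taxonID (assumption: column 0 = taxonID, column 1 = vernacular name)
        Map<String, List<String>> namesByTaxon = new HashMap<String, List<String>>();
        BufferedReader ext = new BufferedReader(new FileReader("vernaculars.tab"));
        String line;
        while ((line = ext.readLine()) != null) {
          String[] cols = line.split("\t");
          List<String> names = namesByTaxon.get(cols[0]);
          if (names == null) {
            names = new ArrayList<String>();
            namesByTaxon.put(cols[0], names);
          }
          names.add(cols[1]);
        }
        ext.close();

        // Walk the core checklist and print each taxon with its linked common names
        BufferedReader core = new BufferedReader(new FileReader("whales.tab"));
        while ((line = core.readLine()) != null) {
          String[] cols = line.split("\t");
          // assumption: column 0 = taxonID, column 1 = the scientific name
          System.out.println(cols[1] + " -> " + namesByTaxon.get(cols[0]));
        }
        core.close();
      }
    }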

Sunday 21 August 2011

Configuring Drupal and some modules for ticketing emails

We at the Secretariat receive enquiries via helpdesk[at]gbif[dot]org, portal[at]gbif[dot]org and info[at]gbif[dot]org every day, or I would say almost every hour. Some of them are provider-specific questions that need special attention from staff, while others are FAQs. We have been thinking about how to manage questions and issues better, so that by adding a little bit of structure to the collaborative workflow, we can:

1. Make sure questions are answered satisfactorily;
2. Estimate how many person-hours have been spent, or evaluate performance;
3. Improve the efficiency of helpdesk activities.

To achieve these goals, we need software that meets these requirements:
1. Case management for incoming emails;
2. A Q&A cycle can be completed solely by email. Web forms are good but not necessary in the beginning;
3. Easily configured knowledge base articles;
4. Graphical reports showing helpdesk performance;
5. Automatic escalation of case status.

We looked for options on the Open Source Help Desk List. Most of the promising choices are tailored to the software development cycle, while some are commercial packages/services that are indeed designed for enterprise help desk needs. While evaluating a few of those packages, I also found that with Drupal and some modules, a solution that meets our needs is pretty much ready out of the box. The result is quite convincing and I can imagine the transition won't require too much learning from my colleagues.

Here is the recipe.

Materials and methods:

1. A mail server. All right, I admit this is not easy if you're not a system administrator. We use Dovecot to provide IMAP access to emails.
2. A Drupal installation. Installation instructions are here. As a wimp I chose version 6.
3. The Support module. Downloadable at http://drupal.org/project/support.
support_deadline, support_fields, support_timer, support_views, support_token, and support_nag are relevant modules that fit our purposes.
4. The CCK module. Downloadable at http://drupal.org/project/cck.
5. The Views module. Downloadable at http://drupal.org/project/views.
6. The Google Chart module. Downloadable at http://drupal.org/project/chart. Not "charts", which is a different module.
7. The Date module. Downloadable at http://drupal.org/project/date.
8. The Admin Menu module, for your administrative pleasure. Downloadable at http://drupal.org/project/admin_menu.
9. The Views Calc module, required by the Support module. Downloadable at http://drupal.org/project/views_calc.
10. Download all necessary modules to [drupalroot]/sites/all/modules directory. Enable them at [baseURL]/admin/build/modules.
11. You should see a "support ticketing system" menu by now. You need to

  1. Add an email client with an email account you have set up on the Dovecot mail server;
  2. You probably want to change the email template at [baseURL]/admin/support/settings/mail;
  3. Go through the general settings of the ticketing system at [baseURL]/admin/support/settings.
12. Send some testing emails to the testing email address.
13. Visit [baseURL]/admin/support/clients/1/fetch and see if the system retrieves email and creates tickets successfully.

14. After more test emails have been sent to the address and fetched, you can visit [baseURL]/admin/support/charts.

These are just some facets of my explorations so far. Some details are not covered, like permissions in Drupal. A newbie would probably need a crash course in Drupal to get started, but after that things will get easier and faster.

Friday 12 August 2011

Using C3P0 with MyBatis

The problem

In our rollover process, which turns our raw harvested data into the interpreted occurrences you can see on our portal, we now have a step that calls a Web Service to turn geographical coordinates into country names. We use this to enrich and validate the incoming data.

This step in our process usually took about three to four hours, but last week it stopped working altogether without any changes to the Web Service or the input data.

We've spent a lot of time trying to find the problem and while we still can't say for sure what the exact problem is or was we've found a fix that works for us which also allows us to make some assumptions about the cause of the failure.

The Web Service is a project called geocode-ws: a very simple project that uses MyBatis to call a PostgreSQL (8.4.2) & PostGIS (1.4.0) database, which does the GISy work of finding matches.

Our process started out fine. The first few million calls to the Web Service went through and returned reasonably fast, but towards the end the process slowed down until it almost came to a complete stop, with response times of over 10 minutes. That's when our Hadoop maps timed out and failed.

With hindsight we should have come to our final conclusion much earlier but it took us a while.

Looking for the problem

We looked at the PostgreSQL configuration and tweaked it a lot. We added a lot more logging and made sure to log any long-running statements (using the log_min_duration_statement option). We also made sure that our memory settings were sensible and that we weren't running out of memory. Looking at I/O and VM statistics as well as our Cacti monitoring, we could see that this wasn't the case. PostgreSQL didn't seem to be the problem.

We also looked at the OS configuration itself, as well as the connectivity between our Hadoop cluster and the Tomcat and PostgreSQL server, but couldn't find the problem there either.

Then we began improving our Web Service and implemented a JMX MBean to get more detailed information about the process. While our changes should have improved the code base, they didn't fix the problem. Finally we enabled GC logging on our Tomcat server (something we should have done much earlier and will probably do by default for our servers in the future). We didn't do it earlier because the Web Service hadn't shown any symptoms of memory leaks before and we hadn't changed anything there. It hadn't even been restarted in a while.

But as it turned out, the problem was garbage collection. Unfortunately I can't provide pretty graphs because I've overwritten the GC logs, but it was very easy to see (using the awesome GCViewer) the typical pattern of minor collections not reclaiming all space and memory usage growing until almost no memory could be reclaimed and most of the time was spent in garbage collection. We had found the problem! This explained our timeouts.

It still doesn't explain what was leaking, though. And having spent that much time on it already, we quickly gave up trying to find the root cause. We suspect some combination of the MyBatis connection pool, the PostgreSQL JDBC driver and our configuration.

Our workaround (the MyBatis & C3P0 part)

So we looked around for other connection pools to use with MyBatis but unfortunately we couldn't find anything ready-made. There are implementations in the MyBatis-Guice project but they can only be used with the annotation-based configuration, and we're using XML.

We ended up writing our own implementation of a C3P0 DataSourceFactory and it turned out to be very easy: it is just one class (JavaDoc here) with one line of code in it.
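For illustration, a factory along these lines can be written by extending MyBatis' UnpooledDataSourceFactory, which exposes a protected dataSource field and applies the <property> entries from the XML configuration to whatever data source it holds. The class name below is mine, not necessarily the original one:

    import org.apache.ibatis.datasource.unpooled.UnpooledDataSourceFactory;

    import com.mchange.v2.c3p0.ComboPooledDataSource;

    /**
     * A minimal sketch of a C3P0-backed DataSourceFactory for MyBatis.
     * The superclass takes care of applying the <property> values from
     * the MyBatis XML configuration to the data source via its setters.
     */
    public class C3P0DataSourceFactory extends UnpooledDataSourceFactory {

      public C3P0DataSourceFactory() {
        // the "one line of code": hand MyBatis a C3P0 pooled data source instead of its own pool
        this.dataSource = new ComboPooledDataSource();
      }
    }

The factory can then be referenced as the dataSource type in the environment section of the MyBatis XML configuration, with the C3P0 pool settings passed as property entries.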

This not only solved our apparent memory leak, but performance also increased by a factor of two to three. We haven't had a problem with our setup since.

Conclusion

We didn't have the time to find the real problem, but we found a solution that works for us. I suspect that had we gone about this better, we might have found the problem a lot sooner and perhaps identified its real cause. A few lessons learned:
  • Enable GC logging!
  • Enable JMX for Tomcat and set up your applications with useful metrics and logging
  • Even though the use of Profilers is heavily disputed they can often help. We've found YourKit to be excellent
  • Try to follow a logical route, change only one thing at a time, mock things to find a problem
  • Monitor and graph your systems

Wednesday 3 August 2011

Indexing occurrence data using Lucene technology

The GBIF Occurrence Index collects, stores and parses data gathered from different sources to provide fast and accurate access to biodiversity occurrence data. The purpose of having a GBIF index is to optimize the speed, relevance and performance of the search functionality that will be implemented by the new GBIF portal architecture.

Currently, GBIF provides search functionality in its Data Portal backed by a semi-denormalized, indexed relational database design, which allows occurrence information to be found by specifying filters that refine the expected results. That design was envisioned to support the use cases of the current GBIF Data Portal (a Web application); for the next generation of the GBIF platform, a new set of requirements must be met, and it is possible that the current index will not be able to support them. The most relevant of those requirements are: scheduling of batch exports, full-text search, real-time faceted search and probably new schemes for sharing data with other biodiversity networks.

For implementing this new Occurrence Index, several technologies are under evaluation, each taken into consideration for specific features that make it an attractive option:
Technology: Description
PostgreSQL: This relational database has several features worth evaluating: query optimization for JOIN-like queries, a flexible key-value store, partial indices and multicolumn indices.
Lucene Index: At least four options are available for this implementation: a pure Lucene index, Katta, Apache Solr and ElasticSearch.
MySQL: This is the current implementation of the index; an evaluation could help determine whether this technology will be able to support the new use cases.
Key-value systems: Several schema-less data stores are available: CouchDB, Mongo, PostgreSQL hstore. The main concern about these technologies is their capacity to handle a considerable number of records.
In this post I will show some preliminary results of the evaluation of Lucene-based indices, specifically: Lucene itself, Katta, Apache Solr and ElasticSearch. The analysis keeps two concerns separate (only the first, index creation, is covered in this post):
  • Index creation: how the index is created, split (into shards) and merged if necessary.
  • Index use: how the index performs in terms of usability (queries and search patterns), performance (response time) and throughput.

Index Creation

Three scenarios were considered for the index creation phase:
  • Single process - n Indices: a single process creates n shards and the input data are split evenly among them; an IndexWriter is created for each shard. The case n = 1 is considered part of this scenario. The # of shards is a parameter defined by the user and is equal to the # of shards expected at the end of the process.
  • N threads - n Indices: IndexWriter is a thread-safe class, so it can be shared by several threads in order to create a single index. The # of shards is defined by the user and internally determines the # of IndexWriters.
  • Distributed Index creation: the index is created by splitting the input data into N shards; each shard is assigned to one process containing a single IndexWriter, which is responsible for creating that shard.

Single process index creation

The process followed in this scenario is pretty straightforward:
  • The input is a row-delimited file and each column is separated by a special character ('/001' in our case).
  • The # of shards given as input defines the number of IndexWriters (only one IndexWriter can be open on a Lucene index at a time).
  • Each row represents a Lucene document and is stored using one of the available index writers.
  • If multiple indices were created, they are merged into a single index in a final step (using "IndexWriter.addIndexesNoOptimize"; see the sketch after this list).
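For reference, a sketch of that final merge step might look like the following (assuming a Lucene 3.x release; depending on the exact version the method is named addIndexesNoOptimize or simply addIndexes, and the directory paths here are made up):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ShardMerger {

      public static void main(String[] args) throws Exception {
        // Target directory that will hold the merged index
        Directory merged = FSDirectory.open(new File("index-merged"));
        IndexWriterConfig config =
            new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
        IndexWriter writer = new IndexWriter(merged, config);

        // Shard directories produced by the individual IndexWriters
        Directory[] shards = new Directory[] {
            FSDirectory.open(new File("index-shard-0")),
            FSDirectory.open(new File("index-shard-1"))
        };

        // Merge the shards into the target index without optimizing it
        writer.addIndexes(shards);
        writer.close();
      }
    }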
This process was tested using a file of 100 million records. The entire process took 9,200,821 milliseconds (roughly 153 minutes, or about 2.6 hours) to finish. Some optimizations were implemented for this process; it is worth mentioning that the same optimizations were applied in the multithreaded scenario:
  • Re-use the IndexWriter in multiple threads
  • Re-use the org.apache.lucene.document.Document and org.apache.lucene.document.Field instances. The Lucene fields are created in a static block and their values are changed for each new row; then the same document is added to the index. The intention is to avoid these objects being garbage collected.
      //Initialization: create the reusable fields once and add them to a single reusable Document
      static {
        for (int i = 0; i < accFieldsValues.length; i++) {
          fields[i] = new Field(accFieldsValues[i].name(), "", Store.YES, Index.ANALYZED);
          doc.add(fields[i]);
        }
      }
     ...
     //Sets the field value of the current column for this row
     fields[fieldsCount].setValue(stringTokenizer.nextToken());
     ...
     //Adds the same document instance with the new values
     indexWriter.addDocument(doc);
    
  • The compound file format is turned off; this reduces the number of files open at the same time.
  • IndexWriter.autocommit is set to false: since the index doesn't need to serve searches during creation, this feature can be disabled.
  • The flush is triggered by RAM usage, and the RAM buffer is maximized:
     
    //Merge policy tuned through the configurable mergeFactor
    LogByteSizeMergePolicy logByteSizeMergePolicy = new LogByteSizeMergePolicy();
    logByteSizeMergePolicy.setMergeFactor(mergeFactor);
    ...
    //Flush by RAM usage rather than by document count
    indexWriterConfig.setRAMBufferSizeMB(bufferSize);
    
  • Every N documents, an entry is written to a log to report overall progress.

Multithreaded Index Creation

In terms of the optimizations applied, this scenario is very similar to the "Single process" scenario. However, the process is very different in terms of its steps and the resulting index:
  • The input is a row delimited file and each column is separated by a special character ('/001' in our case).
  • The number of rows in the input file is known and is passed as an input parameter.
  • The input file is split evenly into intermediate files; each file is assigned to a thread which reads it to create a Lucene index.
  • The intermediate files are deleted after each index is created.
  • Depending on the number of shards desired, the indices are merged into a smaller set of indices (a rough sketch of the per-thread indexing step follows this list).
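To make the per-thread step concrete, here is a rough sketch of a worker that indexes one intermediate file into its own shard (again assuming a Lucene 3.x API; the file and directory names, the single-field document layout and the pool size are all assumptions for illustration):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class ShardIndexer implements Runnable {

      private final File slice;    // intermediate file holding this thread's rows
      private final File shardDir; // directory where this thread writes its shard

      public ShardIndexer(File slice, File shardDir) {
        this.slice = slice;
        this.shardDir = shardDir;
      }

      public void run() {
        try {
          IndexWriterConfig config =
              new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
          IndexWriter writer = new IndexWriter(FSDirectory.open(shardDir), config);
          BufferedReader reader = new BufferedReader(new FileReader(slice));
          String line;
          while ((line = reader.readLine()) != null) {
            // One document per row; a real indexer would map each column to its own field
            Document doc = new Document();
            doc.add(new Field("row", line, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
          }
          reader.close();
          writer.close();
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }

      public static void main(String[] args) throws InterruptedException {
        // One worker per slice/shard pair
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
          pool.submit(new ShardIndexer(new File("slice-" + i + ".txt"), new File("shard-" + i)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
      }
    }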
This process was run using a pool of 50 threads and an input file with 100 million rows. The execution time is detailed in the next table:
Phase: Time
Slicing (split the input file and distribute it among the threads): 1,045,948 ms (about 17.43 minutes)
Indices creation: 6,890,988 ms (about 114.85 minutes)
Total time: about 132.28 minutes (about 2.2 hours)

Katta for Index Creation

Katta is a distributed store for indices; it currently supports two index types: Lucene indices and Hadoop MapFiles. It uses ZooKeeper to coordinate index creation, replication and search across the nodes.
Main relevant features
  • A Katta index is basically a folder containing sub-folders (shards)
  • Client-node communication is implemented using Hadoop RPC
  • Supports distributed scoring; each search query requires two network round trips: the first to get document frequencies from all shards and the second to perform the query
  • Provides functionality to merge indices (though this is not a very complex task to implement using the standard Lucene libraries)
Relevant issues found
  • Documentation is not extensive and lacks necessary detail.
  • Small community and very low development activity: the last commit was made on 2009-04-2.
  • Doesn't provide any help for creating the indices; index sharding must be done before importing them into Katta.
Test environment configuration
  • A cluster of 4 Katta nodes in 2 servers was used.
  • The index was split in 8 shards.
  • The master configuration is replicated in each node using passphraseless ssh access between master and nodes.
  • The ZooKeeper server was embedded into the Katta master node.
     katta.zk.properties (file) ==>zookeeper.embedded=true 
         
  • Each node contains 2 shards, each shard is replicated in 2 nodes.
  • The Lucene sharded index contains 100 million documents and was stored in the Hadoop distributed file system.
Index creation
Since Katta doesn't provide any functionality to create a Lucene index from scratch, the index was built using the multithreaded application described above and the shards were copied into the Hadoop DFS. Then the sharded index was imported into Katta using the command line:
bin/katta addIndex occurrence hdfs://namenode:port/occurrence/shardedindex/ 2
("2" means a replication factor of 2). Importing a index into Katta is just a matter of copy the file from Hadoop and update the Index status in the ZooKeeper server, so the index creation is external factor to Katta. In a next post the "Distributed Index Creation" scenario will be analyzed as well as the technologies ElasticSearch and Solr for index creation...