Thursday 8 December 2011

Updating a customized IPT

This post originally appeared on the Canadensys blog and is a follow-up to the post Customizing the IPT.

As mentioned at the very end of my post about customizing the IPT, I face a problem when I want to install a new version of the GBIF Integrated Publishing Toolkit: installing it will overwrite all my customized files! Luckily Tim Robertson gave me a hint on how to solve this: a shell script to reapply my customization.

Here's how it works (for Mac and Linux systems only):

Comparing the customized files with the default files

First of all, I need to compare my customized files with the files from the new IPT. They might have changed to include new functionalities or fix bugs. So, I installed the newest version of IPT on my localhost, opened the default files and compared them with my files. Although there are tools to compare files, I mostly did this manually. The biggest change in version 2.0.3 was the addition of localization, for which I'm using a different UI, so I had to tweak some things here and there. It took me about 3 hours until I was satisfied with the new customized IPT version on my localhost.

I also subscribed to the RSS of the IPT Google Code website, to be notified of any changes in the code of "my" files, but I was just using this as a heads-up for coming changes. It is more efficient to change everything at once, when a stable version of IPT is out.

Setting up a file structure

This is how we've organized the files on our server. I've created a folder called ipt-customization, which contains all my customized files. That way, they can never be overwritten by a new IPT installation, which gets deployed in webapps. The folder also contains a script to apply the customization and a folder to backup the default files currently used by IPT.

  • ipt-data
  • webapps
    • ipt
  • ipt-customization
    • backup-default
    • apply-customization.sh
    • revert-customization.sh
    • header.ftl
    • header_setup.ftl
    • menu.ftl
    • footer.ftl
    • main.css
    • custom.js

Creating the shell script

The apply-customization.sh script works in two steps:

  1. Backup the default files, by copying them from the IPT to the folder backup-default. The script will ask if I want to overwrite any previously backed up files. This check is important if I run the script several times: in that case I do not want to overwrite the backups with the already customized files.
  2. Overwrite the files currently used by the IPT with the customized files, by copying them from my ipt-customization folder to the correct folder in the IPT.
# backup files of new IPT installation
cp -i ../webapps/ipt/WEB-INF/pages/inc/footer.ftl ../ipt-customization/backup-default/
cp -i ../webapps/ipt/WEB-INF/pages/inc/header_setup.ftl ../ipt-customization/backup-default/
cp -i ../webapps/ipt/WEB-INF/pages/inc/header.ftl ../ipt-customization/backup-default/
cp -i ../webapps/ipt/WEB-INF/pages/inc/menu.ftl ../ipt-customization/backup-default/
cp -i ../webapps/ipt/styles/main.css ../ipt-customization/backup-default/

# apply customization
cp footer.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp header_setup.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp header.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp menu.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp main.css ../webapps/ipt/styles/
cp custom.js ../webapps/ipt/js/

I also created a script revert-customization.sh, to revert the customization to the default IPT, in case something is broken. It copies the backed-up files back to the IPT and removes custom.js:

# revert customization
cp backup-default/footer.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp backup-default/header_setup.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp backup-default/header.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp backup-default/menu.ftl ../webapps/ipt/WEB-INF/pages/inc/
cp backup-default/main.css ../webapps/ipt/styles/
rm ../webapps/ipt/js/custom.js

Running the script

From the command line, I log in to my server, navigate to the folder ipt-customization and make my script executable:

chmod +x apply-customization.sh

I only have to do this the first time I want to use my script. From then on I can use:

sh ./apply-customization.sh

to execute the script and customize my new version of the IPT!

Monday 5 December 2011

Bug fixing in the GBIF Data Portal

While we continue our efforts to develop the new Portal v2, our current data portal at data.gbif.org has not been left unattended. Bug fixes are made periodically based on feedback sent to us by our user community. In order to keep our community informed, this post summarizes the most important fixes and enhancements done in the past months:

  • The data portal's main page now shows the total number of occurrence records with coordinates, along with the total count of records (non-georeferenced and georeferenced).
  • Decimal coordinate searches were not working properly. When a user wanted to refine a coordinate search to use decimals, the data portal was returning an erroneous count of occurrence records. The issue was fixed; details here.
  • Any feedback e-mail message sent from an occurrence or a taxon page now includes the original sender's email address in the CC field. Previously the sender's email address was not included in the feedback email, which represented a problem when the receiver replied to the email, but the sender never knew about the reply.
  • The Taxon Web Service's GET operation was returning errors when trying to request some specific taxa. The problem was detected and fixed.
  • On an occurrence detail page, when retrieving the original record from a data publisher, and the source data came from a Darwin Core Archive, it was not possible to retrieve a single record due to the single-file nature of a DwC Archive. (As opposed to a DiGIR request, in which you could extract just a single record). A fix was introduced so that the user can decide if he/she wants to download the complete archive (see an example).
  • When using the data portal's Web Services to produce KML output, there were some problems when the generated KML contained HTML elements and Google Earth tried to open the file for visualization (This is a standard problem of XML). A small fix was introduced to escape the conflicting HTML inside the KML output.
  • Other small GUI enhancements were also made.

Updates to the data portal's codebase are now done infrequently, but our goal is to fix any major issues that our user community reports. If you ever encounter problems, please don't hesitate to contact us at portal@gbif.org.


Wednesday 9 November 2011

Important Quality Boost for GBIF Data Portal


Improvements speed processing, “clean” name and location data, enable checklist publishing.

[This is a reposting from the GBIF news site]
A major upgrade to enhance the quality and usability of data accessible through the GBIF Data Portal has gone live.

The enhancements are the result of a year’s work by developers at the Copenhagen-based GBIF Secretariat, in collaboration with colleagues throughout the worldwide network.

They respond to a range of issues including the need for quicker ‘turnaround’ time between entering new data and their appearance on the portal; filtering out inaccurate or incorrect locations and names for species occurrences; and enabling species checklists to be indexed as datasets accessible through the portal.

After a testing period, the changes now apply to the more than 312 million biodiversity data records currently indexed from some 8,500 datasets and 340 publishers worldwide.
Key improvements include:

•    processing time for data has fallen from 3-4 days to around 36 hours, paving the way for more frequent ‘rollovers’ or index updates;
•    the ‘backbone taxonomy’ used by the GBIF Portal has been reworked with up-to-date checklists and taxonomic catalogues such as the Catalogue of Life 2011, improving search and download;
•    checklists describing species in particular geographic locations, taxonomic groups or thematic categories (eg. invasives) can now be published using a standard set of terms called the Global Names Architecture (GNA) Profile (see GNA guidelines) and thus become indexed and accessible via the Data Portal;
•    automated interpretation of the coordinates, country location and scientific names used in published records has been improved to screen out inaccuracies – for example, ensuring that records identified as coming from a particular country are shown as occurring within the borders and territorial waters of that country; and
•    a mechanism using the Hadoop open-source software system has been introduced to ensure that the Data Portal is able to cope with anticipated future growth in the volume of data.
The algorithms and dictionaries developed to improve interpretation of data published through the GBIF Data Portal are intended for future re-use by the wider biodiversity informatics community.

Commenting on the release of these substantive Data Portal improvements, GBIF Executive Secretary Nicholas King said: “These changes represent a major step forward in the usefulness of GBIF to science and society.

“They are a direct response to the feedback we have had from the data publishing and user communities, and will enable an even greater return on the long-term investment made over the past decade by GBIF Participant countries.”
IPT v.2.0.3 launched

The GBIF Secretariat has also issued a new release of the Integrated Publishing Toolkit (IPT), which enables biodiversity data updates to be ‘harvested’ automatically from databases published to the Internet.

IPT version 2.0.3 addressed 76 reported issues from the previous version, and includes translations into French and Spanish.

Instructions on installing the new version are available here

Friday 21 October 2011

Integration tests with DBUnit

Database driven JUnit tests
As part of our migration to a solid, general testing framework we are now using DbUnit for database integration tests of our database service layer with JUnit (on top of liquibase for the DDL).

Creating a DbUnit test file
As it can be painful to maintain a relational test dataset with many tables, I've decided to dump a small, existing Postgres database into the DbUnit XML structure, namely FlatXML. It turned out to be less simple than I had hoped.

First I've created a simple exporter script in Java that dumps the entire DB into XML. Simple.

The first problem I've stumbled across was a column named "order" which caused a SQL exception. It turns out DbUnit needs to be configured for specific databases, so I've ended up using three configurations to both dump and read the files.
  1. Use Postgres specific types
  2. Double quote column and table names
  3. Enable case sensitive table & column names (now that we use quoted names, Postgres becomes case sensitive)
After that, reading in the DbUnit test file started out fine, but ran into a weird NullPointerException that left me puzzled. After trying various settings I finally found a log warning that some columns might not be detected properly by DbUnit, as by default it only inspects the first XML record, which can contain many null columns that will subsequently be ignored. Luckily, since version 2.4.7 of DbUnit you can tell the builder that reads in the test files to first scan all records in memory, a feature known as column sensing. That got me a long way, but ultimately I hit a much harder issue: relational integrity.
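
For reference, a rough sketch of those three settings plus column sensing might look like this (DbUnit 2.4.x API; the class name, connection handling and file name are illustrative, not the actual GBIF code):

import java.io.FileInputStream;

import org.dbunit.database.DatabaseConfig;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.ext.postgresql.PostgresqlDataTypeFactory;

public class DbUnitPostgresSetup {

  // Apply the three Postgres-specific settings described above.
  public static void configure(IDatabaseConnection connection) {
    DatabaseConfig config = connection.getConfig();
    // 1. use Postgres specific data types
    config.setProperty(DatabaseConfig.PROPERTY_DATATYPE_FACTORY, new PostgresqlDataTypeFactory());
    // 2. double quote column and table names (handles the "order" column)
    config.setProperty(DatabaseConfig.PROPERTY_ESCAPE_PATTERN, "\"?\"");
    // 3. quoted names make Postgres case sensitive, so tell DbUnit as well
    config.setProperty(DatabaseConfig.FEATURE_CASE_SENSITIVE_TABLE_NAMES, Boolean.TRUE);
  }

  // Read a FlatXML test file with column sensing enabled, so that all records
  // are scanned for columns instead of only the first one.
  public static IDataSet readDataSet(String flatXmlFile) throws Exception {
    FlatXmlDataSetBuilder builder = new FlatXmlDataSetBuilder();
    builder.setColumnSensing(true);
    return builder.build(new FileInputStream(flatXmlFile));
  }
}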

The classic way to avoid integrity checks during inserts (including DbUnit) is simply to temporarily disable all foreign key constraints. On some databases this is simple. For example in MySQL you can simply execute SET FOREIGN_KEY_CHECKS=0 in your db connection. In H2 there is an equivalent of SET REFERENTIAL_INTEGRITY FALSE. Unfortunately there is nothing like that in PostgreSQL. You will have to disable all constraints individually and then painfully recreate them. In our case these were nearly a hundred constraints and I didn't want to go down that route.

The latest DbUnit comes with a nice DatabaseSequenceFilter to automatically sort the tables being dumped in an order that respects the constraints. That worked very well for all constraints across tables, but of course it failed to sort the individual records in tables which contain a self reference, for example the taxonomy table, which has an adjacency list via parent_fk. Luckily I had only one table like this and it already included nested set indices (lft, rgt) that allowed me to sort the records in parent-first order. For this I had to issue a custom SQL query though, so I ended up dumping the entire database with all tables using the filter and, in addition, exporting that one table with a custom SQL query that I then manually copied into the complete XML dump file. Voilà, finally a working DbUnit test file!
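
A minimal sketch of that dependency-ordered dump (the class name and output path are illustrative; the self-referencing taxonomy table still needs the separate, custom-sorted export described above):

import java.io.FileOutputStream;

import org.dbunit.database.DatabaseSequenceFilter;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.FilteredDataSet;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.filter.ITableFilter;
import org.dbunit.dataset.xml.FlatXmlDataSet;

public class DependencyOrderedDump {

  // Dump the whole database as FlatXML, with tables ordered so that
  // foreign key targets are written before the rows referencing them.
  public static void dump(IDatabaseConnection connection, String outputFile) throws Exception {
    ITableFilter filter = new DatabaseSequenceFilter(connection);
    IDataSet ordered = new FilteredDataSet(filter, connection.createDataSet());
    FlatXmlDataSet.write(ordered, new FileOutputStream(outputFile));
  }
}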

DatabaseDrivenTest for MyBatis Services
In order to load the test data into a test DB for every JUnit test we decided to use a JUnit Rule definition that is executed before each test. The class responsible for most of the magic is DatabaseDrivenTestRule which is parameterized for the specific MyBatis Service to be tested. It is generic and can be used with any database system. The subclass DatabaseDrivenChecklistBankTestRule<T> then adds the database specific configurations and can be used as a Rule within the individual tests.

A simple & clean integration test example now looks like this:
public class ReferenceServiceMyBatisTest {

  @Rule
  public DatabaseDrivenChecklistBankTest ddt = new DatabaseDrivenChecklistBankTest(ReferenceService.class, "squirrels-full.xml");

  @Test
  public void testGet() {
    Reference ref = ddt.getService().get(37);
    assertEquals("Wilson, D. E. ; Reeder, D. M. Mammal Species of the World", ref.getCitation());
    assertEquals(100000025, ref.getUsageKey());
  }
}

Isn't that gorgeous? We only need to pass the DbUnit test file and the service class to be tested to the JUnit Rule, and then we can focus on testing the service results! No additional setup or teardown is needed.

Thursday 20 October 2011

GBIF Portal: Geographic interpretations

The new portal processing is about to go into production, and during testing I was drawing some metrics on the revised geographic interpretation.  It is a simple issue, but many records have coordinates that contradict the country that the record claims to be in.  Some illustrations of this were previously shared by Oliver.

The challenge here is twofold. Firstly, we see many variations in the country name, which need to be interpreted. Some examples for Argentina are given below (there are hundreds of variations per country):

  • Argent.
  • Argentina
  • Argentiana
  • N Argentina
  • N. Argentina
  • ARGENTINA
  • ARGENTINIA
  • ARGENTINNIA
  • "ARGENTINIA"
  • ""ARGENTINIA""
  • etc etc
We have abstracted the parsing code into a separate Java library which makes use of basic algorithms and dictionary files to help interpret the results.  This library might be useful for other tools requiring similar interpretation, or data cleaning efforts, and will be maintained over time as it will be in use in several GBIF tools.
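
The library itself is not shown here, but the general dictionary-plus-normalisation idea can be sketched in a few lines (the class, variants and rules below are purely illustrative and much simpler than the real implementation):

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CountryNameInterpreter {

  // A tiny, illustrative dictionary mapping normalised variants to ISO codes;
  // the real dictionaries contain hundreds of variants per country.
  private final Map<String, String> variants = new HashMap<String, String>();

  public CountryNameInterpreter() {
    variants.put("ARGENTINA", "AR");
    variants.put("ARGENT", "AR");
    variants.put("ARGENTINIA", "AR");
  }

  // Strip quotes, dots and simple direction prefixes before the lookup.
  private String normalise(String verbatim) {
    String s = verbatim.toUpperCase(Locale.ENGLISH);
    s = s.replaceAll("[\"'.]", " ");          // drop quotes and dots
    s = s.replaceAll("^(N|S|E|W)\\s+", "");   // drop prefixes like "N " in "N Argentina"
    return s.trim().replaceAll("\\s+", " ");
  }

  // Returns the ISO country code, or null if the value cannot be interpreted.
  public String interpret(String verbatim) {
    return variants.get(normalise(verbatim));
  }
}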

The second challenge is that we need to determine if the point falls within the country.  There is always room for improvement in this area, such as understanding changes over time, but due to the huge volume of outliers in the raw data a check like this is required.  Our implementation is a very basic reverse georeferencing RESTful web service that takes a latitude and longitude, and returns the proposed country and some basic information such as its title.  Operating the service requires PostGIS and a Java server like Apache Tomcat.  Currently we make use of freely available terrestrial shapefiles and marine exclusive economic zones.  It would be trivial to expand the service to use more shapefiles for other uses, and this is expected to happen over time.  Currently the GBIF service is an internal-only processing service, but it is expected to be released for public use in the coming months.

Improving the country name interpretation and making use of a more accurate geospatial verification service than before will help improve data reporting at the national level using the GBIF portal, as indicated here.

                             # Records      # Georeferenced
Argentina       Previously   665,284        284,012
                Now          680,344        303,889
United States   Previously   79,432,986     68,900,415
                Now          81,483,086     70,588,182



Friday 7 October 2011

Group synergy

During the last few weeks we have been intensively designing and implementing what will become the new data portal. Oliver described nicely the new stage our team has entered in his last blog post, Portal v2 - There will be cake. In my personal opinion, this has truly been a group experience, as we have decided to change our way of working. Normally each of us would have worked on a different component and we would later try to integrate everything, but now we have taken the approach of all focusing on one subcomponent and driving our efforts into it. From my point of view, the main advantage of this is that we reduce the bus factor risk that we, as a small group of developers, are quite exposed to. Communication has increased within our team as we are all on the same page now.

As a general overview, the portal v2 will consist of different subcomponents (or sub-projects) that need to interact with each other to come up with a consolidated "view" for the user. Currently we have 3 different sub-projects on our plate: Checklist Bank, Registry and Occurrence Store. Our plan is to have an API (exposed through web services) which will offer all the data necessary (from these projects) for the portal to consume. The portal will then only need a simple web service client to communicate with this API.

Currently we have been working on the Checklist Bank sub-project. As Oliver pointed out in his previous post, some members of our team are more familiar with certain sub-projects, and the checklist one is no exception. So for many of us, including me, it has been a learning experience. We have started development following very strict guidelines on API design and code conventions (which we document internally for our own use). Even decisions that would normally be taken in seconds by a single developer are placed under group scrutiny so we are all on the same track. We have committed ourselves to applying the best coding practices.

Specifically on the checklist sub-project, we have come up with a preliminary API. Please note this API won't be exposed to the public as it is; it is subject to change as we try to refine it. It is just nice to show the outside world what we have been working on.




I personally think these are exciting times inside GBIF and that the final product of all this effort will be a great tool that will benefit the community in big ways. Expect more from us!


Tuesday 27 September 2011

Portal v2 - There will be cake

The current GBIF data portal was started in 2007 to provide access to the network's biodiversity data - at the time that meant a federated search across 220 providers and 76 million occurrence records. While that approach has served us well over the years, there are many features that have been requested for the portal that weren't addressable in the current architecture. Combined with the fact that we're now well over 300 million occurrence records, with millions of new taxonomic records to boot, it becomes clear that a new portal is needed. After a long consultation process with the wider community the initial requirements of a new portal have been determined, and I'm pleased to report that work has officially started on its design and development.

For the last 6 months or so the development team has been working on improving our rollover process, registry improvements, IPT development, and disparate other tasks. The new portal marks an important milestone in our team development as we're now all working on the portal, with as little distraction from other projects as we can manage. Obviously we're still fixing critical bugs and responding to data requests, etc, but all of us focusing on the same general task has already shown dividends in the conversations coming out of our daily scrums. Everyone being on the same page really does help.

And yes, we've been using daily stand-up meetings that we call "scrums" for several months, but the new portal marks the start of our first proper attempt at agile software development, including the proper use of scrum. Most of our team has had some experience with parts of agile techniques, so we're combining the best practices that everyone has had to make the best system for us. Obviously the ideal of interchangeable people with no single expert in a given domain is rather hard for us when Tim, Markus, Kyle and Jose have worked on these things for so long and people like Lars, Federico and I are still relatively new (even though we're celebrating our one year anniversaries at GBIF in the next weeks!), but we're trying hard to have non-experts working with experts to share the knowledge.

In terms of managing the process, I (Oliver) am acting as Scrum Master and project lead. Andrea Hahn has worked hard at gathering our initial requirements, turning them into stories, and leading the wireframing of the new portal. As such she'll be acting as a Stakeholder to the project and help us set priorities. As the underlying infrastructure gets built and the process continues I'm sure we'll be involving more people in the prioritization process, but for now our plates are certainly full with "plumbing". At Tim's suggestion we're using Basecamp to manage our backlog, active stories, and sprints, following the example from these guys. Our first kickoff revealed some weaknesses in mapping Basecamp to agile, and the lack of a physical storyboard makes it hard to see the big picture, but we'll start with this and re-evaluate in a little while - certainly it's more important to get the process started and determine our actual needs rather than playing with different tools in some kind of abstract evaluation process. Once we've ironed out the process and settled on our tools we'll also make them more visible to the outside world.

We're only now coming up on the end of our first, 2 week sprint, so it will take a few more iterations to really get into the flow, but so far so good, and I'll report back on our experience in a future post.

(If you didn't get it, apologies for the cake reference)

Thursday 15 September 2011

VertNet and the GBIF Integrated Publishing Toolkit

(A guest post from our friends at VertNet, cross-posted from the VertNet blog)

This week we’d like to discuss the current and future roles of the GBIF Integrated Publishing Toolkit (IPT) in VertNet. IPT is a Java-based web application that allows a user to publish and share biodiversity data sets from a server. Here are some of the things IPT can do:


  1. Create Darwin Core Archives. In our post about data publishing last week, we wrote about Darwin Core being the “language of choice” for VertNet. IPT allows publishers to create Darwin Core data records from either files or databases and to export them in zipped archive files that contain exactly what is needed by VertNet for uploading.
  2. Make data available for efficient indexing by GBIF. VertNet has an agreement with its data publishers that, by participating, they will also publish data through GBIF. GBIF keeps our registry of data providers and uses this registry to find and update data periodically from the original sources to make it available through the GBIF data portal. IPT gives data publishers an easy means of keeping their data up-to-date with GBIF.

IPT can help with the data publishing process in other ways as well:

  • standardizing terms
  • validating records before they get published
  • adding default values for fields that aren’t in the original data

To get a better understanding of the capabilities, take a look at the IPT User Manual.

Why are we using IPT?

VertNet has a long waiting list of organizations (65 to date) that have expressed interest in making their data publicly accessible through VertNet. In the past, these institutions would have needed their own server and specialized software (DiGIR) for publishing to the separate vertebrate networks. We’d rather not require any of these participants to buy servers if they don’t have to. As an interim solution, we’re using the IPT to make data available online while we build VertNet. We have installed, at the University of Kansas Biodiversity Institute, an IPT that can act as a host for as many collections as are interested. The service is shared, yet organizations can maintain their own identity and data securely within this hosted IPT. This is a big win for us at VertNet, because there will be fewer servers to maintain and we can get more collections involved more quickly.

Going forward…

Well before completion, VertNet will support simple and sustainable publishing by uploading records from text files in Simple Darwin Core form. Because of this, the IPT will not be a required component of data publishing for VertNet. Rather, we see IPT as a great tool to facilitate the creation of Darwin Core Archives, which we will be able to use to upload data to VertNet.

Interested in publishing now with IPT?

We currently have two institutions sharing their collections with VertNet and GBIF through the VertNet IPT and we’re in the process of working with several more.

So, if you are or would like to be a vertebrate data publisher and would like to make your data accessible as Darwin Core Archives sooner rather than later, VertNet’s IPT might be the solution for you! Learn more about the process on the VertNet web site or email Laura Russell and Dave Bloom.

Posted by Laura Russell, VertNet Programmer; John Wieczorek, Information Architect; and Aaron Steele, Information Architect

Friday 26 August 2011

Darwin Core Archives for Species Checklists

GBIF has long had an ambition to support the sharing of annotated species checklists through the network. Realising this ambition has been frustrated by the lack of a data exchange standard of sufficient scope and simplicity to promote publication of this type of resource. In 2009, the Darwin Core standard was formally ratified by Biodiversity Information Standards (TDWG). The addition of new terms, and a means of expressing these terms in a simplified and extensible text-based format, paved the way for the development of a data exchange profile for species checklists known as the Global Names Architecture (GNA) Profile. Species checklists published in this format can be zipped into single, portable 'archive' files.
Here I introduce two example archives that illustrate the flexible scope of the format. The first represents a very simple species checklist while the second is a more richly documented taxonomic catalogue. The contents of any file can be viewed by clicking on the file icon or filename. A complete list of terms used in sharing checklists can be found here.
Example 1: U.S. National Arboretum Checklist
This checklist represents the simplest possible checklist archive. It consists of a document that describes the checklist and a second file with the checklist data itself. The checklist data consist of two columns. Note that because the column headers match the standard Darwin Core term names, no additional mapping document is needed.
  • EML.xml: The checklist is documented using an Ecological Metadata Language (EML) document.
  • Checklist.txt: The checklist itself is kept in this simple text file.
Example 2: Catalog of Living Whales
This checklist represents an annotated species checklist. In addition to the core species list ('whales.tab') there are numerous other data types consisting of Darwin Core extensions that conform to the GNA Profile. This more complex archive contains a resource map file ('meta.xml') that describes the files in the archive. An EML metadata document describes the catalog itself. This more complex archive uses a common identifier, taxonID, to link data in the extension files to the data records in the core species checklist ('whales.tab').
  • EML.xml: The checklist is documented using an Ecological Metadata Language (EML) document. It includes a title, contacts, citation information and more.
  • whales.tab: The checklist itself is kept in this tab-delimited file.
  • meta.xml: The files in the archive are described in this resource map file.
  • distribution.tab: Distribution information conforming to the GNA Distribution extension is stored in this file.
  • references.tab: Bibliographic references are stored in this file and linked to 'whales.tab' via the taxonID.
  • types.tab: Type specimen details are contained in this file.
  • vernaculars.tab: Common name information conforming to the GNA Vernacular extension is stored in this file.

Sunday 21 August 2011

Configuring Drupal and some modules for ticketing emails

We at the Secretariat receive enquiries via helpdesk[at]gbif[dot]org, portal[at]gbif[dot]org and info[at]gbif[dot]org every day, or I should say, almost every hour. Some of them are provider-specific questions that need special attention from staff, while others are FAQs. We have been thinking about how to better manage these questions and issues, so that by adding a little bit of structure to the collaborative workflow, we can:

1. Make sure questions are answered satisfactorily;
2. Estimate how many man-hours have been spent, or evaluate performance;
3. Improve the efficiency of helpdesk activities.

To achieve these, we need software that meets these requirements:
1. Case management for incoming emails;
2. A Q&A cycle can be completed solely by using email. Web forms are good but not necessary in the beginning;
3. Easily configured knowledge base articles;
4. Graphical reports showing helpdesk performance;
5. Automatic escalation of case status.

We looked for options in the Open Source Help Desk List. While most of the promising choices are tailored to the software development cycle, some are commercial packages/services that are indeed designed for enterprise help desk needs. While evaluating a few of those packages, I also found that with Drupal and some modules, a solution that meets our needs is pretty much ready out of the box. The result is quite convincing and I can imagine the transition won't require too much learning from my colleagues.

Here is the recipe.

Materials and methods:

1. A mail server. All right I admit this is not something easy if you're not a system administrator. We use Dovecot to provide IMAP access to emails.
2. A Drupal installation. Installation instructions are here. As a wimp I chose version 6.
3. The Support module. Downloadable at http://drupal.org/project/support.
support_deadline, support_fields, support_timer, support_views, support_token, and support_nag are relevant modules that fit our purposes.
4. The CCK module. Downloadable at http://drupal.org/project/cck.
5. The Views module. Downloadable at http://drupal.org/project/views.
6. The Google Chart module. Downloadable at http://drupal.org/project/chart. Not "charts", which is a different module.
7. The Date module. Downloadable at http://drupal.org/project/date.
8. The Admin Menu module, for your administrative pleasure. Downloadable at http://drupal.org/project/admin_menu.
9. The Views Calc module, required by the Support module. Downloadable at http://drupal.org/project/views_calc.
10. Download all necessary modules to [drupalroot]/sites/all/modules directory. Enable them at [baseURL]/admin/build/modules.
11. You should see a "support ticketing system" menu by now. You need to

  1. Add an email client with an email account you set on the dovecot mail server;
  2. You probably want to change the email template at [baseURL]/admin/support/settings/mail;
  3. Go through the general settings of the ticketing system at [baseURL]/admin/support/settings.
12. Send some testing emails to the testing email address.
13. Visit [baseURL]/admin/support/clients/1/fetch and see if the system retrieves email and creates tickets successfully.

14. After more testing emails have been sent to the address and fetched, you can visit [baseURL]/admin/support/charts.

These are just some facets of my explorations so far. Some details are not covered, like permissions in Drupal. A newbie would probably need a crash course in Drupal to start, but after that things will get easier and faster.

Friday 12 August 2011

Using C3P0 with MyBatis

The problem

In our rollover process, which turns our raw harvested data into the interpreted occurrences you can see on our portal, we now have a step that calls a Web Service to turn geographical coordinates into country names. We use this to enrich and validate the incoming data.

This step in our process usually took about three to four hours, but last week it stopped working altogether without any changes to the Web Service or the input data.

We've spent a lot of time trying to find the problem, and while we still can't say for sure what the exact problem is or was, we've found a fix that works for us, which also allows us to make some assumptions about the cause of the failure.

The Web Service is a very simple project called geocode-ws that uses MyBatis to call a PostgreSQL (8.4.2) & PostGIS (1.4.0) database, which does the GISy work of finding matches.

Our process started out fine. The first few million calls to the Web Service were fine and returned reasonably fast, but towards the end the process slowed down until it almost came to a complete stop, with response times of over 10 minutes. That's when our Hadoop maps timed out and failed.

With hindsight we should have come to our final conclusion much earlier but it took us a while.

Looking for the problem

We've looked at the PostgreSQL configuration and tweaked it a lot. We added a lot more logging and we've made sure to log any long running statements (using the log_min_duration_statement option). We also made sure that our memory settings are sensible and that we don't run out of memory. Looking at io- and vmstats as well as our Cacti monitoring we could see that this wasn't the case though. PostgreSQL didn't seem to be the problem.

We also looked at the OS configuration itself as well as the connectivity between our Hadoop cluster and this Tomcat and PostgreSQL server but couldn't find the problem either.

Then we began improving our Web Service and implemented a JMX MBean to get more detailed information about the process. While our changes should have improved the code base they didn't fix the problem. Finally we enabled GC logging on our Tomcat server (something we should have done much earlier and we will probably do by default for our servers in the future). We didn't do it earlier because the Web Service didn't experience any symptoms of memory leak issues before and we didn't change anything there. It hadn't even been restarted in a while.

But as it turned out, the problem was garbage collection. Unfortunately I can't provide pretty graphs because I've overwritten the GC logs, but it was very easy to see (using the awesome GCViewer) a typical pattern of minor collections not reclaiming all space and growing memory usage, up until the point where almost no memory could be reclaimed and most of the time was spent in garbage collection. We had found the problem! This explained our timeouts.

It still doesn't explain what was leaking though. And having spent that much time on it we quickly gave up trying to find the problem. We suspect some kind of combination between the MyBatis Connection Pool, the PostgreSQL JDBC driver and our configuration.

Our workaround (the MyBatis & C3P0 part)

So we looked around for other connection pools to use with MyBatis, but unfortunately we couldn't find a ready-made one. There are implementations in the MyBatis-Guice project, but they can only be used with the annotation-based configuration and we're using XML.

We ended up writing our own implementation of a C3P0 DataSourceFactory and it turned out to be very easy: it is just one class (JavaDoc here) with one line of code in it.
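
The actual class is linked above; a minimal sketch of the idea, with an illustrative package name, looks roughly like this:

package org.example.mybatis; // illustrative package name

import org.apache.ibatis.datasource.unpooled.UnpooledDataSourceFactory;

import com.mchange.v2.c3p0.ComboPooledDataSource;

// A MyBatis DataSourceFactory that hands out a C3P0 pool instead of the built-in one.
public class C3P0DataSourceFactory extends UnpooledDataSourceFactory {

  public C3P0DataSourceFactory() {
    // the one line of code: back the factory with a C3P0 ComboPooledDataSource
    this.dataSource = new ComboPooledDataSource();
  }
}

It can then be referenced from the MyBatis XML configuration with <dataSource type="org.example.mybatis.C3P0DataSourceFactory">; properties defined there are applied to the pool via the factory's inherited setProperties method, so their names need to match the C3P0 setters.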

This not only solved our apparent memory leak, but performance also increased by a factor of two to three. We haven't had a problem with our setup since.

Conclusion

We didn't have the time to find the real problem but we found a solution that works for us. I suspect had we gone about this better we might have found the problem a lot sooner and perhaps identified the real reason for it.
  • Enable GC logging!
  • Enable JMX for Tomcat and set up your applications with useful metrics and logging
  • Even though the use of Profilers is heavily disputed they can often help. We've found YourKit to be excellent
  • Try to follow a logical route, change only one thing at a time, mock things to find a problem
  • Monitor and graph your systems

Wednesday 3 August 2011

Indexing occurrence data - using Lucene technology

The GBIF Occurrence Index collects, stores and parses data gathered from different sources to provide fast and accurate access to biodiversity occurrence data. The purpose of having a GBIF index is to optimize the speed, relevance and performance of the search functionality that will be implemented by the new GBIF portal architecture.

Currently, GBIF provides search functionality in its Data Portal backed by a semi-denormalized relational index database design, which allows finding occurrence information by specifying filters to refine the expected results. That design was envisioned to support the use cases of the current GBIF Data Portal (a web application); for the next generation of the GBIF platform, a new set of requirements must be met, and it is possible that the current index will not be able to support them. The most relevant of those requirements are: scheduling of batch exports, full-text search, real-time faceted search and probably new schemes of data sharing with other biodiversity networks.

For implementing this new Occurrence Index, several technologies are under evaluation, each taken into consideration for specific features that make it an attractive option:
  • PostgreSQL: This relational database contains several features that are worth evaluating: query optimization for JOIN-like queries, a flexible key-value store, partial indices and multicolumn indices.
  • Lucene index: At least four options are available for this implementation: a pure Lucene index, Katta, Apache Solr and ElasticSearch.
  • MySQL: This is the current implementation of the index; an evaluation could help determine if this technology will be able to support the new use cases.
  • Key-value systems: Several schema-less data stores are available: CouchDB, Mongo, PostgreSQL hstore. The main concern about these technologies is their capability to handle a considerable number of records.
This post shows some preliminary results of the evaluation of Lucene-based indices, specifically: Lucene itself, Katta, Apache Solr and ElasticSearch. The analysis keeps two concerns separate (only index creation is covered in this post):
  • Index creation: how the index is created, split (into shards) and merged if necessary.
  • Index use: how the index performs in terms of usability (queries and search patterns), performance (response time) and throughput.

Index Creation

Three scenarios were considered to the index creation phase:
  • Single process - n Indices: In this case a single process creates n shards; the input data are split evenly and an IndexWriter is created for each shard. The case of n = 1 is considered part of this scenario; the # of shards is a parameter defined by the user and is equal to the # of shards expected at the end of the process.
  • N threads - n Indices: The IndexWriter is a thread-safe class, so it can be shared by several threads in order to create a single index. The # of shards is defined by the user and internally used to define the # of IndexWriters.
  • Distributed Index creation: in this case the index is created by splitting the input data into N shards, each shard is assigned to one process that contains a single IndexWriter which is responsible for the index creation.

Single process index creation

The process followed for this scenario is pretty straightforward:
  • The input is a row-delimited file and each column is separated by a special character ('/001' in our case).
  • The # of shards defines the number of IndexWriters (only one IndexWriter can be open on a Lucene index).
  • Each row represents a Lucene document and is stored using one of the available index writers.
  • If multiple indices were created, at the final step those are merged into a single index (using "IndexWriter.addIndexesNoOptimize")
This process was tested using a file of 100 million records. The entire process took 9,200,821 milliseconds (about 153.3 minutes, or roughly 2.6 hours) to finish. Some optimizations were implemented for this process; it is worth mentioning that the same sort of optimizations were applied for the multithreaded scenario:
  • Re-use the IndexWriter in multiple threads
  • Re-use the org.apache.lucene.document.Document and org.apache.lucene.document.Field instances. The Lucene fields are created in a static block and their values are changed for each new Document, then the document is added to the index. The intention is to avoid these objects being repeatedly created and garbage collected.
      //Initialization
      static {
        for (int i = 0; i < accFieldsValues.length; i++) {
          fields[i] = new Field(accFieldsValues[i].name(), "", Store.YES, Index.ANALYZED);
        }
      }
     ...
     //Sets the field value
     fields[fieldsCount].setValue(stringTokenizer.nextToken());
     ...
     //Adds the same document instance with different values
     indexWriter.addDocument(doc);
    
  • The compound file format is turned off; this reduces the number of files opened at the same time.
  • IndexWriter.autocommit is set to false: since the index doesn't provide searching during creation time, this feature can be disabled.
  • The flush is done by RAM, and the RAM usage is maximized:
     
    LogByteSizeMergePolicy logByteSizeMergePolicy = new LogByteSizeMergePolicy();
    logByteSizeMergePolicy.setMergeFactor(mergeFactor);
    ...
    indexWriterConfig.setRAMBufferSizeMB(bufferSize);
    
  • Every N documents an entry is written to a log in order to report the overall progress.
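
Putting these settings together, a shard writer might be configured roughly like this (Lucene 3.x API; the version constant, analyzer and tuning values are illustrative):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ShardWriterFactory {

  // Creates an IndexWriter for one shard, tuned as described above.
  public static IndexWriter create(File shardDir, int mergeFactor, double ramBufferSizeMB) throws Exception {
    LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
    mergePolicy.setMergeFactor(mergeFactor);
    mergePolicy.setUseCompoundFile(false);      // compound file format turned off, as described above

    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
    config.setMergePolicy(mergePolicy);
    config.setRAMBufferSizeMB(ramBufferSizeMB); // flush by RAM usage rather than document count

    return new IndexWriter(FSDirectory.open(shardDir), config);
  }
}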

Multithreaded Index Creation

In terms of the optimizations applied this scenario is very similar to the "Single process" scenario. However, the process is very different in terms of steps and the resulting index:
  • The input is a row delimited file and each column is separated by a special character ('/001' in our case).
  • The number of rows of the input file is known and is passed as input parameter.
  • The input file is split evenly into intermediate files; each file is assigned to a thread which will read it to create a Lucene index.
  • The intermediate files are deleted after each index is created.
  • Depending on the number of shards desired, the indices are merged into a smaller set of indices.
This process was run using a pool of 50 threads and an input file with 100 million rows. The execution time is detailed in the next table:
Phase                                                              Time
Slicing (split the input file and distribute it in the threads)   1,045,948 ms (17.43 minutes)
Indices creation                                                   6,890,988 ms (114.85 minutes)
Total time                                                         132.28 minutes (about 2.2 hours)

Katta for Index Creation

Katta is a distributed store for indices; it currently supports two types of indices: Lucene and Hadoop MapFiles. It uses ZooKeeper to coordinate index creation, replication and searching across the nodes.
Main relevant features
  • A Katta index is basically a folder containing sub-folders (shards)
  • The client-node communication is implemented using HadoopRPC
  • Supports distributed scoring; each search query requires two network roundtrips: first get the document frequencies from all shards, then perform the query.
  • Provides functionality to merge indices (though this is not a very complex task to implement using standard Lucene libraries)
Relevant issues found
  • Documentation is not extensive and lacks necessary detail.
  • Small community and low development activity: the last commit was made on 2009-04-02.
  • Doesn't provide any help to create the indices; index sharding must be done prior to importing them into Katta.
Test environment configuration
  • A cluster of 4 Katta nodes in 2 servers was used.
  • The index was split in 8 shards.
  • The master configuration is replicated in each node using passphraseless ssh access between master and nodes.
  • The ZooKeeper server was embedded into the Katta master node.
     katta.zk.properties (file) ==>zookeeper.embedded=true 
         
  • Each node contains 2 shards, each shard is replicated in 2 nodes.
  • The sharded Lucene index contains 100 million documents and was stored in the Hadoop distributed file system.
Index creation
Since Katta doesn't provide any functionality to create a Lucene index from scratch, the index was built using a multithreaded application and the shards were copied into the Hadoop DFS. Then the sharded index was imported into Katta using the command line:
bin/katta addIndex occurrence hdfs://namenode:port/occurrence/shardedindex/ 2
("2" means a replication factor of 2). Importing an index into Katta is just a matter of copying the files from Hadoop and updating the index status in the ZooKeeper server, so index creation is external to Katta. In a future post the "Distributed Index Creation" scenario will be analyzed, as well as ElasticSearch and Solr for index creation...

Monday 25 July 2011

Customizing the IPT

One of my responsibilities as the Biodiversity Informatics Manager for Canadensys is to develop a data portal giving access to all the biodiversity information published by the participants of our network. A huge portion of this task can now be done with the GBIF Integrated Publishing Toolkit version 2 or IPT. The IPT allows you to host biodiversity resources, manage their data and metadata, and register them with GBIF so they can appear on the GBIF data portal - all targets we want to achieve. Best of all, most management can be done by the collection managers themselves.

I have tested the IPT thoroughly and I am convinced the GBIF development team has done an excellent job creating a stable tool I can trust. This post explains how I have customized our IPT installation to integrate it with our other Canadensys websites.


Background

Our Canadensys community portal is powered by WordPress (MySQL, PHP), while our data portal - which before the IPT installation only consisted of the Database of Vascular Plants of Canada (VASCAN) - is a Tomcat application. We are using different technologies because we want to use the most appropriate technology for each website. WordPress (or Drupal for that matter) is an excellent and easy-to-use CMS, perfect for our community portal, but not suitable for a custom-made checklist website like VASCAN. To the user however, both websites look the same:


We do this by using the same HTML markup and CSS for both websites. If you want to learn HTML and CSS, w3schools provides excellent tutorials.

The HTML markup defines elements on a page (e.g. header, menu, content, sidebar, footer) and the CSS stylizes those elements (e.g. their position and color). The CSS is typically stored as one file (e.g. style.css) which is referenced in the <head> section of a page. For dynamic websites, the HTML is typically stored as different files, one for each section of a page (e.g. header.php, sidebar.php). Those files are combined as one page by the server if a page is requested. That way, changing a common element on all pages of a website (e.g. the header) can be done by changing just one file.

All of this also applies to the IPT. Here's what the IPT looks like without CSS:


Attempt 1 - Editing the CSS and logo

My first attempt at customizing the IPT was at the Experts Workshop in Copenhagen, by changing the CSS and logo only, which you can find in the /styles folder of your IPT installation:

/styles/main.css
/styles/logo.jpg

In 15 minutes, my IPT was Canadensys red and had a custom logo:


Attempt 2 - Editing the FreeMarker files

Even though my IPT now had its own branding, it was still noticeably different from the other Canadensys websites. The only way I could change that was by editing the HTML as well. Luckily, the sections I wanted to change were all stored as FreeMarker files in the /inc folder:

/WEB-INF/pages/inc/header.ftl - the <head> section
/WEB-INF/pages/inc/menu.ftl - the header, menu and sidebar
/WEB-INF/pages/inc/footer.ftl - the footer
/WEB-INF/pages/inc/header_setup.ftl - the header during installation

I incorporated the HTML structure I use for the VASCAN website into menu.ftl (including the header, menu, container and sidebar), making sure I did not break any of the IPT functionality.

I started doing the same with main.css by replacing chunks of now unused IPT CSS with CSS I copied over from VASCAN, but I quickly realized that this wasn't the best option. Doing so would result in 2 CSS files: one for VASCAN and one for IPT, even though both web applications are under the same domain name with a lot of shared CSS. It would be easier if I only had to maintain a single stylesheet, used by both applications.

Attempt 3 - One styles folder for the data portal

I created a /common/styles folder under ROOT, where I placed my single common data portal stylesheet: /common/styles/common.css. This would be the CSS file I could use for IPT and VASCAN. I did the same for my favicon: /common/images/favicon.png.

I added a reference to both files in the header.ftl of my IPT (and VASCAN):

<link rel="stylesheet" type="text/css" href="${baseURL}/styles/main.css">
<link rel="stylesheet" type="text/css" href="http://data.canadensys.net/common/styles/common.css">
<link rel="shortcut icon" href="http://data.canadensys.net/common/images/favicon.png">

As you can see on the first line, I kept the reference to the default IPT stylesheet: ${baseURL}/styles/main.css (it's perfectly fine to reference more than one CSS file). This is where I would keep all the unaltered (=default) IPT CSS. In fact, I'm not removing anything from the default IPT stylesheet, I'm only commenting out the CSS that is unused or conflicting:

/* Unused or conflicting CSS */

The advantage of doing so is that I can now easily compare this commented file with the stylesheet of any new IPT version.

After I had done everything, my IPT now looked like this:


My IPT is now sporting the Canadensys header, footer and sidebar (only visible when editing a resource), making it indistinguishable from the other Canadensys websites. It is also using a more readable font-size (13.5px) and a fluid width.

Closing remarks

I have (re)designed quite a lot of websites, and very often I have been so frustrated with the HTML and CSS that I just started over from scratch. I didn't have that option here and it wasn't necessary either. I would like to thank the GBIF development team for creating such an easily customizable tool, with logical HTML and CSS. As a reminder, the whole customization has been done by editing only 5 files (links show default files):

/styles/main.css (custom file)
/WEB-INF/pages/inc/header.ftl
/WEB-INF/pages/inc/menu.ftl
/WEB-INF/pages/inc/footer.ftl
/WEB-INF/pages/inc/header_setup.ftl

Important: Remember that installing a new IPT version will overwrite all the customized files, so make sure to back them up! I will try to figure out a way to reapply my customization automatically after an update and post about that experience in a follow-up post. In the meantime, I hope that this post will help others in the customization of their IPT.

Monday 18 July 2011

Working with Scientific Names

Dealing with scientific names is an important, regular part of our work at GBIF. Scientific names are highly structured strings with a syntax governed by a nomenclatural code. Unfortunately there are different codes for botany, zoology, bacteria, viruses and even cultivar names. When dealing with a name we often do not know which code or classification it belongs to, so we need a representation that is as code-agnostic as possible. GBIF came up with a structured representation which is a compromise focusing on the most common names, primarily botanical and zoological names, which are quite similar in their basic form.

The ParsedName class

Our ParsedName class provides us with the following core properties:
 genusOrAbove
infraGeneric
specificEpithet
rankMarker
infraSpecificEpithet
authorship
year
bracketAuthorship
bracketYear
These allow us to represent regular names properly. For example Agalinis purpurea var. borealis (Berg.) Peterson 1987 is represented as
 genusOrAbove=Agalinis
specificEpithet=purpurea
rankMarker=var.
infraSpecificEpithet=borealis
authorship=Peterson
year=1987
bracketAuthorship=Berg.
or the botanical section Maxillaria sect. Multiflorae Christenson as
 genusOrAbove=Maxillaria
infraGeneric=Multiflorae
rankMarker=sect.
authorship=Christenson
Especially in botany you often encounter names with authorships for both the species and some infraspecific rank, or names citing more than one infraspecific rank. These names are not formed based on rules or recommendations from the respective codes and we ignore those superfluous parts. For example Agalinis purpurea (L.) Briton var. borealis (Berg.) Peterson 1987 is represented exactly the same as Agalinis purpurea var. borealis above. In the case of 4-part names like Senecio fuchsii C.C.Gmel. subsp. fuchsii var. expansus (Boiss. & Heldr.) Hayek only the lowest infraspecific rank is preserved:
 genusOrAbove=Senecio
specificEpithet=fuchsii
rankMarker=var.
infraSpecificEpithet=expansus
authorship=Hayek
bracketAuthorship=Boiss. & Heldr.
Hybrid names are evil. They come in two flavors, named hybrids and hybrid formulas.
Named hybrids are not so bad and simply prefix a name part with the multiplication sign ×, the hybrid marker, or prefix the rank marker of infraspecific names with notho. Strictly this symbol is not part of the genus or epithet. To represent these notho taxa our ParsedName class contains a property called nothoRank that keeps the rank or part of the name that needs to be marked with the hybrid sign. For example the named hybrid Pyrocrataegus ×willei L.L.Daniel is represented as
 genusOrAbove=Pyrocrataegus
specificEpithet=willei
authorship=L.L.Daniel
nothoRank=species
Hybrid formulas such as Agrostis stolonifera L. × Polypogon monspeliensis (L.) Desf., Asplenium rhizophyllum × ruta-muraria or Mentha aquatica L. × M. arvensis L. × M. spicata L. cannot be represented by our class. Hybrid formulas can in theory combine any number of names or name parts, so it is hard to deal with them. Luckily they are not very common and we can afford to live with a complete string representation in those cases. Yet another "extension" to the botanical code are cultivar names, i.e. names for plants in horticulture. Cultivar names are regular botanical names followed by a cultivar name, usually in English, given in single quotes. For example Cryptomeria japonica 'Elegans'. To keep track of this we have an additional cultivar property, so that:
 genusOrAbove=Cryptomeria
specificEpithet=japonica
cultivar=Elegans
In taxonomic works you often have additional information in a name that details the taxonomic concept, the sec reference, most often prefixed by sensu or sec. For example Achillea millefolium sec. Greuter 2009 or Achillea millefolium sensu lato. In nomenclatural works one frequently encounters nomenclatural notes about the name such as nom. illeg. or nomen nudum. Both pieces of information are held in our ParsedName class, for example Solanum bifidum Vell. ex Dunal, nomen nudum becomes
 genusOrAbove=Solanum
specificEpithet=bifidum
authorship=Vell. ex Dunal
nomStatus=nomen nudum

Reconstructing name strings

The ParsedName class provides us with some common methods to build a name string. In many cases you don't want the complete name with all its details, so we offer some popular name string types out of the box and a flexible string builder that lets you explicitly say which parts you want to include. The most important methods are:
  • canonicalName(): builds the canonical name sensu stricto, with nothing but the three name parts at most (genus, species, infraspecific). No rank, hybrid markers or authorship information are included.
  • fullName(): builds the full name with all details that exist.
  • canonicalSpeciesName(): builds the canonical binomial in the case of species or below, ignoring infraspecific information.

The name parser

We decided at GBIF that sharing the complete name string is more reliable than trusting already parsed names. But parsing names by hand is a very tedious enterprise, so we needed to develop a parser that can handle the vast majority of the names we encounter. After a short experimental phase with BNF and other grammars to automatically build a parser, we went back to the start and built something based on good old regular expressions and plain Java code. The parser has now evolved for nearly two years and it might be the best unit-tested class we have ever written at GBIF. It is interesting to take a look at the range of names we use for testing, and at the tests themselves, to make sure it works as expected.

Parsing names

Using the NameParser in code is trivial. Once you create a parser instance, all you need to do is call the parser.parse(String name) method to get your ParsedName object. As authorships are the hardest, i.e. most variable, part of a name, we have actually implemented two parsers internally: one that tries to parse the complete string and a fallback one that ignores authorships and only extracts the canonical name. The authorsParsed flag on a ParsedName instance tells you whether the simpler fallback parser has been used. If a name cannot be parsed at all, an UnparsableException is thrown. This is also the case for viral names and hybrid formulas, as the ParsedName class cannot represent these names. The exception itself has an enumerated property that you can use to find out whether it was caused by a virus name, a hybrid formula or some other name. As of today, of the 10,114,724 unique name strings we have indexed only 116,000 couldn't be parsed, and these are mostly hybrid formulas.
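A minimal sketch of that flow in code; the package names are again assumptions, and the exact way the authorsParsed flag is exposed (field vs. getter) may differ:

    import org.gbif.ecat.model.ParsedName;
    import org.gbif.ecat.parser.NameParser;
    import org.gbif.ecat.parser.UnparsableException;

    public class ParseDemo {
      public static void main(String[] args) {
        NameParser parser = new NameParser();
        String name = "Senecio fuchsii C.C.Gmel. subsp. fuchsii var. expansus (Boiss. & Heldr.) Hayek";
        try {
          ParsedName pn = parser.parse(name);
          if (!pn.authorsParsed) {
            // the fallback parser was used: only the canonical name could be extracted
            System.out.println("Authorship ignored for: " + name);
          }
          System.out.println(pn.canonicalName());
        } catch (UnparsableException e) {
          // thrown for viral names, hybrid formulas and other unparsable strings;
          // the exception carries an enumerated type telling you which kind it was
          System.err.println("Could not parse: " + name);
        }
      }
    }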

Normalisation

Apart from the parse method, the name parser also exposes a normalisation method to normalise any whitespace, commas, brackets and hybrid markers found in name strings. The parser uses this method internally before the actual parsing takes place. The string is trimmed, only single whitespace is allowed, and spaces before commas are removed while a space after a comma is enforced. Similarly, whitespace before opening brackets is added but removed inside them. Instead of the proper multiplication sign for hybrids, often a plain letter x followed by whitespace is used, which this method also replaces.
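For example, continuing the snippet above and assuming the method is exposed as normalize(String) on the parser (the exact method name is an assumption):

    // messy input: a double space, a space inside the bracket, a plain "x " as hybrid marker
    String messy = "Agrostis  stolonifera L. x Polypogon monspeliensis ( L.) Desf.";

    // the normalised form should collapse the whitespace, clean up the bracket
    // and replace the plain "x " with the proper hybrid marker
    String clean = parser.normalize(messy);
    System.out.println(clean);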

Parsing Webservices

GBIF offers a free webservice API to parse names using our name parser. It returns the parsed results as JSON and accepts single names as well as batches of names. For larger input data you have to use a POST request (GET requests are restricted in length), but for a few names a simple GET request with the names URL-encoded in the parameter "names" is also accepted. Multiple names can be concatenated with the pipe | symbol. To parse the two names Symphoricarpos albus (L.) S.F.Blake cv. 'Turesson' and Stagonospora polyspora M.T. Lucas & Sousa da Camara 1934 the parser service call looks like this:

http://ecat-dev.gbif.org/ws/parser?names=Symphoricarpos%20albus%20(L.)%20S.F.Blake%20cv.%20'Turesson'|Stagonospora%20polyspora%20M.T.%20Lucas%20%26%20Sousa%20da%20Camara%201934

For manual use we also provide a simple web client to this service, with a form to enter names to be parsed; it also accepts uploaded files with one name per line. It is available as part of our tools collection at http://tools.gbif.org/nameparser/.
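As a rough sketch, the same GET request can be issued from Java with nothing but the JDK; only the URL and the "names" parameter described above are used here:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class ParserWsDemo {
      public static void main(String[] args) throws Exception {
        // two names joined with the pipe symbol, then URL-encoded into the "names" parameter
        // (URLEncoder encodes spaces as "+", which is fine for query parameters)
        String names = "Symphoricarpos albus (L.) S.F.Blake cv. 'Turesson'"
            + "|Stagonospora polyspora M.T. Lucas & Sousa da Camara 1934";
        String url = "http://ecat-dev.gbif.org/ws/parser?names=" + URLEncoder.encode(names, "UTF-8");

        // simple GET request; the service answers with the parsed names as JSON
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line);
        }
        in.close();
      }
    }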

Source Code

All the code is part of a small Java library that we call ecat-common. It is freely available under Apache 2 licensing, as is most of our GBIF work, and you are invited to browse the code at http://code.google.com/p/gbif-ecat/source/browse/trunk/ecat-common/, download the latest jar from our Maven repository or include it in your Maven dependencies like this:
<repositories>
  <repository>
    <id>gbif-all</id>
    <url>http://repository.gbif.org/content/groups/gbif</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.gbif</groupId>
    <artifactId>ecat-common</artifactId>
    <version>1.5.1-SNAPSHOT</version>
  </dependency>
</dependencies>

Friday 8 July 2011

Are you the keymaster?


As I mentioned previously, I'm starting work on evaluating HBase for our occurrence record needs. In the last little while that has meant coming up with a key structure and/or schema that optimizes reads for one major use case of the GBIF data portal: a user request to download an entire record set, including raw records as well as interpreted ones. The most common form of this request looks like "Give me all records for <taxon>", e.g. "Give me all records for Family Felidae".

So far I'm concentrating more on the lookup and retrieval than on writing or data storage optimization, so the schema I'm using is two column families, one for verbatim columns and one for interpreted (for a total of about 70 columns). The question we need to figure out is which key to use for HTable's single indexed column, the row key. For all these examples we assume we know the backbone taxonomy id of the taxon concept in question (i.e. Family Felidae is id 123456).

Option 1
Key: native record's unique id

Query style: The simplest way of finding all records that belong to Family Felidae is to scan all of them and check each one against the Family column from the interpreted column family. The code looks like this:

    HTable table = new HTable(HBaseConfiguration.create(), tableName);
    byte[] cf = Bytes.toBytes(colFam);
    byte[] colName = Bytes.toBytes(col);
    byte[] value = Bytes.toBytes(val);

    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);

    for (Result result : scanner) {
      byte[] testVal = result.getValue(cf, colName);
      // the column may be missing entirely, so guard against null
      if (testVal != null && Bytes.compareTo(testVal, value) == 0) {
        doSomething(result); // placeholder for processing the matching record
      }
    }
    scanner.close();

Because this means transferring all columns of every row to the client before checking if it's even a record we want, it's incredibly wasteful and therefore very slow.  It's a Bad Idea.

Option 2
Key: native record's unique id

Query style: HBase provides a SingleColumnValueFilter that executes our equality check on the server side, thereby saving the transfer of unwanted columns to the client.  Here's the code:

    HTable table = new HTable(HBaseConfiguration.create(), tableName);
    byte[] cf = Bytes.toBytes(colFam);
    byte[] colName = Bytes.toBytes(col);
    byte[] value = Bytes.toBytes(val);

    SingleColumnValueFilter valFilter = new SingleColumnValueFilter(cf, colName, CompareFilter.CompareOp.EQUAL, value);
    valFilter.setFilterIfMissing(true);

    Scan scan = new Scan();
    scan.setFilter(valFilter);
    ResultScanner scanner = table.getScanner(scan);


This is about as good as it gets until we start getting clever :)

Option 3
Key: concatenation of nub-taxonomy "left" with native record's unique id

Query style: We know that a taxonomy is a tree, and our backbone taxonomy is a well-behaved (i.e. true) tree. We can use nested sets to make our "get all children of node x" query much faster, which Markus realized some time ago, and so thoughtfully included the left and right calculation as part of the backbone taxonomy creation. Individual occurrences of the same taxon share the same backbone taxonomy id, as well as the same left and right. One property of nested sets not mentioned in the Wikipedia article is that when the records are ordered by their lefts, the query "give me all records where left is between parent left and parent right" becomes "give me all rows starting with parent left and ending with parent right", which in HBase terms is much more efficient since we're doing a sequential read from disk without any seeking. So we build the key as leftId_uniqueId, and query as follows (note that startRow is inclusive and stopRow is exclusive, and we want exclusive on both ends):


    HTable table = new HTable(HBaseConfiguration.create(), tableName);
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes((left + 1) + "_"));
    scan.setStopRow(Bytes.toBytes(right + "_"));

    ResultScanner scanner = table.getScanner(scan);

That looks pretty good, and is in fact about 40% faster than Option 2 (on average; it depends on the size of the query result). But on closer inspection, there's a problem. By concatenating the left and unique ids with an underscore as separator, we've created a String, and now HBase is doing its usual lexicographical ordering, which means our rows aren't ordered as we'd hoped. For example, this is the ordering we expect:

1_1234
2_3458
3_3298
4_9378
5_3435
10_5439
100_9763

but because these are strings, HBase orders them as:

1_1234
10_5439
100_9763
2_3458
3_3298
4_9378
5_3435

There isn't much we can do here but filter on the client side.  For every key, we can extract the left portion, convert to a Long, and compare it to our range, discarding those that don't match.  It sounds ugly, and it is, but it doesn't add anything appreciable to the processing time, so it would work.
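Here's a rough sketch of that client-side filter, applied to the rows coming back from the scan (same fragment style as the snippets above; left and right are the parent's nested set values):

    // keep only rows whose numeric left falls strictly between the parent's left and right
    for (Result result : scanner) {
      String rowKey = Bytes.toString(result.getRow());
      long rowLeft = Long.parseLong(rowKey.substring(0, rowKey.indexOf('_')));
      if (rowLeft > left && rowLeft < right) {
        doSomething(result); // process the matching record
      }
    }
    scanner.close();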

Except that there's a more fundamental problem - if we embed the left in our primary key, it only takes one node added to the backbone taxonomy to force an update in half of all the lefts (on average) which means all of our primary keys get rewritten.  At 300 million records and growing, that's not an option.

Option 4
Key: native record's unique id
Secondary index: left to list of unique ids

Query style: Following on from Option 3, we can build a second table that serves as a secondary index. We use the left as a numeric key (which gives us automatic, correct ordering) and write each corresponding unique occurrence id as a new column in the row. Then we can do a proper range query on the lefts, and generate a distinct Get for each distinct id. Unfortunately building that index is quite slow; it is still running as I write this, so I haven't been able to test the lookups yet.
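Once the index has finished building, I expect the lookups to look roughly like the sketch below. The index table name and its column family are made up for the example; the layout is as described above, with the left as a numeric row key and one column per occurrence id:

    // range scan on the secondary index, exclusive on both ends of (left, right)
    HTable index = new HTable(HBaseConfiguration.create(), "occurrence_left_idx"); // table name assumed
    HTable occurrences = new HTable(HBaseConfiguration.create(), tableName);

    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(left + 1));
    scan.setStopRow(Bytes.toBytes(right));
    ResultScanner scanner = index.getScanner(scan);

    for (Result indexRow : scanner) {
      // every column qualifier in the index row is one occurrence id
      for (byte[] occurrenceId : indexRow.getFamilyMap(Bytes.toBytes("ids")).keySet()) { // family name assumed
        Result record = occurrences.get(new Get(occurrenceId));
        doSomething(record); // process the full record
      }
    }
    scanner.close();

Since HBase compares row keys as unsigned bytes, the big-endian encoding of the numeric lefts sorts correctly as long as the values are non-negative, which is what gives us the "automatic, correct ordering" mentioned above.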

For those keeping score at home, I'm using HBase 0.89 (from CDH3b4), which doesn't have built-in secondary indexes (which 0.19 and 0.20 did).

I'll write more when I've learned more, and welcome any tips or suggestions you might have to aid in my quest!