Friday 21 October 2011

Integration tests with DBUnit

Database driven JUnit tests
As part of our migration to a solid, general testing framework we are now using DbUnit for database integration tests of our database service layer with JUnit (on top of liquibase for the DDL).

Creating a DbUnit test file
As it can be painful to maintain a relational test dataset with many tables, I've decided to dump a small, existing Postgres database into the DbUnit XML structure, namely FlatXML. It turned out to be less simple as I had hoped for.

First I've created a simple exporter script in Java that dumps the entire DB into XML. Simple.

The first problem I've stumbled across was a column named "order" which caused a SQL exception. It turns out DbUnit needs to be configured for specific databases, so I've ended up using three configurations to both dump and read the files.
  1. Use Postgres specific types
  2. Double quote column and table names
  3. Enable case sensitive table & column names (now that we use quoted names, Postgres becomes case sensitive)
After that, reading in the DbUnit test file started out fine, but reached a weird NullPointerException that left me puzzled. After trying various settings I finally found a log warning that some columns might not be detected properly by DbUnit, as it only inspects the first XML record by default which can contain many null columns which subsequently then will be ignored. Luckily since version 4.3.7 of dbunit you can tell the builder that reads in the test files to scan all records first in memory, a feature know as column sensing. That got me a long way, but ultimately I've hit a much harder issue. Relational integrity.

The classic way to avoid integrity checks during inserts (including DbUnit) is simply to temporarily disable all foreign key constraints. On some databases this is simple. For example in MySQL you can simply execute SET FOREIGN_KEY_CHECKS=0 in your db connection. In H2 there is an equivalent of SET REFERENTIAL_INTEGRITY FALSE. Unfortunately there is nothing like that in PostgreSQL. You will have to disable all constraints individually and then painfully recreate them. In our case these were nearly a hundred constraints and I didn't want to go down that route.

The latest DBunit comes with a nice DatabaseSequenceFilter to automatically sort the tables being dumped in an order that respects the constraints. That worked very well for all constraints across tables, but of course failed to sort the individual records in tables which contain a self reference, for example the taxonomy table which has an adjacency list via parent_fk. Luckily I had only one table like this and that included already some nested sets indices (lft,rgt) that allowed me to sort the records in a parent first order. For this I had to issue a custom SQL query though, so I ended up dumping the entire database with all tables using the filter and in addition to export only one table with a custom sql that I then had to manually copy into the complete xml dump file. Voila, finally a working DbUnit test file!

DatabaseDrivenTest for MyBatis Services
In order to load the test data into a test DB for every JUnit test we decided to use a JUnit Rule definition that is executed before each test. The class responsible for most of the magic is DatabaseDrivenTestRule which is parameterized for the specific MyBatis Service to be tested. It is generic and can be used with any database system. The subclass DatabaseDrivenChecklistBankTestRule<T> then adds the database specific configurations and can be used as a Rule within the individual tests.

A simple & clean integration test example does look like this now:
public class ReferenceServiceMyBatisTest {

  public DatabaseDrivenChecklistBankTest ddt = new DatabaseDrivenChecklistBankTest(ReferenceService.class, "squirrels-full.xml");

  public void testGet() {
    Reference ref = ddt.getService().get(37);
    assertEquals("Wilson, D. E. ; Reeder, D. M. Mammal Species of the World", ref.getCitation());
    assertEquals(100000025, ref.getUsageKey());

Isn't that gorgeous? We only need to pass the dbunit test file and the service class to be tested to the JUnit Rule and then only need to bother with testing the service results! No additional setting up or tearing down is needed.

Thursday 20 October 2011

GBIF Portal: Geographic interpretations

The new portal processing is about to go into production, and during testing I was drawing some metrics on the revised geographic interpretation.  It is a simple issue, but many records have coordinates that contradict the country that the record claims to be in.  Some illustrations of this were previously shared by Oliver.

The challenge of this is two fold.  Firstly we see many variations in the country name which needs to be interpreted.  Some examples for Argentina are given (there are 100s of variations per country):

  • Argent.
  • Argentina
  • Argentiana
  • N Argentina
  • N. Argentina
  • etc etc
We have abstracted the parsing code into a separate Java library which makes use of basic algorithms and dictionary files to help interpret the results.  This library might be useful for other tools requiring similar interpretation, or data cleaning efforts, and will be maintained over time as it will be in use in several GBIF tools.

The second challenge is that we need to determine if the point falls within the country.  There is always room for improvement in this area, such as understanding changes over time, but due to the huge volume of outliers when using the raw data a check like this is required.  Our implementation is a very basic reverse georeferencing RESTful web service that takes a latitude and longitude, and returns the proposed country and some basic information such as the title.  Operating the service requires PostGIS and a Java server like Apache Tomcat.  Currently we make use of freely available terrestrial shapefiles, and marine economic exclusion zones.  It would be trivial to expand the service to use more shapefiles for other uses, and is expected to happen over time.  Currently the GBIF service is an internal only processing service, but is expected to be released for public use in the coming months.

Improving the country name interpretation and making use of a more accurate geospatial verification service than previously will help improve data reporting at the national level using the GBIF portal as indicated here.

# Records # Georeferenced
Argentina Previously 665,284 284,012
Now 680,344 303,889
United States Previously 79,432,986 68,900,415
Now 81,483,086 70,588,182

Friday 7 October 2011

Group synergy

During the last few weeks we have been intensively designing and implementing what would come to be the new data portal. Oliver described nicely the new stage our team has entered in his last blog post Portal v2 - There will be cake. As my personal opinion, I think this has been truly a group experience as we have decided to change our paradigm of working. Normally we would have worked on different components each one of us and later try to integrate everything, but now we took the approach of just focusing on one subcomponent, all of us, and driving our efforts into it. From my point of view, the main advantage of this is that we avoid the Bus Factor element, that we as a small group of developers, are quite exposed to. Communication has increased among our team as we are all on the same page now.

As a general overview, the portal v2 will consist of different subcomponents (or sub-projects) that would need to interact between them to come up with a consolidated "view" for the user. Currently we have 3 different sub-projects on our tray, Checklist Bank, Registry, and Occurrence Store and our plan will be to have an API (exposed through web services) which will offer all data necessary (from these projects) for the portal to consume. The portal will then need to make use of a simple webservice client to communicate with this API.

Currently we have been working on the Checklist Bank sub-project. As Oliver pointed out in his previous post, some members of our team are more familiarized with certain sub-projects, and the checklist one does not escape from this reality. So for many, including me, it has been a learning experience. We have started development following very strict guidelines on API design and code conventions (which we document internally for our use). Even decisions that are sometimes taken in seconds by a single developer, are placed under group scrutiny so we are all on the same track. We have taken the commitment to apply the best coding practices.

Specifically on the checklist sub-project, we have come up with a preliminary API.  Please note this API won't be exposed to the public as it is. It is subject to change as we try to refine it. It is just nice to show to the outside world what we have been working on.

I personally think we are in exciting times inside GBIF and that the final product of all this effort would be a great tool that would benefit the community in big ways. Expect more from us!