Monday 5 October 2009

Struts 2.1.6 and GUICE

Google Guice is a great lightweight dependency injection framework that comes with a plugin for Struts 2. Using Guice 2.0 and Struts 2.1.6 on Tomcat 6, I ran into a dependency problem: Struts and the Struts 2 plugin from Guice pull in different xwork and ognl jars:
[INFO] +- org.apache.struts:struts2-core:jar:2.1.6:compile
[INFO] |  +- com.opensymphony:xwork:jar:2.1.2:compile
[INFO] |  |  \- org.springframework:spring-test:jar:2.5.6:test (scope managed from compile)
[INFO] |  +- opensymphony:ognl:jar:2.6.11:compile
[INFO] |  \- commons-fileupload:commons-fileupload:jar:1.2.1:compile
[INFO] +-
[INFO] |  \- opensymphony:xwork:jar:2.0.0:compile
[INFO] |     \- ognl:ognl:jar:2.6.9:compile

This can easily be resolved by adding a Maven exclusion for the older OGNL library (ognl:ognl:2.6.9) to the Guice plugin dependency in your POM.
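Independently of the POM change, it can help to confirm which of the conflicting jars actually wins at runtime. The following is a minimal sketch, not part of Guice or Struts: the class name JarLocator and its messages are made up for illustration, and ognl.Ognl is only used as a default class to look up. It asks the class loader where a class was loaded from.

```java
import java.security.CodeSource;

class JarLocator {

  // Returns the location (usually a jar URL) a class was loaded from.
  static String locate(String className) {
    try {
      // 'false' avoids running static initialisers of the inspected class
      Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
      CodeSource source = type.getProtectionDomain().getCodeSource();
      if (source == null || source.getLocation() == null) {
        // JDK classes loaded by the bootstrap loader have no code source
        return "(bootstrap or unknown source)";
      }
      return source.getLocation().toString();
    } catch (ClassNotFoundException e) {
      return "(class not found)";
    }
  }

  public static void main(String[] args) {
    // Pass a class name as argument, or fall back to the OGNL entry class
    String name = args.length > 0 ? args[0] : "ognl.Ognl";
    System.out.println(name + " -> " + locate(name));
  }
}
```

Inside a web application the result depends on the class loader of the webapp, so the locate call would have to run from within the deployed application (for example from an action) rather than from a standalone main.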

Friday 17 July 2009

Darwin Core Archive Reader part1

After writing software that produces Darwin Core Archives (DwC-A), I thought it was time to introduce a little client library that can read DwC-A and makes them very simple to consume. To create DwC-A there are a couple of resources available:
  1. The Darwin Core Text Guidelines, pretty much the specification
  2. The ECAT Checklist Format gives some best practices and focuses on how to encode taxonomic data as DwC-A
  3. The IPT produces DwC-A and has a rich web interface but is limited to ~1 million records
  4. The IPT-lite produces large DwC-A datasets very fast, but lacks any visualisation of the data itself. It's a great way of quickly creating DwC-A and hosting it online
The DwC Archive reader library is under active development and currently being used for indexing checklists. The project is hosted as part of the GBIF Indexing and Harvesting Toolkit, where a jar with all dependencies is hosted for download too. You can use this jar in the terminal to inspect archives like this:
$ java -jar DwcA-reader-1.0-SNAPSHOT.jar hershkovitz.txt

Opening archive: /Users/markus/Desktop/hershkovitz/hershkovitz.txt
Core file(s) found: [/Users/markus/Desktop/hershkovitz/hershkovitz.txt]
Core row type:
Core identifier column: 0
Cannot locate term dwc:kingdom
Cannot locate term dwc:family
Number of extensions 0
Archive contains 3249 core records.
The reader can handle a simple CSV or tab-delimited file, or you can point it to a DwC archive folder with a meta.xml descriptor and several data files. The iterators allow you to walk the core records of the archive while conveniently retrieving all extension records at the same time. You can also tell it to show only some core records by passing a limit and an offset on the command line like this:
java -jar DwcA-reader-1.0-SNAPSHOT.jar hershkovitz 10 25
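How a limit and offset apply to an iterator can be sketched in a few lines. This is not the reader's own implementation: the class RecordWindow and its page method are hypothetical, and it assumes the two numbers on the command line are the limit followed by the offset, in the order the sentence above names them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RecordWindow {

  // Skips 'offset' records, then collects at most 'limit' records.
  static <T> List<T> page(Iterator<T> records, int limit, int offset) {
    for (int i = 0; i < offset && records.hasNext(); i++) {
      records.next(); // discard records before the window
    }
    List<T> page = new ArrayList<>();
    while (page.size() < limit && records.hasNext()) {
      page.add(records.next());
    }
    return page;
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5");
    // limit 2, offset 3
    System.out.println(page(rows.iterator(), 2, 3)); // prints [r3, r4]
  }
}
```

Because the helper only relies on java.util.Iterator, the same approach would work with any iterator over records, and an offset beyond the end simply yields an empty list.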
A simple dummy source code example on how to use this reader in your code is shown here:
package org.gbif.dwc.text;

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

import org.gbif.dwc.model.DarwinCoreRecord;
import org.gbif.dwc.model.ExtensionRecord;
import org.gbif.dwc.model.StarRecord;
import org.gbif.dwc.terms.DwcTerm;

public class UsageExample {

 public static void main(String[] args) throws IOException, UnsupportedArchiveException {
  // opens csv files with headers or dwc-a directories with a meta.xml descriptor
  Archive arch = ArchiveFactory.openArchive(new File("pontaurus.txt"));

  // does scientific name exist?
  if (!arch.getCore().hasTerm(DwcTerm.scientificName)){
   System.out.println("This application requires dwc-a with scientific names");
  }

  // loop over core darwin core records
  Iterator<DarwinCoreRecord> iter = arch.iteratorDwc();
  DarwinCoreRecord dwc;
  while (iter.hasNext()){
   dwc = iter.next();
   // process the record here (printing is just a placeholder)
   System.out.println(dwc);
  }

  // loop over star records. i.e. core with all linked extension records
  for (StarRecord rec : arch){
   // print the scientific name of the core record
   System.out.println(" - "+rec.value(DwcTerm.scientificName));
   for (ExtensionRecord erec : rec){
    // does this extension have Long/lat?
    if (rec.dataFile().hasTerm(DwcTerm.decimalLongitude) && rec.dataFile().hasTerm(DwcTerm.decimalLatitude)){
     System.out.println("Georeferenced: " + rec.value(DwcTerm.decimalLongitude)+","+rec.value(DwcTerm.decimalLatitude));
    }
   }
  }
 }
}

Tuesday 23 June 2009

Profiling memory usage of various String collections

Wanting to know the memory footprint and performance of different Java options for keeping simple string lookups in memory, I profiled different java.util collection classes filled with the same list of strings to see how much the memory usage differs. I then loaded the same data into 2 Lucene in-memory indices using Lucene's RAMDirectory. Finally, I also evaluated an embedded H2 database, both file based and in memory. The data loaded are 1,573,345 scientific name strings, the longest being about 150 characters. The original uncompressed text file is 31.6MB (zipped 8.1MB). To also test ID lookups in the case of the maps and the KVP Lucene index, the row number of each name has been used as its value. The machine I used for testing was a MacPro 8-core 3GHz with 5GB RAM, running Java6 with 2GB of memory (-Xmx2g) on Mac OSX 64bit. Here are the shortened results, using System.currentTimeMillis() for timing and JProfiler to inspect deep object copies in heap dumps (a seriously memory intensive thing too, which in some cases like Lucene and H2 even crashed):

Text file: 31.6 MB
Zipped: 8.1 MB

# 264MB
# contains test of 10.000 x 12 terms took 28 msecs
# uses HashMap internally...

# 256MB
# contains test of 10.000 x 12 terms took 98 msecs
# uses TreeMap internally...

# 300MB
# contains (key) test of 10.000 x 12 terms took 21 msecs
# includes Integer values as opposed to above Set

# 292MB
# contains (key) test of 10.000 x 12 terms took 93 msecs
# includes Integer values as opposed to above Set

# 172MB
# contains test of 100 x 12 terms took 43257 msecs

# 220MB
# contains test of 100 x 12 terms took 50564 msecs

String[] array
# 172MB
# 384 MB
# contains (key) test of 10000 x 12 terms took 23 msecs
# no generics support !

# 220 MB
# contains test of 100 x 12 terms took 58886 msecs

# 396 MB
# contains (key) test of 10000 x 12 terms took 18 msecs

# 276MB
# contains test of 10000 x 12 terms took 10 msecs

# 172MB
# contains test of 100 x 12 terms took 50961 msecs

# 331MB
# contains (key) test of 10000 x 12 terms took 65 msecs

# 185 MB
# contains test of 10000 x 12 terms took 47 msecs
# 331MB
# contains test of 10000 x 12 terms took 19 msecs
# 404MB
# contains (key) test of 10000 x 12 terms took 32 msecs

Lucene KVP index
# 94MB
# lucene 100 x 12 TermQueries took 58 msecs
# lucene 10000 x 12 TermQueries took 590 msecs
# in memory Index building a key value index with each term being a document and storing the value as a document field
# overhead of using an IndexSearcher to just the RAMDirectory is minimal
# 1573345 records loaded into Lucene KVP index in 13453 msecs

Lucene term index
# 44MB
# lucene 100 x 12 TermQueries took 32 msecs
# lucene 10000 x 12 TermQueries took 369 msecs
# in memory Index storing only the pure term index
# 1573345 records loaded into simple Lucene term index in 10222 msecs

H2 file based
# 33MB (connection object, not during querying)
# sql equals test of 10000 x 12 terms took 634 msecs
# file based db with 1 table, 1 indexed varchar(255) column
# 1573345 records loaded into H2 file database in 245025 msecs

H2 in memory
# ???MB (JProfiler requires >3gig memory)
# sql equals test of 10000 x 12 terms took 1024 msecs
# in memory db with 1 table, 1 indexed varchar(255) column
# 1573345 records loaded into H2 in memory database in 18760 msecs

Lucene came out with the smallest memory footprint at only 44MB. gnu.trove does best in terms of memory for classic sets, while javolution outperforms everyone else on speed, although its footprint is even slightly larger than that of java.util.HashSet. The performance of H2 is pretty similar to the KVP Lucene index. Building the H2 file db took more than 10 times longer than the in-memory one, so for regular updates and writes the in-memory db offers a lot more performance, but reads are a surprise: the file based version was nearly twice as fast as the in-memory one (634 vs 1024 msecs)! gnu.trove contains a lot of specialised classes to hold primitives as keys in sets or maps, for example. If one is using int or long keys this should give a much better footprint, but I haven't tested it as I am currently interested in Strings. For those interested, here is the source code that created the objects and did the time measurements (most tests are commented out in this particular revision).
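As a rough, independent sketch of how a "contains test of N x 12 terms" can be timed with System.currentTimeMillis() (this is not the original benchmark code), the following compares a HashSet and a TreeSet. The class ContainsTiming, its method names and the synthetic name list are all assumptions made for illustration, and memory usage is not measured here at all.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.TreeSet;

class ContainsTiming {

  // Runs 'rounds' passes over 'terms' and counts how many lookups hit.
  static int countHits(Collection<String> names, List<String> terms, int rounds) {
    int hits = 0;
    for (int r = 0; r < rounds; r++) {
      for (String term : terms) {
        if (names.contains(term)) {
          hits++;
        }
      }
    }
    return hits;
  }

  // Wall-clock milliseconds spent in the same lookup loop.
  static long timeContains(Collection<String> names, List<String> terms, int rounds) {
    long start = System.currentTimeMillis();
    countHits(names, terms, rounds);
    return System.currentTimeMillis() - start;
  }

  // Synthetic stand-in for the real list of scientific names.
  static List<String> syntheticNames(int n) {
    List<String> names = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      names.add("Genus species" + i);
    }
    return names;
  }

  public static void main(String[] args) {
    List<String> names = syntheticNames(100_000);
    // 12 lookup terms: 11 that exist and 1 that does not
    List<String> terms = new ArrayList<>(names.subList(0, 11));
    terms.add("Not a name");

    int rounds = 10_000;
    System.out.println("HashSet contains test of " + rounds + " x 12 terms took "
        + timeContains(new HashSet<>(names), terms, rounds) + " msecs");
    System.out.println("TreeSet contains test of " + rounds + " x 12 terms took "
        + timeContains(new TreeSet<>(names), terms, rounds) + " msecs");
  }
}
```

Timings from such a loop vary with JIT warm-up and garbage collection, so they are only useful for coarse comparisons like the ones listed above.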

Tuesday 12 May 2009

Deploying the portal web application

For building the web application

The steps for building and deploying the portal web application are as follows:

1) Download the source code at:

The modules needed are:
  • portal-core
  • portal-index
  • portal-service
  • portal-web
For instructions on how to check out these modules from SVN to your machine, please see

2) Once the modules have been saved to your machine, you need to build them. The portal-web module contains a script that automatically builds the whole project and downloads all the dependencies (libraries) from the repositories.

Script location: portal-web/

For building the database

1) In the portal-core project, the file db/portal.ddl builds the initial structure of the index DB for the portal.

mysql> create database portal;
mysql -u [username] -p [database] < /PATH_TO_FILE/portal.ddl;

2) To populate the database with the minimum data required, there is a file at db/ for this purpose.

mysql -u [username] -p [database] < /PATH_TO_FILE/;

Monday 27 April 2009

GBIF Maven Repository

GBIF uses Maven to build projects, manage dependencies and generate online javadocs as part of a Maven site. I would like to take the chance to introduce some basic Maven features that we use at GBIF.

Repository & Sites

We host a Maven repository that we use for keeping external, not yet mavenized libraries and for deploying our own developments. All projects can deploy a Maven site with javadocs, test coverage, dependencies and the regular Maven things. The subfolder will be named after the artifactId of the project, so make sure it's unique within GBIF! Apache has more information on customizing a Maven site per project.


All GBIF Maven projects should make use of our shared parent POM, which defines the repository and site URLs, the Apache 2 licensing, other popular Maven repositories and basic build rules. One of the most important settings in this parent POM is groupId=org.gbif. We would like all GBIF projects to share the same groupId, which means the artifactId has to be unique across all projects! This allows us to deploy Maven sites easily and reduces scattered code, so please don't override it! In case you need to add it manually to your project (not using the archetypes below), place the following at the top of your POM:
For existing poms please also update the following:
  • inherit from this pom (via a <parent> element at the very top)
  • remove groupIds, so that we all inherit "org.gbif"
  • update your artifactId so that it is unique (we now all share org.gbif)
  • add the new property googlecode.project so that SVN, issue tracking and project homepage are set out of the box: <googlecode.project>gbif-indexingtoolkit</googlecode.project>
  • optionally add/update developers


In order to deploy to the repository or create new maven sites, authentication is required. The parent POM contains all the public information, but you need to have a local maven settings file that contains at least the user and password. A minimal settings.xml looks like this:
     <password>the maven-user password, please email us if you need it</password>
On OSX this file should be sitting in your maven home folder, i.e.

Maven Archetypes

We host 2 basic GBIF archetypes that you can use to start a new project: a very simple one that basically only contains the link to the GBIF parent POM, and a struts2.1 web project that contains many dependencies and a common GBIF css theme (the theme is currently still under construction). To generate an empty GBIF Maven project simply do:
mvn archetype:create -DarchetypeGroupId=org.gbif -DarchetypeArtifactId=gbif-archetype -DarchetypeVersion=1.0-SNAPSHOT -DremoteRepositories= -DgroupId=org.gbif -DartifactId=myProjectName
The complete struts2.1 web application archetype with dependencies for struts2, hibernate3, spring2.5, lucene, hadoop and many more has archetypeArtifactId=gbif-war-archetype. The idea here is to maintain a single basic web project and to remove the dependencies you don't need from the pom. This webapp can already log in and log out. I hope to improve it over time to include a standard GBIF authentication and some generic REST / CRUD base actions instead of the REST plugin, but that hasn't been done so far. It is deployed as a snapshot and you can use it like this:
mvn archetype:create -DarchetypeGroupId=org.gbif -DarchetypeArtifactId=gbif-war-archetype -DarchetypeVersion=1.0-SNAPSHOT -DremoteRepositories= -DgroupId=org.gbif -DartifactId=myProjectName

Deploying to the repository

In order to deploy your project remember to have the settings.xml in place. When deploying, any existing version with the same combination of groupId, artifactId and version will be overwritten, so please follow these rules:
  • only deploy when it's working (ideally no tests fail)
  • use x.y-SNAPSHOT in case this is a "transient" release that gets overwritten each time. Proper releases without SNAPSHOT should only be deployed once! Maven caches non-snapshot releases, so clients would not pick up a redeployed version, and in any case a release should be a release and not be touched again.
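The second rule can be expressed as a small pre-deploy check. This is only a sketch: the class DeployPolicy and its methods are hypothetical, Maven does not run such a check for you, and whether a version already exists in the repository is passed in as a plain flag rather than looked up.

```java
class DeployPolicy {

  // Maven treats versions ending in -SNAPSHOT as transient builds.
  static boolean isSnapshot(String version) {
    return version.endsWith("-SNAPSHOT");
  }

  // Snapshots may be overwritten; proper releases are deployed only once.
  static boolean mayDeploy(String version, boolean alreadyInRepository) {
    return isSnapshot(version) || !alreadyInRepository;
  }

  public static void main(String[] args) {
    System.out.println(mayDeploy("1.0-SNAPSHOT", true)); // prints true
    System.out.println(mayDeploy("1.0", true));          // prints false
    System.out.println(mayDeploy("1.0", false));         // prints true
  }
}
```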
To deploy your current local copy to the repository simply issue the following command:
mvn deploy
For creating/updating the maven site do:
mvn site-deploy