Wednesday, 13 April 2011

Can IPT2 handle big datasets now?

One of IPT1's most serious problems was its inability to handle large datasets. For example, a dataset with only half a million records (relatively small compared to some of the biggest in the GBIF network) caused the application to slow down to such a degree that even the most patient users were throwing their hands up in dismay.
I wanted to see for myself whether the newest version, IPT2, has overcome the IPT’s problems with large datasets.

Here’s what I did to run the test. First, I connected to a MySQL database and used a “select * from … limit …” query to define my source data, totalling 24 million records (the same number of records as a large dataset coming from Sweden). Next, I mapped 17 columns to Darwin Core occurrence terms. Once this was done, I was able to start the publication of a Darwin Core Archive (DwC-A). The publication took just under 50 minutes to finish, processing approximately 500,000 records per minute. Take a look at the screenshot below, which was taken after the successful publication.

It is important to note that this test was run on a Tomcat server with only 256MB of memory. In fact, special care was taken during the design of IPT2 to ensure it could still run on older hardware without a lot of memory. This is one of the reasons why IPT2 is not as feature-rich as IPT1 was.
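As a quick sanity check, the timing quoted above can be recomputed from the two figures given in the test description. This is only a small arithmetic sketch; the variable names are my own, and the per-minute rate is the approximate value from the post, not a measured one.

```python
# Figures taken from the test description above.
TOTAL_RECORDS = 24_000_000      # records in the source query
RECORDS_PER_MINUTE = 500_000    # approximate processing rate

# Expected duration if the rate held steady for the whole run.
expected_minutes = TOTAL_RECORDS / RECORDS_PER_MINUTE

print(f"Expected publication time: {expected_minutes:.0f} minutes")
```

The result, 48 minutes, agrees with the reported "just under 50 minutes".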

So just how does IPT2 handle 24 million records coming from a database while running on a system with so little memory? The answer is that instead of fetching all records at once, IPT2 retrieves them in small result sets of only about 1000 records each. These result sets are then streamed to a file and immediately written to disk. The final DwC-A generated was 3.61GB in size, so some disk space is obviously needed too.
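The batching idea described above can be sketched in a few lines. This is not the IPT2 source code (IPT2 is a Java application); it is a minimal Python illustration under stated assumptions: an in-memory SQLite database stands in for MySQL, and the function names, the table, and its two columns are hypothetical. Only the batch size of about 1000 rows comes from the post.

```python
import csv
import sqlite3

BATCH_SIZE = 1000  # roughly the result-set size mentioned in the post


def export_in_batches(conn, query, out_path, batch_size=BATCH_SIZE):
    """Write query results to a tab-delimited file one small batch at a time.

    At most `batch_size` rows are held in Python memory at any moment.
    Returns the number of data rows written.
    """
    cursor = conn.cursor()
    cursor.execute(query)
    header = [col[0] for col in cursor.description]
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(header)
        while True:
            rows = cursor.fetchmany(batch_size)  # next small result set
            if not rows:
                break
            writer.writerows(rows)
            out.flush()  # push the batch to the file before fetching more
            written += len(rows)
    return written


def build_demo_db(n_rows):
    """Create a throwaway in-memory database (hypothetical stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrence (occurrenceID INTEGER, scientificName TEXT)"
    )
    conn.executemany(
        "INSERT INTO occurrence VALUES (?, ?)",
        ((i, f"Species {i}") for i in range(n_rows)),
    )
    conn.commit()
    return conn
```

A call such as export_in_batches(build_demo_db(2500), "SELECT * FROM occurrence", "occurrence.txt") writes a header line plus 2500 rows while never holding more than 1000 rows in memory. With a real MySQL driver, the default cursor may buffer the whole result on the client, so an unbuffered or server-side cursor would probably be needed to get the same effect; check the driver's documentation.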

In conclusion, I feel that IPT2 has successfully overcome its previous problems handling large datasets. I hope other adopters will now give it a shot themselves.


  1. Can't we also add the fact that there is no statistics calculation in the new IPT, which took a big amount of memory in the previous version?

  2. Statistics, deriving taxonomies, etc. were performance issues that were limited by both memory consumption and disk I/O. IPT 1.0 used the embedded Java database H2, so most work was actually SQL and disk bound, but being a Java database it also used JVM memory for its indexes. Geoserver was one of the main memory hogs in IPT 1.0. The simplification in IPT 2.0 makes it so much more robust - config and data are just standard files, rather than in a database (e.g. binary files).

  3. That's a fantastic result! Congrats to the whole team involved. Handling databases with 24 million records has always been a huge pain. Looks like not anymore...