One of IPT1's most serious problems was its inability to handle large datasets. For example, a dataset with only half a million records (relatively small compared to some of the biggest in the GBIF network) caused the application to slow down to such a degree that even the most patient users were throwing their hands up in dismay.
Anyway, I wanted to see for myself whether the IPT's problems with large datasets had been overcome in the newest version: IPT2.
Here’s what I did to run the test. First, I connected to a MySQL database and used a “select * from … limit …” query to define my source data, totalling 24 million records (the same number of records as a large dataset coming from Sweden). Next, I mapped 17 columns to Darwin Core occurrence terms, and once this was done I was able to start publication of a Darwin Core Archive (DwC-A). The publication took just under 50 minutes to finish, processing approximately 500,000 records per minute. Take a look at the screenshot below, taken after the successful publication. It’s important to note that this test was run on a Tomcat server with only 256MB of memory. In fact, special care was taken during IPT2’s design to ensure it could still run on older hardware without much memory. This is one of the reasons why IPT2 is not as feature-rich as IPT1 was.
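As a quick back-of-the-envelope check on those numbers (this is just arithmetic, not part of the IPT itself):

```python
# Throughput implied by the run described above:
# 24 million records published in just under 50 minutes.
records = 24_000_000
minutes = 50
print(records // minutes)  # 480000 records/minute, consistent with the ~500,000/minute quoted
```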
So just how does IPT2 handle 24 million records coming from a database while running on a system with so little memory? The answer is that instead of returning all records at once, they are retrieved in small result sets of about 1,000 records each. These result sets are then streamed to file and immediately written to disk. The final DwC-A was 3.61GB in size, so some disk space is obviously needed too.
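The batched fetch-and-stream idea can be sketched in a few lines. This is a minimal illustration of the technique, not the IPT's actual code: the real IPT2 is Java talking to MySQL, whereas this sketch uses Python's built-in sqlite3 so it runs anywhere, and the table name, columns, and batch size are assumptions for the example.

```python
import csv
import sqlite3

# Stand-in source database with a few thousand rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (id INTEGER, scientific_name TEXT)")
conn.executemany(
    "INSERT INTO occurrence VALUES (?, ?)",
    [(i, f"Taxon {i}") for i in range(5000)],
)

BATCH = 1000  # roughly the result-set size described above

cur = conn.cursor()
cur.execute("SELECT id, scientific_name FROM occurrence")

rows_written = 0
with open("occurrence.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    while True:
        batch = cur.fetchmany(BATCH)  # only ~1,000 rows in memory at a time
        if not batch:
            break
        writer.writerows(batch)       # stream each batch straight to disk
        rows_written += len(batch)

print(rows_written)  # 5000
```

Because each batch is written out before the next is fetched, memory use stays bounded by the batch size rather than the dataset size, which is why 24 million records can pass through a 256MB server.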
In conclusion, I feel that IPT2 has successfully overcome its previous problems handling large datasets. I hope other adopters will now give it a shot themselves.