In our rollover process, which turns our raw harvested data into the interpreted occurrences you can see on our portal, we now have a step that calls a Web Service to turn geographical coordinates into country names. We use this to enrich and validate the incoming data.
This step in our process usually took about three to four hours, but last week it stopped working altogether, without any changes to the Web Service or the input data.
We've spent a lot of time trying to find the problem, and while we still can't say for sure what the exact problem is or was, we've found a fix that works for us and also allows us to make some assumptions about the cause of the failure.
The Web Service is a project called geocode-ws. It is a very simple project that uses MyBatis to call a PostgreSQL (8.4.2) & PostGIS (1.4.0) database, which does the GISy work of finding matches.
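The "GISy work" boils down to a point-in-polygon lookup. A minimal sketch of such a query, assuming hypothetical table and column names (`country_borders`, `iso_country_code`, `geom`), might look like this:

```sql
-- Illustrative reverse-geocoding lookup (table/column names are assumptions):
-- find the country whose polygon contains the given point.
SELECT iso_country_code
  FROM country_borders
 WHERE ST_Contains(geom, ST_SetSRID(ST_MakePoint(:longitude, :latitude), 4326));
```

The `ST_SetSRID` call tags the point with the WGS84 spatial reference system (SRID 4326) so it can be compared against the stored geometries.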
Our process started out fine. The first few million calls to the Web Service returned reasonably fast, but towards the end the process slowed down until it almost came to a complete stop, with response times of over ten minutes. That's when our Hadoop map tasks timed out and failed.
With hindsight, we should have come to our final conclusion much earlier, but it took us a while.
Looking for the problem
We looked at the PostgreSQL configuration and tweaked it a lot. We added a lot more logging and made sure to log any long-running statements (using the log_min_duration_statement option). We also made sure that our memory settings were sensible and that we weren't running out of memory. Looking at iostat and vmstat output as well as our Cacti monitoring confirmed that we weren't. PostgreSQL didn't seem to be the problem.
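For reference, the slow-statement logging is a one-line change in postgresql.conf; the one-second threshold below is just an example value:

```
# postgresql.conf -- log every statement that runs longer than 1 second.
# The value is in milliseconds; -1 disables this, 0 logs all statements.
log_min_duration_statement = 1000
```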
We also looked at the OS configuration itself, as well as the connectivity between our Hadoop cluster and the Tomcat and PostgreSQL server, but couldn't find the problem there either.
Then we began improving our Web Service and implemented a JMX MBean to get more detailed information about the process. While our changes should have improved the code base, they didn't fix the problem. Finally, we enabled GC logging on our Tomcat server (something we should have done much earlier, and which we will probably do by default for our servers in the future). We didn't do it earlier because the Web Service hadn't shown any symptoms of memory leak issues before, and we hadn't changed anything there. It hadn't even been restarted in a while.
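Enabling GC logging and remote JMX on Tomcat comes down to a few JVM flags. A sketch of what this could look like in Tomcat's `bin/setenv.sh` (the log path and port are examples, and on a networked server you would want JMX authentication enabled):

```shell
# bin/setenv.sh -- illustrative flags, not our exact production values.
CATALINA_OPTS="$CATALINA_OPTS \
  -verbose:gc \
  -Xloggc:/var/log/tomcat/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps \
  -Dcom.sun.management.jmxremote.port=9010 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```

The resulting gc.log is exactly what tools like GCViewer consume.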
But as it turned out, the problem was garbage collection. Unfortunately I can't provide pretty graphs because I've overwritten the GC logs, but it was very easy to see (using the awesome GCViewer) a typical pattern: minor collections not reclaiming all space, and memory usage growing until almost no memory could be reclaimed and most of the time was spent in garbage collection. We had found the problem! This explained our timeouts.
It still doesn't explain what was leaking, though. Having already spent that much time on it, we quickly gave up trying to pin down the exact cause. We suspect some combination of the MyBatis connection pool, the PostgreSQL JDBC driver and our configuration.
Our workaround (the MyBatis & C3P0 part)
So we looked around for other connection pools to use with MyBatis, but unfortunately we couldn't find a ready-made implementation. There are implementations in the MyBatis-Guice project, but they can only be used with the annotation-based configuration, and we're using XML.
We ended up writing our own implementation of a C3P0 DataSourceFactory, and it turned out to be very easy: it is just one class (JavaDoc here) with one line of code in it.
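A sketch of what such a factory can look like (the package and class names here are illustrative, not necessarily ours): MyBatis's UnpooledDataSourceFactory exposes a protected `dataSource` field and already takes care of applying the properties from the XML configuration to it, so the subclass only has to swap in C3P0's pooled DataSource.

```java
package org.example.mybatis;  // illustrative package name

import com.mchange.v2.c3p0.ComboPooledDataSource;
import org.apache.ibatis.datasource.unpooled.UnpooledDataSourceFactory;

/**
 * A MyBatis DataSourceFactory backed by a C3P0 connection pool.
 * UnpooledDataSourceFactory handles reading the <property> elements
 * from the MyBatis XML configuration and setting them on the
 * DataSource, so replacing the DataSource is all that is needed.
 */
public class C3P0DataSourceFactory extends UnpooledDataSourceFactory {

  public C3P0DataSourceFactory() {
    this.dataSource = new ComboPooledDataSource();
  }
}
```

It can then be referenced from the `<dataSource>` element of the MyBatis XML configuration, with the property names matching C3P0's (e.g. `jdbcUrl`, `user`) rather than those of the built-in pool:

```xml
<dataSource type="org.example.mybatis.C3P0DataSourceFactory">
  <property name="jdbcUrl" value="jdbc:postgresql://localhost/geocode"/>
  <property name="user" value="geocode"/>
</dataSource>
```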
This not only solved our apparent memory leak, it also increased performance by a factor of two to three. We haven't had a problem with our setup since.
We didn't have the time to find the real problem, but we found a solution that works for us. I suspect that, had we gone about this more methodically, we might have found the problem a lot sooner and perhaps identified its real cause. A few lessons learned:
- Enable GC logging!
- Enable JMX for Tomcat and set up your applications with useful metrics and logging
- Even though the use of profilers is heavily disputed, they can often help. We've found YourKit to be excellent
- Try to follow a logical route, change only one thing at a time, mock things to find a problem
- Monitor and graph your systems