Wanting to know the memory footprint and performance of different java options to keep simple string lookups in memory, I profiled different java.util collection classes that are filled with the same list of strings to see how much the memory usage differs. I then loaded the same data into 2 lucene in memory indices using lucenes RAMDirectory. I finally also evaluated an embedded file based & in memory H2 database. The data that has been loaded are 1.573.345 scientific name strings, the longest being about 150 characters. The original uncompressed text file is 31.6MB (zipped 8.1MB). To also test ID lookup in case of java.util.map or the KVP lucene index, the row number of each name has been used. The machine I used for testing was a MacPro 8-core 3GHz, 5GB RAM using Java6 with 2GB of memory (-Xmx2g) on Mac OSX 64bit. Here are the shortened results using System.currentTimeMillis() and JProfiler inspecting deep object copies in heap dumps (a seriously memory intensive thing too in some cases like lucene and H2 which even crashed):
Text file: 31.6 MB
Zipped: 8.1 MB
java.util.HashSet
# 264MB
# contains test of 10.000 x 12 terms took 28 msecs
# uses HashMap internally...
java.util.TreeSet
# 256MB
# contains test of 10.000 x 12 terms took 98 msecs
# uses TreeMap internally...
java.util.HashMap
# 300MB
# contains (key) test of 10.000 x 12 terms took 21 msecs
# includes Integer values as opposed to above Set
java.util.TreeMap
# 292MB
# contains (key) test of 10.000 x 12 terms took 93 msecs
# includes Integer values as opposed to above Set
java.util.ArrayList
# 172MB
# contains test of 100 x 12 terms took 43257 msecs
java.util.LinkedList
# 220MB
# contains test of 100 x 12 terms took 50564 msecs
String[] array
# 172MB
org.apache.commons.collections.map.HashedMap
# 384 MB
# contains (key) test of 10000 x 12 terms took 23 msecs
# no generics support !
javolution.util.FastList
# 220 MB
# contains test of 100 x 12 terms took 58886 msecs
javolution.util.FastMap
# 396 MB
# contains (key) test of 10000 x 12 terms took 18 msecs
javolution.util.FastSet
# 276MB
# contains test of 10000 x 12 terms took 10 msecs
javolution.util.FastTable
# 172MB
# contains test of 100 x 12 terms took 50961 msecs
gnu.trove.THashMap
# 331MB
# contains (key) test of 10000 x 12 terms took 65 msecs
gnu.trove.THashSet
# 185 MB
# contains test of 10000 x 12 terms took 47 msecs
com.google.common.collect.ImmutableSet
# 331MB
# contains test of 10000 x 12 terms took 19 msecs
com.google.common.collect.ImmutableMap
# 404MB
# contains (key) test of 10000 x 12 terms took 32 msecs
Lucene KVP index
# 94MB
# lucene 100 x 12 TermQueries took 58 msecs
# lucene 10000 x 12 TermQueries took 590 msecs
# in memory Index building a key value index with each term being a document and storing the value as a document field
# overhead of using an IndexSearcher to just the RAMDirectory is minimal
# 1573345 records loaded into Lucene KVP index in 13453 msecs
Lucene term index
# 44MB
# lucene 100 x 12 TermQueries took 32 msecs
# lucene 10000 x 12 TermQueries took 369 msecs
# in memory Index storing only the pure term index
# 1573345 records loaded into simple Lucene term index in 10222 msecs
H2 file based
# 33MB (connection object, not during querying)
# sql equals test of 10000 x 12 terms took 634 msecs
# file based db with 1 table, 1 indexed varchar(255) column
# 1573345 records loaded into H2 file database in 245025 msecs
H2 in memory
# ???MB (JProfiler requires >3gig memory)
# sql equals test of 10000 x 12 terms took 1024 msecs
# in memory db with 1 table, 1 indexed varchar(255) column
# 1573345 records loaded into H2 in memory database in 18760 msecs
Lucene came out with the least memory footprint of only 44MB, gnu.trove does best in terms of memory for classic sets, while javolution outperformes anyone else on speed. But their footprint is even slightly larger than the java.utilHashSet. Compared to KVP lucene the performance of H2 is pretty similar. Building the H2 file db took more than 10 times longer than the in memory one, so for regular updated and writes in memory offers lot more performance, but reads are a surprise. The file based version was nearly twice as fast as the in memory one! gnu.trove contains a lot of specialised classes to hold primitives as keys in sets or maps for example. In case one is using int or long this should be a much better footprint, but I haven't tested it as I am interested in Strings currently. For those interested, here is the source code that created the objects and did the time measurments (most tests are commented out in this particular revision).
I stumbled across another fast and efficient set implementation I highly recommend, the CompactHashSet from Ontopia.
ReplyDeleteQuick memory and performance comparisons with HashSet and the trove THashSet shows its superior to both:
*** HashSet ***
Memory before: 479152
Memory after: 8046768
Memory usage: 7567616
TIME: 210
Memory before: 478792
Memory after: 8048136
Memory usage: 7569344
TIME: 210
Memory before: 478792
Memory after: 8055912
Memory usage: 7577120
TIME: 200
*** CompactHashSet ***
Memory before: 478640
Memory after: 3707512
Memory usage: 3228872
TIME: 141
Memory before: 478640
Memory after: 3704128
Memory usage: 3225488
TIME: 138
Memory before: 478640
Memory after: 3703888
Memory usage: 3225248
TIME: 137
*** THashSet ***
Memory before: 478824
Memory after: 4300432
Memory usage: 3821608
TIME: 149
Memory before: 478824
Memory after: 4297480
Memory usage: 3818656
TIME: 149
Memory before: 478824
Memory after: 4305832
Memory usage: 3827008
TIME: 148