Monday, 10 December 2012
"I noticed that the GBIF data portal has fewer records than it used to – what happened?"
Monday, 29 October 2012
The GBIF Registry is now dataset-aware!
But if you had visited the GBIF Registry and done the same search for the Academy of Natural Sciences before the Registry became dataset-aware, you would have seen that it has a DiGIR endpoint, but not that it has any datasets!
Now that the GBIF Registry is dataset-aware, however, the Registry page for the Academy of Natural Sciences shows that the organization owns 3 datasets, and has a (DiGIR) Technical Installation.
Wednesday, 17 October 2012
IPT v2.0.4 released
- Nicolas Noé (Belgian Biodiversity Platform, Belgium) - French
- TaiBIF, Taiwan - Traditional Chinese
- Laura Roldan Gomez, Dairo Escobar, and Daniel Amariles (Colombian Biodiversity Information System (SiB)) - Spanish
Friday, 13 July 2012
Getting started with DataCube on HBase
In a small database much of this would be trivial using aggregating functions (SUM(), COUNT() etc.) - see the short SQL sketch after the list below. As the volume grows, one often precalculates these metrics, which brings its own set of consistency challenges. As one outgrows a database, as GBIF has, we need to look for new mechanisms to manage these metrics. The features of DataCube that make it attractive to us are:
- A manageable process to modify the cube structure
- A higher level API to develop against
- Ability to rebuild the cube with a single pass over the source data
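To put the "trivial in a small database" claim in concrete terms, here is a minimal SQL sketch of one of the rollups (records per kingdom per country) against a hypothetical occurrence table whose columns mirror the sample data shown next; the table and column names are illustrative only, not GBIF's actual schema:

-- number of records per kingdom per country, computed on the fly
SELECT iso_country_code, kingdom, count(*) AS records
FROM occurrence
GROUP BY iso_country_code, kingdom;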
ID, Kingdom, ScientificName, Country, IsoCountryCode, BasisOfRecord, CellId, Year
1, Animalia, Puma concolor, Peru, PE, Observation, 13245, 1967
2, Plantae, Abies alba, Spain, ES, Observation, 3637, 2010
3, Plantae, Abies alba, Spain, ES, Observation, 3638, 2010

Suppose the following metrics are required, each of which is termed a rollup in OLAP:
- Number of records per country
- Number of records per kingdom
- Number of records georeferenced / not georeferenced
- Number of records per kingdom per country
- Number of records georeferenced / not georeferenced per country
- Number of records georeferenced / not georeferenced per kingdom
- Number of records georeferenced / not georeferenced per kingdom per country
/**
 * The cube definition (package access only).
 * Dimensions are Country, Kingdom and Georeferenced with counts available for:
 *   - Country (e.g. number of record in DK)
 *   - Kingdom (e.g. number of animal records)
 *   - Georeferenced (e.g. number of records with coordinates)
 *   - Country and kingdom (e.g. number of plant records in the US)
 *   - Country and georeferenced (e.g. number of records with coordinates in the UK)
 *   - Country and kingdom and georeferenced (e.g. number of bacteria records with coordinates in Spain)
 * TODO: write public utility exposing a simple API enabling validated read/write access to cube.
 */
class Cube {

  // no id substitution
  static final Dimension<String> COUNTRY =
    new Dimension<String>("dwc:country", new StringToBytesBucketer(), false, 2);
  // id substitution applies
  static final Dimension<String> KINGDOM =
    new Dimension<String>("dwc:kingdom", new StringToBytesBucketer(), true, 7);
  // no id substitution
  static final Dimension<Boolean> GEOREFERENCED =
    new Dimension<Boolean>("gbif:georeferenced", new BooleanBucketer(), false, 1);

  // Singleton instance if accessed through the instance() method
  static final DataCube<LongOp> INSTANCE = newInstance();

  // Not for instantiation
  private Cube() {
  }

  /**
   * Creates the cube.
   */
  private static DataCube<LongOp> newInstance() {
    // The dimensions of the cube
    List<Dimension<?>> dimensions = ImmutableList.<Dimension<?>>of(COUNTRY, KINGDOM, GEOREFERENCED);
    // The way the dimensions are "rolled up" for summary counting
    List<Rollup> rollups =
      ImmutableList.of(new Rollup(COUNTRY),
                       new Rollup(KINGDOM),
                       new Rollup(GEOREFERENCED),
                       new Rollup(COUNTRY, KINGDOM),
                       new Rollup(COUNTRY, GEOREFERENCED),
                       new Rollup(KINGDOM, GEOREFERENCED),
                       // more than 2 requires special syntax
                       new Rollup(ImmutableSet.<DimensionAndBucketType>of(
                         new DimensionAndBucketType(COUNTRY),
                         new DimensionAndBucketType(KINGDOM),
                         new DimensionAndBucketType(GEOREFERENCED))));
    return new DataCube<LongOp>(dimensions, rollups);
  }
}

In this code, we are making use of ID substitution for the kingdom. ID substitution is an inbuilt feature of DataCube whereby an auto-generated ID is used to substitute verbose coordinates (a value for a dimension). This is an important feature to help improve performance, as coordinates are used to construct the cube lookup keys, which translate into the key used for the HBase table. The substitution is achieved by using a simple table holding a running counter and a mapping table holding the field-to-id mapping. When inserting data into the cube, the counter is incremented (with custom locking to support concurrency within the cluster), the mapping is stored, and the counter value is used as the coordinate. When reading, the mapping table is used to construct the lookup key.

With the cube defined, we are ready to populate it. One could simply iterate over the source data and populate the cube with the likes of the following:

DataCubeIo<LongOp> dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
dataCubeIo.writeSync(new LongOp(1),
  new WriteBuilder(Cube.INSTANCE)
    .at(Cube.COUNTRY, "Spain") // for example
    .at(Cube.KINGDOM, "Animalia")
    .at(Cube.GEOREFERENCED, true)
);

However, one should consider what to do when you have the following inevitable scenarios:
- A new dimension or rollup is to be added to the running cube
- Changes to the source data have occurred without the cube being notified (e.g. through a batch load, or missing notifications due to messaging failures)
- Some disaster recovery requiring a cube rebuild
To handle these scenarios, DataCube provides a backfill process, which rebuilds the cube in four steps:
1. A snapshot of the live cube is taken and stored in a snapshot table
2. An offline cube is calculated from the source data and stored in a backfill table
3. The snapshot and live cube are compared to determine changes that were accepted in the live cube during the rebuilding process (step 2). These changes are then applied to the backfill table
4. The backfill is hot swapped to become the live cube
/**
 * The callback used from the backfill process to spawn the job to write the new data in the cube.
 */
public class BackfillCallback implements HBaseBackfillCallback {

  // Property keys passed in on the job conf to the Mapper
  static final String TARGET_TABLE_KEY = "gbif:cubewriter:targetTable";
  static final String TARGET_CF_KEY = "gbif:cubewriter:targetCF";
  // Controls the scanner caching size for the source data scan (100-5000 is reasonable)
  private static final int SCAN_CACHE = 200;
  // The source data table
  private static final String SOURCE_TABLE = "dc_occurrence";

  @Override
  public void backfillInto(Configuration conf, byte[] table, byte[] cf, long snapshotFinishMs) throws IOException {
    conf = HBaseConfiguration.create();
    conf.set(TARGET_TABLE_KEY, Bytes.toString(table));
    conf.set(TARGET_CF_KEY, Bytes.toString(cf));
    Job job = new Job(conf, "CubeWriterMapper");
    job.setJarByClass(CubeWriterMapper.class);
    Scan scan = new Scan();
    scan.setCaching(SCAN_CACHE);
    scan.setCacheBlocks(false);
    // we do not want to get bad counts in the cube!
    job.getConfiguration().set("mapred.map.tasks.speculative.execution", "false");
    job.getConfiguration().set("mapred.reduce.tasks.speculative.execution", "false");
    job.setNumReduceTasks(0);
    TableMapReduceUtil.initTableMapperJob(SOURCE_TABLE, scan, CubeWriterMapper.class, null, null, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    try {
      boolean b = job.waitForCompletion(true);
      if (!b) {
        throw new IOException("Unknown error with job. Check the logs.");
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
/**
 * The Mapper used to read the source data and write into the target cube.
 * Counters are written to simplify the spotting of issues, so look to the Job counters on completion.
 */
public class CubeWriterMapper extends TableMapper<NullWritable, NullWritable> {

  // TODO: These should come from a common schema utility in the future
  // The source HBase table fields
  private static final byte[] CF = Bytes.toBytes("o");
  private static final byte[] COUNTRY = Bytes.toBytes("icc");
  private static final byte[] KINGDOM = Bytes.toBytes("ik");
  private static final byte[] CELL = Bytes.toBytes("icell");

  // Names for counters used in the Hadoop Job
  private static final String STATS = "Stats";
  private static final String STAT_COUNTRY = "Country present";
  private static final String STAT_KINGDOM = "Kingdom present";
  private static final String STAT_GEOREFENCED = "Georeferenced";
  private static final String STAT_SKIPPED = "Skipped record";
  private static final String KINGDOMS = "Kingdoms";

  // The batch size to use when writing the cube
  private static final int CUBE_WRITE_BATCH_SIZE = 1000;

  static final byte[] EMPTY_BYTE_ARRAY = new byte[0];

  private DataCubeIo<LongOp> dataCubeIo;

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    super.cleanup(context);
    // ensure we're all flushed since batch mode
    dataCubeIo.flush();
    dataCubeIo = null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private Integer getValueAsInt(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null && v.length > 0) {
      return Bytes.toInt(v);
    }
    return null;
  }

  /**
   * Utility to read a named field from the row.
   */
  private String getValueAsString(Result row, byte[] cf, byte[] col) {
    byte[] v = row.getValue(cf, col);
    if (v != null && v.length > 0) {
      return Bytes.toString(v);
    }
    return null;
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result row, Context context) throws IOException, InterruptedException {
    String country = getValueAsString(row, CF, COUNTRY);
    String kingdom = getValueAsString(row, CF, KINGDOM);
    Integer cell = getValueAsInt(row, CF, CELL);
    WriteBuilder b = new WriteBuilder(Cube.INSTANCE);
    if (country != null) {
      b.at(Cube.COUNTRY, country);
      context.getCounter(STATS, STAT_COUNTRY).increment(1);
    }
    if (kingdom != null) {
      b.at(Cube.KINGDOM, kingdom);
      context.getCounter(STATS, STAT_KINGDOM).increment(1);
      context.getCounter(KINGDOMS, kingdom).increment(1);
    }
    if (cell != null) {
      b.at(Cube.GEOREFERENCED, true);
      context.getCounter(STATS, STAT_GEOREFENCED).increment(1);
    }
    if (b.getBuckets() != null && !b.getBuckets().isEmpty()) {
      dataCubeIo.writeSync(new LongOp(1), b);
    } else {
      context.getCounter(STATS, STAT_SKIPPED).increment(1);
    }
  }

  // Sets up the DataCubeIo with IdService etc.
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    HTablePool pool = new HTablePool(conf, Integer.MAX_VALUE);
    IdService idService = new HBaseIdService(conf, Backfill.LOOKUP_TABLE, Backfill.COUNTER_TABLE, Backfill.CF, EMPTY_BYTE_ARRAY);
    byte[] table = Bytes.toBytes(conf.get(BackfillCallback.TARGET_TABLE_KEY));
    byte[] cf = Bytes.toBytes(conf.get(BackfillCallback.TARGET_CF_KEY));
    DbHarness<LongOp> hbaseDbHarness =
      new HBaseDbHarness<LongOp>(pool, EMPTY_BYTE_ARRAY, table, cf, LongOp.DESERIALIZER, idService, CommitType.INCREMENT);
    dataCubeIo = new DataCubeIo<LongOp>(Cube.INSTANCE, hbaseDbHarness, CUBE_WRITE_BATCH_SIZE, Long.MAX_VALUE, SyncLevel.BATCH_SYNC);
  }
}

With the callback written, all that is left to populate the cube is to run the backfill. Note that this process can also be used to bootstrap the live cube for the first time:
// The live cube table
final byte[] CUBE_TABLE = "dc_cube".getBytes();
// Snapshot of the live table used during backfill
final byte[] SNAPSHOT_TABLE = "dc_snapshot".getBytes();
// Backfill table built from the source
final byte[] BACKFILL_TABLE = "dc_backfill".getBytes();
// Utility table to provide a running count for the identifier service
final byte[] COUNTER_TABLE = "dc_counter".getBytes();
// Utility table to provide a mapping from source values to assigned identifiers
final byte[] LOOKUP_TABLE = "dc_lookup".getBytes();
// All DataCube tables use a single column family
final byte[] CF = "c".getBytes();

HBaseBackfill backfill =
  new HBaseBackfill(
    conf,
    new BackfillCallback(), // our implementation
    CUBE_TABLE,
    SNAPSHOT_TABLE,
    BACKFILL_TABLE,
    CF,
    LongOp.LongOpDeserializer.class);
backfill.runWithCheckedExceptions();

While HBase provides the storage for the cube, a backfill could be implemented against any source data, such as from a database over JDBC or from text files stored on a Hadoop filesystem. Finally we want to be able to read our cube:
DataCubeIo<LongOp> dataCubeIo = setup(Cube.INSTANCE); // omitted for brevity
Optional<LongOp> result =
  dataCubeIo.get(
    new ReadBuilder(Cube.INSTANCE)
      .at(Cube.COUNTRY, "DK")
      .at(Cube.KINGDOM, "Animalia"));
// need to check if this coordinate combination hit anything in the cube
if (result.isPresent()) {
  LOG.info("Animal records in Denmark: " + result.get().getLong());
}

All the source code for the above is available in the GBIF labs svn.
Many thanks to Dave Revell at UrbanAirship for his guidance.
Wednesday, 11 July 2012
Optimizing Writes in HBase
The problem
java.io.IOException: org.apache.hadoop.hbase.client.ScannerTimeoutException: 63882ms passed since the last invocation, timeout is currently set to 60000
We did some digging and found that this exception happens when the scanner's next() method hasn't been called within the timeout limit. We simplified our test case by reproducing the same error with a simple CopyTable operation (the one that ships with HBase), and again using Hive to do an INSERT OVERWRITE TABLE A SELECT * FROM B. In both cases mappers are assigned a split to scan based on TableInputFormat (just like our Hive jobs), and as they scan they simply put each record out to the new table. Something is holding up the loop as it tries to put, preventing it from calling next(), and thereby triggering the timeout exception.
The logs
WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e. has too many store files; delaying flush up to 90000ms
INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 7 on 60020' on region occurrence,\x17\xF1o\x9C,1340981109494.ecb85155563c6614e5448c7d700b909e.: memstore size 128.2m is >= than blocking 128.0m size
Why it blocks
- if the flush thread wakes up and the memstores are greater than the lower limit, it will start flushing (starting with the current biggest memstore) until it gets below the limit.
- if the flush thread wakes up and the memstores are greater than the upper limit, it will block updates and start flushing until it gets under the lower limit, at which point it unblocks updates.
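For reference, those upper and lower thresholds are controlled by the global memstore settings in hbase-site.xml. The following is a sketch using what we believe were the defaults for this generation of HBase; the exact property names and defaults vary by version, so verify them against your release:

<!-- fraction of heap at which updates are blocked and flushing is forced (assumed default 0.4) -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.4</value>
</property>
<!-- flushing continues until total memstore usage drops below this fraction (assumed default 0.35) -->
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.35</value>
</property>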
Stopping the blocking
- first we tried 128MB with block multiplier still at 2. This still produced too many storefiles and caused the same blocking (maybe a little better than at 64MB)
- then tried 256MB with multiplier of 4. The logs and ganglia showed that the flushes were happening well before 256MB (still around 130MB) "due to global heap pressure" - a sign that total memstores were consuming too much heap. This meant we were still generating too many files and got the same blocking problem, but with the "memstore blocking limit" set to 1GB the memstore blocking happened much less often, and later in the process (still killed the mappers though)
Now what?
Things looked kind of grim. After endless poring over ganglia charts, we kept coming back to one unexplained blip that seemed to coincide with the start of the storefile explosion that eventually killed the jobs.
Figure 1: Average memstore flush size over time
From the start of Figure 1 we can see that things appear to be going smoothly - the memstores are flushing at or just above 256MB, which means they have enough heap and are doing their jobs. From the logs we see the flushes happening fine, but there are regular lines like the following:
INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Under global heap pressure: Region uat_occurrence,\x06\x0E\xAC\x0F,1341574060728.ab7fed6ea92842941f97cb9384ec3d4b. has too many store files, but is 625.1m vs best flushable region's 278.2m. Choosing the bigger.
This isn't quite as bad as the "delaying flush" line, but it shows that we're on the limit of what our heap can handle. Then starting from around 12:20 we see more and more like the following:
WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region uat_occurrence,"\x98=\x1C,1341567129447.a3a6557c609ad7fc38815fdcedca6c26. has too many store files; delaying flush up to 90000ms
and then to top it off:
INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=35, maxlogs=32; forcing flush of 1 regions(s): ab7fed6ea92842941f97cb9384ec3d4b
INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up with memory above low water.
So what we have is memstores initially being forced to flush because of minor heap pressure (adding storefiles faster than we can compact). Then we have memstores delaying flushes because of too many storefiles (memstores start getting bigger - our graph spike). Then the write ahead log (WAL) complains about too many of its logs, which forces a memstore flush (so that the WAL HLog can be safely discarded - this again adds storefiles). And for good measure the flushing thread now wakes up, finds it's out of heap, and starts attempting flushes, which just aggravates the problem (adding more storefiles to the pile). The failure doesn't happen immediately, but we're past the point of no return - by about 13:00 the memstores are reaching the "memstore blocking limit" and our mappers die.
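As an aside, the "Too many hlogs" trigger above is governed by the WAL log cap. A hedged hbase-site.xml sketch of the relevant property follows; the default of 32 matches the maxlogs=32 in the log line, but check the property name against your HBase version:

<!-- maximum number of WAL files before regions are force-flushed so old logs can be discarded -->
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value>
</property>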
What's left?
Knowing what's going on in the memstore flush size analysis is comforting, but just reinforces what we already knew - the problem is too many storefiles. So what's left? Raising the max number of storefiles, that's what! Every storefile in a store consumes resources (file handles, xceivers, heap for holding metadata), which is why it's limited to a maximum of a very modest 7 files by default. But for us this level of write load is rare - in normal operations we won't hit anything like this, and having the capacity to write a whole bunch of storefiles over short-ish, infrequent bursts is relatively safe, since we know our nightly major compaction will clean them up again (and thereby free up all those extra resources). Crank that maximum up to 200 and, finally, the process works!
Conclusion
<!-- default is 256MB 268435456, this is 1.5GB -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1610612736</value>
</property>
<!-- default is 2 -->
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <value>4</value>
</property>
<!-- default is 64MB 67108864 -->
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>
<!-- default is 7, should be at least 2x compactionThreshold -->
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>200</value>
</property>
Many thanks to Lars George, who helped us get through the "Now What?" phase by digging deep into the logs and the source to help us work out what was going on.
Monday, 18 June 2012
Launch of the Canadensys explorer
At Canadensys we already adopted and customized the IPT as our data repository. With the data of our network being served by the IPT, we have now built a tool to aggregate and explore these data. For an overview of how we built our network, see this presentation. The post below originally appeared on the Canadensys blog.
We are very pleased to announce the beta version of the Canadensys explorer. The tool allows you to explore, filter, visualize and download all the specimen records published through the Canadensys network.
The explorer currently aggregates nine published collections, comprising over half a million specimen records, with many more to come in the near future. All individual datasets are available on the Canadensys repository and via the Global Biodiversity Information Facility (GBIF) as well. The main functionalities of the explorer are listed below, but we encourage you to discover them for yourself instead. We hope it is intuitive. For the best user experience, please use an up-to-date version of your browser.
Happy exploring: http://data.canadensys.net/explorer
Functionalities
- The explorer is a one page tool, limiting unnecessary navigation.
- The default view shows all the data, allowing users to get an overview and explore immediately.
- Data can be queried by using and combining filters.
- Filters use smart suggestions, informing the user how often their search term occurs even before they search.
- The exact number of results is displayed, including the number of georeferenced records.
- The map view displays all georeferenced records for the current query, has different base layer options and can be zoomed in to any level.
- Points on the map can be clicked for more information.
- The table view displays a sortable preview of the records in the current query.
- Records in the table can be clicked for more information in the same way as on the map.
- The number of columns in the table responds to the screen width and can be controlled by the user in the display panel.
- Data for the current query can be downloaded as a Darwin Core archive. There is no limit on the number of records that can be downloaded.
- Users can download the data by providing their email address. Once the download package is generated, the user receives an email with a link to the data, information regarding the usage norms and a suggested citation.
- The interface and emails are available in French and English.
As this is a beta version, you may encounter issues. Please report them by clicking the feedback button on the right, which will open a report form.
Technical details
The Canadensys explorer was developed using the following open source tools:
- Spring, a Java framework used for the backend.
- ThreeTen, a date/time Java library, used for cleaning the eventDate field.
- GBIF parsers, a Java library used for cleaning the country field.
- ECAT common, a Java library used for parsing the scientificName field.
- Darwin Core archive reader, a Java library used to import the data in the backend and to create the user generated downloads.
- PostgreSQL, the database.
- PostGIS, a geospatial extension to the database.
- Mapnik, a tool to generate the maps and provide the map interactivity.
- Windshaft, a tile server.
- Freemarker, a Java template engine, used to structure the frontend.
- Backbone.js, a MVC structure for web applications, used for handling the filters.
Tuesday, 12 June 2012
Taxonomic Trees in PostgreSQL
Pro parte synonyms aside, taxonomic data follows a classic hierarchical tree structure. In relational databases such a tree is commonly represented by 3 models known as the adjacency list, the materialized path and the nested set model. There are many comparisons out there listing pros and cons, for example on Stack Overflow, the slides by Lorenzo Alberton or Bill Karwin, or a Postgres-specific performance comparison between the adjacency model and a nested set.
Checklist Bank
At GBIF we use PostgreSQL to store taxonomic trees, which we refer to as checklists, in Checklist Bank. At the core there is a single table name_usage which contains records, each representing a single taxon in the tree [note: in this post I am using the term taxon broadly, covering both accepted taxa and synonyms]. It primarily uses the adjacency model with a single foreign key parent_fk, which is null for the root elements of the tree. The simplified diagram of the main tables looks like this (actually there are some 20 extra fixed-width columns left out from name_usage here for simplicity):
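In lieu of the diagram, here is a simplified, hypothetical DDL sketch of name_usage restricted to the columns discussed in this post; it is an illustration of the model, not the actual Checklist Bank schema:

CREATE TABLE name_usage (
  id           integer PRIMARY KEY,
  checklist_fk integer NOT NULL,                     -- the checklist (taxonomy) this usage belongs to
  parent_fk    integer REFERENCES name_usage (id),   -- adjacency model; null for root taxa
  rank_fk      integer,                              -- taxonomic rank
  name_fk      integer,                              -- scientific name
  path         ltree,                                -- materialized path via the ltree extension
  mpath        integer[],                            -- materialized path via the intarray extension
  mspath       text,                                 -- materialized path as a simple string
  lft          integer,                              -- nested set left boundary
  rgt          integer                               -- nested set right boundary
);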
For certain searches though an additional index is required. In particular listing all descendants of a taxon, i.e. all members of a subtree, is a common operation that would otherwise involve a recursive function or Common Table Expression. So far we have been using nested sets, but experiencing some badly performing queries lately I decided to do a quick evaluation of different options for Postgres:
- the ltree extension which implements a materialized path and provides many powerful operations.
- the intarray extension to manually manage a materialized path as an array of non null integers - the primary keys of the parent taxa.
- a simple varchar field holding the same materialized path
- the current nested set index using lft and rgt integer columns, which are unique only within a single taxonomy and therefore have to be combined with a checklist_key.
Test Environment
For the tests I am using the current Checklist Bank, which contains 14,907,828 records in total, spread across 107 checklists varying in size between 7 and 4.25 million records. I am querying the GBIF Backbone Taxonomy, which is the largest checklist, containing 4,251,163 records with 9 root taxa and a maximum depth of 11 levels. For the queries I have picked 2 specific taxa with different positions in the taxonomic tree:
- 44 Vertebrata, the class covering all vertebrates with approximately 84,100 species in this nub.
- 2684876 Abies the fir genus covering 167 species in this nub and exactly 609 descendants.
All tests are executed on a 2 GHz i7 MacBook Pro with 8GB of memory running Postgres 9.1.1. All queries have been executed 3 times before EXPLAIN ANALYZE was used, to get a rough idea of how differently the various indices behave.
Creating the indices
In order to use the extensions they need to be compiled at installation time and enabled in the database. In Postgres 9.1 you can do the latter by executing in psql:
CREATE EXTENSION ltree; CREATE EXTENSION intarray;
The individual data types require different indices. We set up the following indices:
-- general indices
CREATE INDEX nu_chkl_idx ON name_usage (checklist_fk);
CREATE INDEX nu_parent_idx ON name_usage (parent_fk);
-- tree specifics
CREATE INDEX nu_path_idx ON name_usage USING GIST (path);
CREATE INDEX nu_mpath_idx ON name_usage USING GIN (mpath gin__int_ops);
CREATE INDEX nu_mspath_idx ON name_usage (mspath);
CREATE INDEX nu_ckl_lft_rgt_idx ON name_usage (checklist_fk, lft, rgt);
The ltree GIST and intarray GIN indices are rather expensive to create/maintain, but they are selected for best read performance.
Populating the indices
The data being imported into Checklist Bank comes with a parent-child relation, so parent_fk is populated already. For all other models we have to populate the path or lft/rgt columns first.
ltree, intarray, string path
For all materialized paths I simply run the following SQL until no new updates happened:
-- update root paths once
UPDATE name_usage u
SET path = text2ltree(cast(u.id as text)),
    mspath = cast(u.id as text),
    mpath = array[u.id]
WHERE u.parent_fk is null;

-- update until no more records are updated
UPDATE name_usage u
SET path = p.path || text2ltree(cast(u.id as text)),
    mspath = p.mspath || '.' || u.id,
    mpath = p.mpath || array[u.id]
FROM name_usage p
WHERE u.parent_fk = p.id AND p.mspath is not null AND u.mspath is null;
nested sets
I've created 2 functions and one sequence to populate the lft/rgt indices for every checklist:
-- NESTED SET UTILITY FUNCTIONS
CREATE SEQUENCE name_usage_nestidx START 1;

CREATE FUNCTION update_nested_set_usage(integer) RETURNS BOOLEAN AS $$
BEGIN
  UPDATE name_usage SET lft = nextval('name_usage_nestidx')-1 WHERE id=$1;
  PERFORM update_nested_set_usage(id) FROM name_usage WHERE parent_fk=$1 ORDER BY rank_fk, name_fk;
  UPDATE name_usage SET rgt = nextval('name_usage_nestidx')-1 WHERE id=$1;
  RETURN true;
END
$$ LANGUAGE plpgsql;

CREATE FUNCTION build_nested_set_indices(integer) RETURNS BOOLEAN AS $$
BEGIN
  PERFORM setval('name_usage_nestidx', 1);
  PERFORM update_nested_set_usage(id) FROM name_usage WHERE parent_fk is null AND checklist_fk=$1 ORDER BY rank_fk, name_fk;
  RETURN true;
END
$$ LANGUAGE plpgsql;
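As a usage sketch, building the lft/rgt values for one checklist is then a single call (assuming checklist_fk = 1 is the GBIF Backbone, as in the hardcoded examples later in this post):

SELECT build_nested_set_indices(1);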
Query for Descendants
The queries return the first page of 100 descendant records.
adjacency
A recursive postgres CTE query does the trick:
WITH RECURSIVE d AS (
  SELECT id FROM name_usage WHERE id = 44
  UNION ALL
  SELECT c.id FROM d JOIN name_usage c ON c.parent_fk = d.id
)
SELECT * FROM d ORDER BY id LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=74360.25..74360.50 rows=100 width=4) (actual time=7.306..7.337 rows=100 loops=1) CTE d -> Recursive Union (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.041..6.017 rows=609 loops=1) -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=4) (actual time=0.038..0.040 rows=1 loops=1) Index Cond: (id = 2684876) -> Nested Loop (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.240..1.393 rows=152 loops=4) -> WorkTable Scan on d (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.041 rows=152 loops=4) -> Index Scan using usage_parent_idx on name_usage c (cost=0.00..715.75 rows=306 width=8) (actual time=0.006..0.008 rows=1 loops=609) Index Cond: (parent_fk = d.id) -> Sort (cost=1779.24..1855.64 rows=30561 width=4) (actual time=7.304..7.316 rows=100 loops=1) Sort Key: d.id Sort Method: top-N heapsort Memory: 29kB -> CTE Scan on d (cost=0.00..611.22 rows=30561 width=4) (actual time=0.046..6.678 rows=609 loops=1) Total runtime: 7.559 ms
Query Plan: Vertebrata
Limit (cost=74360.25..74360.50 rows=100 width=4) (actual time=2065.053..2065.080 rows=100 loops=1) CTE d -> Recursive Union (cost=0.00..72581.02 rows=30561 width=4) (actual time=0.105..1797.898 rows=264325 loops=1) -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=4) (actual time=0.101..0.103 rows=1 loops=1) Index Cond: (id = 44) -> Nested Loop (cost=0.00..7195.91 rows=3056 width=4) (actual time=0.915..182.695 rows=29369 loops=9) -> WorkTable Scan on d (cost=0.00..0.20 rows=10 width=4) (actual time=0.007..8.056 rows=29369 loops=9) -> Index Scan using usage_parent_idx on name_usage c (cost=0.00..715.75 rows=306 width=8) (actual time=0.004..0.005 rows=1 loops=264325) Index Cond: (parent_fk = d.id) -> Sort (cost=1779.24..1855.64 rows=30561 width=4) (actual time=2065.050..2065.062 rows=100 loops=1) Sort Key: d.id Sort Method: top-N heapsort Memory: 29kB -> CTE Scan on d (cost=0.00..611.22 rows=30561 width=4) (actual time=0.109..1986.606 rows=264325 loops=1) Total runtime: 2080.248 ms
As expected the recursive query does a pretty good job if the subtree is small. But for the large vertebrate subtree it's rather slow, because we first get all descendants and then apply a limit.
ltree
With ltree you have various options to query for a subtree. You can use a path lquery with ~, a full text ltxtquery via @ or the ltree <@ isDescendant operator.
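For concreteness, hedged, illustrative examples of the first two operator styles are shown here (these are not taken from the benchmark itself; 44 is the Vertebrata id used throughout):

-- unanchored lquery: any path containing node 44 anywhere
SELECT id FROM name_usage WHERE path ~ '*.44.*'::lquery;
-- full text ltxtquery: any path mentioning the label 44
SELECT id FROM name_usage WHERE path @ '44'::ltxtquery;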
Unanchored lqueries for any path containing a given node turned out to be very, very slow. This is somewhat expected, because they can't use the index properly. Surprisingly, even the anchored queries and the full text query were far too slow to be useful at all. The fastest option was clearly the native descendants operator:
WITH p AS (
  SELECT path FROM name_usage WHERE id=44
)
SELECT u.id FROM name_usage u, p
WHERE u.path <@ p.path
ORDER BY u.id LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=54526.09..54526.34 rows=100 width=4) (actual time=2.926..2.963 rows=100 loops=1) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=120) (actual time=0.030..0.034 rows=1 loops=1) Index Cond: (id = 2684876) -> Sort (cost=54515.41..54551.25 rows=14336 width=4) (actual time=2.923..2.942 rows=100 loops=1) Sort Key: u.id Sort Method: top-N heapsort Memory: 29kB -> Nested Loop (cost=2094.99..53967.50 rows=14336 width=4) (actual time=1.094..2.230 rows=609 loops=1) -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32) (actual time=0.035..0.040 rows=1 loops=1) -> Bitmap Heap Scan on name_usage u (cost=2094.99..53788.28 rows=14336 width=124) (actual time=1.051..1.952 rows=609 loops=1) Recheck Cond: (path <@ p.path) -> Bitmap Index Scan on nu_path_idx (cost=0.00..2091.40 rows=14336 width=0) (actual time=1.023..1.023 rows=609 loops=1) Index Cond: (path <@ p.path) Total runtime: 3.068 ms
Query Plan: Vertebrata
Limit (cost=54526.09..54526.34 rows=100 width=4) (actual time=512.420..512.445 rows=100 loops=1) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=120) (actual time=0.091..0.094 rows=1 loops=1) Index Cond: (id = 44) -> Sort (cost=54515.41..54551.25 rows=14336 width=4) (actual time=512.417..512.432 rows=100 loops=1) Sort Key: u.id Sort Method: top-N heapsort Memory: 29kB -> Nested Loop (cost=2094.99..53967.50 rows=14336 width=4) (actual time=115.119..428.632 rows=264325 loops=1) -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32) (actual time=0.096..0.100 rows=1 loops=1) -> Bitmap Heap Scan on name_usage u (cost=2094.99..53788.28 rows=14336 width=124) (actual time=115.016..372.070 rows=264325 loops=1) Recheck Cond: (path <@ p.path) -> Bitmap Index Scan on nu_path_idx (cost=0.00..2091.40 rows=14336 width=0) (actual time=109.791..109.791 rows=264325 loops=1) Index Cond: (path <@ p.path) Total runtime: 512.723 ms
intarray
As a node in the tree appears only once, we can query for all usages that have the node id in their array but are not that very record.
SELECT u.id FROM name_usage u WHERE u.mpath @@ '44' and u.id != 44 ORDER BY u.id LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=52491.99..52492.24 rows=100 width=4) (actual time=1.925..1.966 rows=100 loops=1) -> Sort (cost=52491.99..52527.83 rows=14336 width=4) (actual time=1.923..1.942 rows=100 loops=1) Sort Key: id Sort Method: top-N heapsort Memory: 29kB -> Bitmap Heap Scan on name_usage u (cost=179.10..51944.08 rows=14336 width=4) (actual time=0.682..1.219 rows=608 loops=1) Recheck Cond: (mpath @@ '2684876'::query_int) Filter: (id <> 2684876) -> Bitmap Index Scan on nu_mpath_idx (cost=0.00..175.52 rows=14336 width=0) (actual time=0.646..0.646 rows=609 loops=1) Index Cond: (mpath @@ '2684876'::query_int) Total runtime: 2.052 ms
Query Plan: Vertebrata
Limit (cost=52491.99..52492.24 rows=100 width=4) (actual time=377.851..377.877 rows=100 loops=1) -> Sort (cost=52491.99..52527.83 rows=14336 width=4) (actual time=377.849..377.861 rows=100 loops=1) Sort Key: id Sort Method: top-N heapsort Memory: 29kB -> Bitmap Heap Scan on name_usage u (cost=179.10..51944.08 rows=14336 width=4) (actual time=115.634..298.193 rows=264324 loops=1) Recheck Cond: (mpath @@ '44'::query_int) Filter: (id <> 44) -> Bitmap Index Scan on nu_mpath_idx (cost=0.00..175.52 rows=14336 width=0) (actual time=110.776..110.776 rows=264325 loops=1) Index Cond: (mpath @@ '44'::query_int) Total runtime: 378.131 ms
string path
Just for completeness, and to compare relative performance, I am trying an anchored pattern match against a varchar-based materialized path. In order to let Postgres use the mspath index we must also order by that value, not id:
WITH p AS (
  SELECT mspath FROM name_usage WHERE id=44
)
SELECT u.id FROM name_usage u, p
WHERE u.mspath LIKE p.mspath || '.%'
ORDER BY u.mspath LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=10.68..80535.72 rows=100 width=36) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=32) Index Cond: (id = 2684876) -> Nested Loop (cost=0.00..60022560.36 rows=74539 width=36) Join Filter: (u.mspath ~~ (p.mspath || '.%'::text)) -> Index Scan using nu_mspath_idx on name_usage u (cost=0.00..59649864.65 rows=14907828 width=36) -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32)
I don't know what is going on here, but I can't make Postgres use the mspath btree index properly. The query therefore is very, very slow. If I use a hardcoded path instead of p.mspath the index is used, so I'll show results for the hardcoded path instead - if anyone has an idea how to avoid the table scan in the SQL above, please let me know.
SELECT u.id FROM name_usage u WHERE u.mspath LIKE '6.101.194.640.3925.2684876.%' ORDER BY u.mspath LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=0.00..400.95 rows=100 width=36) (actual time=0.105..0.549 rows=100 loops=1) -> Index Scan using nu_mspath_idx on name_usage u (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.102..0.516 rows=100 loops=1) Index Cond: ((mspath >= '6.101.194.640.3925.2684876.'::text) AND (mspath < '6.101.194.640.3925.2684876/'::text)) Filter: (mspath ~~ '6.101.194.640.3925.2684876.%'::text) Total runtime: 0.637 ms
Query Plan: Vertebrata
Limit (cost=0.00..400.95 rows=100 width=36) (actual time=0.076..0.472 rows=100 loops=1) -> Index Scan using nu_mspath_idx on name_usage u (cost=0.00..298863.67 rows=74539 width=36) (actual time=0.074..0.435 rows=100 loops=1) Index Cond: ((mspath >= '1.44.'::text) AND (mspath < '1.44/'::text)) Filter: (mspath ~~ '1.44.%'::text) Total runtime: 0.561 ms
The hardcoded results are extremely fast - much faster than any of the specialised indices above. Retrying ltree with a hardcoded path, for example, doesn't speed up the ltree query. The vast speed gain here is because Postgres can use the b-tree index for sorted output and therefore really benefit from the limit. GiST and GIN indices on the other hand are not suitable for sorting.
nested set
A very different query model from the above, which all use some sort of materialized path. We also need to add the checklist_fk condition because the nested set indices are only unique within each taxonomy, not across all.
SELECT u.id FROM name_usage u JOIN name_usage p ON u.checklist_fk=p.checklist_fk WHERE p.id=44 and u.lft BETWEEN p.lft and p.rgt ORDER BY u.lft LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=81687.08..81687.33 rows=100 width=8) (actual time=2.030..2.076 rows=100 loops=1) -> Sort (cost=81687.08..82326.50 rows=255769 width=8) (actual time=2.029..2.052 rows=100 loops=1) Sort Key: u.lft Sort Method: top-N heapsort Memory: 29kB -> Nested Loop (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.062..1.482 rows=609 loops=1) -> Index Scan using usage_pkey on name_usage p (cost=0.00..10.68 rows=1 width=12) (actual time=0.028..0.029 rows=1 loops=1) Index Cond: (id = 2684876) -> Index Scan using nu_ckl_lft_rgt_idx on name_usage u (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.029..1.182 rows=609 loops=1) Index Cond: ((checklist_fk = p.checklist_fk) AND (lft >= p.lft) AND (lft <= p.rgt)) Total runtime: 2.206 ms
Query Plan: Vertebrata
Limit (cost=81687.08..81687.33 rows=100 width=8) (actual time=475.811..475.829 rows=100 loops=1) -> Sort (cost=81687.08..82326.50 rows=255769 width=8) (actual time=475.809..475.817 rows=100 loops=1) Sort Key: u.lft Sort Method: top-N heapsort Memory: 29kB -> Nested Loop (cost=0.00..71911.77 rows=255769 width=8) (actual time=0.158..381.822 rows=264325 loops=1) -> Index Scan using usage_pkey on name_usage p (cost=0.00..10.68 rows=1 width=12) (actual time=0.075..0.077 rows=1 loops=1) Index Cond: (id = 44) -> Index Scan using nu_ckl_lft_rgt_idx on name_usage u (cost=0.00..71534.17 rows=20967 width=12) (actual time=0.077..323.519 rows=264325 loops=1) Index Cond: ((checklist_fk = p.checklist_fk) AND (lft >= p.lft) AND (lft <= p.rgt)) Total runtime: 475.951 ms
Again the tricky part was making Postgres use the lft/rgt index properly when using the order by. Removing the order by turns this into a super fast query even for the vertebrates (only 0.4 ms for either of the two). If we also use hardcoded values instead of a joined p table we can get even better performance than with the string path model. For example the vertebrate query would look like this:
SELECT u.id FROM name_usage u WHERE u.checklist_fk=1 and u.lft BETWEEN 517646 and 1046295 ORDER BY u.lft LIMIT 100 OFFSET 0;
Query Plan: Vertebrata
Limit (cost=0.00..337.88 rows=100 width=8) (actual time=0.080..0.395 rows=100 loops=1) -> Index Scan using nu_ckl_lft_rgt_idx on name_usage u (cost=0.00..1566342.54 rows=463573 width=8) (actual time=0.078..0.356 rows=100 loops=1) Index Cond: ((checklist_fk = 1) AND (lft >= 517646) AND (lft <= 1046295)) Total runtime: 0.480 ms
Query for Ancestors
The ancestor query does not use any limit as it only returns a few records and we always want all of them. All materialized paths already contain the ancestors - they are the path.
adjacency
A recursive CTE query:
WITH RECURSIVE a AS (
  SELECT id, parent_fk, rank_fk FROM name_usage WHERE id = 44
  UNION ALL
  SELECT p.id, p.parent_fk, p.rank_fk FROM a JOIN name_usage p ON a.parent_fk = p.id
)
SELECT * FROM a WHERE id!=44 ORDER BY rank_fk;
Query Plan: Abies
Sort (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.155..0.156 rows=5 loops=1) Sort Key: a.rank_fk Sort Method: quicksort Memory: 25kB CTE a -> Recursive Union (cost=0.00..1083.93 rows=101 width=12) (actual time=0.030..0.117 rows=6 loops=1) -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=12) (actual time=0.027..0.029 rows=1 loops=1) Index Cond: (id = 2684876) -> Nested Loop (cost=0.00..107.12 rows=10 width=12) (actual time=0.011..0.012 rows=1 loops=6) -> WorkTable Scan on a (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=6) -> Index Scan using usage_pkey on name_usage p (cost=0.00..10.68 rows=1 width=12) (actual time=0.007..0.008 rows=1 loops=6) Index Cond: (id = a.parent_fk) -> CTE Scan on a (cost=0.00..2.27 rows=100 width=12) (actual time=0.068..0.141 rows=5 loops=1) Filter: (id <> 2684876) Total runtime: 0.265 ms
Query Plan: Vertebrata
Sort (cost=1089.52..1089.77 rows=100 width=12) (actual time=0.104..0.104 rows=1 loops=1) Sort Key: a.rank_fk Sort Method: quicksort Memory: 25kB CTE a -> Recursive Union (cost=0.00..1083.93 rows=101 width=12) (actual time=0.040..0.077 rows=2 loops=1) -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=12) (actual time=0.037..0.040 rows=1 loops=1) Index Cond: (id = 44) -> Nested Loop (cost=0.00..107.12 rows=10 width=12) (actual time=0.013..0.015 rows=0 loops=2) -> WorkTable Scan on a (cost=0.00..0.20 rows=10 width=4) (actual time=0.001..0.001 rows=1 loops=2) -> Index Scan using usage_pkey on name_usage p (cost=0.00..10.68 rows=1 width=12) (actual time=0.008..0.009 rows=0 loops=2) Index Cond: (id = a.parent_fk) -> CTE Scan on a (cost=0.00..2.27 rows=100 width=12) (actual time=0.079..0.092 rows=1 loops=1) Filter: (id <> 44) Total runtime: 0.232 ms
The trees are not very deep, so even the genus query only has to do 5 recursive calls to return 5 parents.
ltree
select path FROM name_usage WHERE id=44;
The path contains all ancestor ids. Parsing it into separate ids within sql is not obvious at first glance though.
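One hedged sketch of how the parsing could be done, using standard ltree and array functions (illustrative only, not part of the benchmark):

-- split the ltree path of taxon 44 into one row per ancestor id
SELECT cast(unnest(string_to_array(ltree2text(path), '.')) AS int) AS ancestor_id
FROM name_usage
WHERE id = 44;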
intarray
SELECT mpath FROM name_usage WHERE id=44;
The path contains all ancestor ids. Iterating over each id entry is simple.
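For example, a minimal sketch expanding the integer array into rows:

-- one row per ancestor id of taxon 44
SELECT unnest(mpath) AS ancestor_id FROM name_usage WHERE id = 44;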
nested set
To find out all ancestors of a given node, we just select all nodes that contain its LFT boundary (which in a properly built hierarchy implies containing the RGT boundary too):
SELECT p.id, p.rank_fk, p.lft FROM name_usage u JOIN name_usage p ON u.lft BETWEEN p.lft and p.rgt AND u.checklist_fk=p.checklist_fk WHERE u.id=44 ORDER BY 3;
Query Plan: Abies
Sort (cost=100429.25..101068.67 rows=255769 width=12) (actual time=607.957..607.958 rows=6 loops=1) Sort Key: p.lft Sort Method: quicksort Memory: 25kB -> Nested Loop (cost=0.00..73083.96 rows=255769 width=12) (actual time=415.862..607.938 rows=6 loops=1) -> Index Scan using usage_pkey on name_usage u (cost=0.00..10.68 rows=1 width=8) (actual time=0.046..0.047 rows=1 loops=1) Index Cond: (id = 2684876) -> Index Scan using nu_ckl_lft_rgt_idx on name_usage p (cost=0.00..72706.36 rows=20967 width=20) (actual time=415.809..607.880 rows=6 loops=1) Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft >= lft) AND (u.lft <= rgt)) Total runtime: 608.034 ms
Query Plan: Vertebrata
Sort (cost=100429.25..101068.67 rows=255769 width=12) (actual time=38.428..38.428 rows=2 loops=1) Sort Key: p.lft Sort Method: quicksort Memory: 25kB -> Nested Loop (cost=0.00..73083.96 rows=255769 width=12) (actual time=0.111..38.410 rows=2 loops=1) -> Index Scan using usage_pkey on name_usage u (cost=0.00..10.68 rows=1 width=8) (actual time=0.022..0.023 rows=1 loops=1) Index Cond: (id = 44) -> Index Scan using nu_ckl_lft_rgt_idx on name_usage p (cost=0.00..72706.36 rows=20967 width=20) (actual time=0.084..38.378 rows=2 loops=1) Index Cond: ((checklist_fk = u.checklist_fk) AND (u.lft >= lft) AND (u.lft <= rgt)) Total runtime: 38.504 ms
Not very efficient.
Query for Children
adjacency
A perfect fit:
SELECT u.id FROM name_usage u WHERE parent_fk=44 ORDER BY u.id LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=640.97..641.22 rows=100 width=4) (actual time=0.566..0.623 rows=100 loops=1) -> Sort (cost=640.97..641.67 rows=279 width=4) (actual time=0.564..0.588 rows=100 loops=1) Sort Key: id Sort Method: quicksort Memory: 32kB -> Index Scan using usage_parent_idx on name_usage u (cost=0.00..630.31 rows=279 width=4) (actual time=0.046..0.312 rows=167 loops=1) Index Cond: (parent_fk = 2684876) Total runtime: 0.696 ms
Query Plan: Vertebrata
Limit (cost=16863.12..16863.37 rows=100 width=4) (actual time=8.783..8.811 rows=100 loops=1) -> Sort (cost=16863.12..16881.75 rows=7454 width=4) (actual time=8.780..8.793 rows=100 loops=1) Sort Key: id Sort Method: top-N heapsort Memory: 29kB -> Index Scan using usage_parent_idx on name_usage u (cost=0.00..16578.23 rows=7454 width=4) (actual time=0.060..5.169 rows=6083 loops=1) Index Cond: (parent_fk = 44) Total runtime: 8.867 ms
ltree
Search for all descendants that have one more node:
WITH p AS (
  SELECT path FROM name_usage WHERE id=2684876
)
SELECT u.id FROM name_usage u, p
WHERE u.path <@ p.path and nlevel(u.path)=nlevel(p.path)+1
ORDER BY u.path LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=56078.37..56078.56 rows=75 width=124) (actual time=4.118..4.161 rows=100 loops=1) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=120) (actual time=0.029..0.031 rows=1 loops=1) Index Cond: (id = 2684876) -> Sort (cost=56067.69..56067.88 rows=75 width=124) (actual time=4.116..4.134 rows=100 loops=1) Sort Key: u.path Sort Method: quicksort Memory: 48kB -> Nested Loop (cost=2099.42..56065.36 rows=75 width=124) (actual time=1.230..3.241 rows=167 loops=1) Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1)) -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32) (actual time=0.037..0.040 rows=1 loops=1) -> Bitmap Heap Scan on name_usage u (cost=2099.42..55729.91 rows=14908 width=124) (actual time=1.175..2.274 rows=609 loops=1) Recheck Cond: (path <@ p.path) -> Bitmap Index Scan on nu_path_idx (cost=0.00..2095.69 rows=14908 width=0) (actual time=1.146..1.146 rows=609 loops=1) Index Cond: (path <@ p.path) Total runtime: 4.321 ms
Query Plan: Vertebrata
Limit (cost=56078.37..56078.56 rows=75 width=124) (actual time=600.793..600.816 rows=100 loops=1) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=120) (actual time=0.048..0.051 rows=1 loops=1) Index Cond: (id = 44) -> Sort (cost=56067.69..56067.88 rows=75 width=124) (actual time=600.792..600.805 rows=100 loops=1) Sort Key: u.path Sort Method: top-N heapsort Memory: 32kB -> Nested Loop (cost=2099.42..56065.36 rows=75 width=124) (actual time=112.999..595.738 rows=6083 loops=1) Join Filter: (nlevel(u.path) = (nlevel(p.path) + 1)) -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32) (actual time=0.053..0.057 rows=1 loops=1) -> Bitmap Heap Scan on name_usage u (cost=2099.42..55729.91 rows=14908 width=124) (actual time=112.924..413.889 rows=264325 loops=1) Recheck Cond: (path <@ p.path) -> Bitmap Index Scan on nu_path_idx (cost=0.00..2095.69 rows=14908 width=0) (actual time=107.524..107.524 rows=264325 loops=1) Index Cond: (path <@ p.path) Total runtime: 600.980 ms
intarray
Search for an array that contains the parent id, but has an array size one greater than its parent:
WITH p AS (
  SELECT mpath FROM name_usage WHERE id=44
)
SELECT u.id FROM name_usage u, p
WHERE u.mpath @@ '44' and #u.mpath = #p.mpath+1
ORDER BY u.mpath LIMIT 100 OFFSET 0;
Query Plan: Abies
Limit (cost=53976.90..53977.09 rows=75 width=56) (actual time=2.026..2.056 rows=100 loops=1) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=52) (actual time=0.018..0.019 rows=1 loops=1) Index Cond: (id = 2684876) -> Sort (cost=53966.22..53966.41 rows=75 width=56) (actual time=2.024..2.042 rows=100 loops=1) Sort Key: u.mpath Sort Method: quicksort Memory: 48kB -> Hash Join (cost=183.57..53963.89 rows=75 width=56) (actual time=0.341..1.352 rows=167 loops=1) Hash Cond: ((# u.mpath) = (# (p.mpath + 1))) -> Bitmap Heap Scan on name_usage u (cost=183.54..53851.29 rows=14908 width=56) (actual time=0.292..0.874 rows=609 loops=1) Recheck Cond: (mpath @@ '2684876'::query_int) -> Bitmap Index Scan on nu_mpath_idx (cost=0.00..179.81 rows=14908 width=0) (actual time=0.279..0.279 rows=609 loops=1) Index Cond: (mpath @@ '2684876'::query_int) -> Hash (cost=0.02..0.02 rows=1 width=32) (actual time=0.038..0.038 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32) (actual time=0.022..0.024 rows=1 loops=1) Total runtime: 2.144 ms
Query Plan: Vertebrata
Limit (cost=53976.90..53977.09 rows=75 width=56) (actual time=501.304..501.327 rows=100 loops=1) CTE p -> Index Scan using usage_pkey on name_usage (cost=0.00..10.68 rows=1 width=52) (actual time=0.075..0.078 rows=1 loops=1) Index Cond: (id = 44) -> Sort (cost=53966.22..53966.41 rows=75 width=56) (actual time=501.303..501.316 rows=100 loops=1) Sort Key: u.mpath Sort Method: top-N heapsort Memory: 32kB -> Hash Join (cost=183.57..53963.89 rows=75 width=56) (actual time=116.159..495.063 rows=6083 loops=1) Hash Cond: ((# u.mpath) = (# (p.mpath + 1))) -> Bitmap Heap Scan on name_usage u (cost=183.54..53851.29 rows=14908 width=56) (actual time=116.030..391.099 rows=264325 loops=1) Recheck Cond: (mpath @@ '44'::query_int) -> Bitmap Index Scan on nu_mpath_idx (cost=0.00..179.81 rows=14908 width=0) (actual time=111.387..111.387 rows=264325 loops=1) Index Cond: (mpath @@ '44'::query_int) -> Hash (cost=0.02..0.02 rows=1 width=32) (actual time=0.101..0.101 rows=1 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> CTE Scan on p (cost=0.00..0.02 rows=1 width=32) (actual time=0.082..0.086 rows=1 loops=1) Total runtime: 501.517 ms
nested set
A difficult task. Will leave this as a challenge for some later time.
Summary
The best performing queries do not all come from one model. The adjacency model is unbeatable for listing children, and with recursive queries in not too deep taxonomic trees it also performs very well for getting all ancestors.
For listing descendants the winners are queries that can use an ordered btree index. I could only get this to work with hardcoded paths, so it is not useful like this for dynamic, single-statement queries. If you can issue 2 separate queries in code though, nested sets or the string materialized path are ideal candidates for retrieving descendants (see the sketch below). Because of the ordering required for paging they can use the btree efficiently, while all the others produce the full list of descendants first and then order it. Intarray is the quickest in that field, but nested sets and ltree perform rather similarly. It remains to be seen whether an additional btree index could drastically improve the ordering of ltree and intarray.
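To make the two-query approach concrete, here is a hedged sketch for the nested set model; it simply splits the join out of the earlier query, reusing the hardcoded Vertebrata boundaries from above purely as an illustration. In application code the result of the first query would be bound into the second:

-- query 1: fetch the boundaries of the parent taxon (44 = Vertebrata, checklist 1 = GBIF Backbone)
SELECT checklist_fk, lft, rgt FROM name_usage WHERE id = 44;

-- query 2: page through descendants using the btree on (checklist_fk, lft, rgt)
SELECT u.id FROM name_usage u
WHERE u.checklist_fk = 1 AND u.lft BETWEEN 517646 AND 1046295
ORDER BY u.lft LIMIT 100 OFFSET 0;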