Question

Say I have a large table containing genomic positions from various files as follows:

CREATE TABLE chromosomal_positions (
    file_id  INT,
    chromosome_id INT, 
    position INT
)

I want to compare the contents of one file to the contents of all the other files, looking for overlaps. So I use two derived tables.

SELECT Count(*) 
FROM   (SELECT * 
        FROM   chromosomal_positions 
        WHERE  file_id = 1) AS file_1 
      JOIN (SELECT * 
            FROM   chromosomal_positions 
            WHERE  file_id != 1) AS other_files 
         ON ( file_1.chromosome_id = other_files.chromosome_id 
              AND file_1.position = other_files.position ) 

Now if I add a compound index on file_id, chromosome_id , position in that order, will the derived tables be able to use that index? (Using Postgres)


Solution

It's not so much that PostgreSQL "preserves" indexes across subqueries; rather, the planner can often flatten and restructure your query so that it operates on the base table directly.

In this case the query is unnecessarily complicated; the subqueries can be entirely eliminated, making this a trivial self-join.

SELECT count(*) 
FROM  chromosomal_positions file_1 
INNER JOIN chromosomal_positions other_files
ON ( file_1.chromosome_id = other_files.chromosome_id 
     AND file_1.position = other_files.position ) 
WHERE file_1.file_id = 1
AND   other_files.file_id != 1;

so an index on (chromosome_id, position) would be clearly useful here.

You can experiment with index choices and usage as you go, using EXPLAIN ANALYZE to determine what the query planner is actually doing. For example, to see whether the derived tables can use a compound index on (file_id, chromosome_id, position), I would run:

CREATE INDEX cp_f_c_p ON chromosomal_positions(file_id, chromosome_id , position);

-- Planner would prefer seqscan because there's not really any data;
-- force it to prefer other plans.
SET enable_seqscan = off;

EXPLAIN SELECT count(*) 
FROM (
  SELECT * 
  FROM   chromosomal_positions 
  WHERE  file_id = 1
) AS file_1 
INNER JOIN (
  SELECT * 
  FROM   chromosomal_positions 
  WHERE  file_id != 1
) AS other_files 
ON ( file_1.chromosome_id = other_files.chromosome_id 
     AND file_1.position = other_files.position ) 

and get the result:

                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=78.01..78.02 rows=1 width=0)
   ->  Hash Join  (cost=29.27..78.01 rows=1 width=0)
         Hash Cond: ((chromosomal_positions_1.chromosome_id = chromosomal_positions.chromosome_id) AND (chromosomal_positions_1."position" = chromosomal_positions."position"))
         ->  Bitmap Heap Scan on chromosomal_positions chromosomal_positions_1  (cost=14.34..48.59 rows=1930 width=8)
               Filter: (file_id <> 1)
               ->  Bitmap Index Scan on cp_f_c_p  (cost=0.00..13.85 rows=1940 width=0)
         ->  Hash  (cost=14.79..14.79 rows=10 width=8)
               ->  Bitmap Heap Scan on chromosomal_positions  (cost=4.23..14.79 rows=10 width=8)
                     Recheck Cond: (file_id = 1)
                     ->  Bitmap Index Scan on cp_f_c_p  (cost=0.00..4.23 rows=10 width=0)
                           Index Cond: (file_id = 1)
(11 rows)


This shows that while the planner will use the index, it only uses the first column: it filters on file_id and ignores the rest. So the following index is just as good, and is smaller and cheaper to maintain:

CREATE INDEX cp_file_id ON chromosomal_positions(file_id);

Sure enough, if you create this index Pg will prefer it. So no, the index you propose does not appear to be useful, unless the planner thinks it's just not worth using at this data scale, and might choose to use it in a completely different plan with more data. You really have to test on the real data to be sure.
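If the real data isn't at hand, one way to test at a more realistic scale is to fill the table with synthetic rows before comparing plans. This is only a sketch: the row count, the 50 files, the 24 chromosomes and the position range are arbitrary assumptions, not values from the question, and real genomic data will be distributed differently.

```sql
-- Hypothetical test data: 1,000,000 rows; all ranges below are assumptions.
INSERT INTO chromosomal_positions (file_id, chromosome_id, position)
SELECT (random() * 49)::int + 1,       -- file_id in 1..50
       (random() * 23)::int + 1,       -- chromosome_id in 1..24
       (random() * 1000000)::int       -- position in 0..1000000
FROM generate_series(1, 1000000);

-- Refresh planner statistics so the estimates reflect the new rows.
ANALYZE chromosomal_positions;

-- Undo the earlier override so the planner can choose freely.
RESET enable_seqscan;

-- EXPLAIN ANALYZE runs the query and reports actual row counts and timings.
EXPLAIN ANALYZE
SELECT count(*)
FROM   chromosomal_positions file_1
INNER JOIN chromosomal_positions other_files
ON ( file_1.chromosome_id = other_files.chromosome_id
     AND file_1.position = other_files.position )
WHERE file_1.file_id = 1
AND   other_files.file_id != 1;
```

Re-running the EXPLAIN ANALYZE after creating or dropping each candidate index shows which one the planner actually picks at that scale.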

By contrast, my proposed index:

CREATE INDEX cp_ci_p ON chromosomal_positions (chromosome_id, position);

is used to find the matching chromosomal positions for file_id = 1, at least on an empty dummy data set. I suspect the planner would avoid a nested loop on a bigger data set than this, though. So again, you really just have to try it and see.

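(BTW, if the planner is forced to materialize a subquery then it does not "preserve indexes on derived tables", i.e. materialized subqueries. This is particularly relevant for WITH (CTE) query terms, which were always materialized before PostgreSQL 12. From version 12 on, a non-recursive, side-effect-free CTE that is referenced only once can be inlined into the outer query, unless it is declared AS MATERIALIZED.)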
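On PostgreSQL 12 and later, the effect of materialization can be seen directly by writing the file_1 derived table as a CTE and toggling the MATERIALIZED keyword. A minimal sketch, assuming the same table as above and a server version that accepts this syntax:

```sql
-- Requires PostgreSQL 12 or later for the [NOT] MATERIALIZED keywords.
EXPLAIN
WITH file_1 AS MATERIALIZED (      -- forces the CTE to be computed on its own
  SELECT chromosome_id, position
  FROM   chromosomal_positions
  WHERE  file_id = 1
)
SELECT count(*)
FROM   file_1
INNER JOIN chromosomal_positions other_files
ON ( file_1.chromosome_id = other_files.chromosome_id
     AND file_1.position = other_files.position )
WHERE  other_files.file_id != 1;
```

With MATERIALIZED the plan contains a "CTE Scan on file_1" node, and no index can be used on the CTE's result. Replacing it with NOT MATERIALIZED lets the planner fold the CTE into the outer query, as it does with the plain derived tables.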

Licensed under: CC-BY-SA with attribution