Indexing to improve performance of range queries
16-10-2019
Question
I've got a relation in a Postgres 8.2 database similar to this:
CREATE TABLE foo (
    foo_id varchar(160) NOT NULL,
    bar_id varchar(160) NOT NULL,
    created bigint NOT NULL,
    PRIMARY KEY (foo_id, bar_id)
);
Let's leave aside for the moment the fact that the PK is composite and uses varchars (not my choice, legacy product, etc. etc.).
Currently, our application never issues queries that contain created in a WHERE clause, so it's not indexed. However, we've got a new requirement that requires us to query on a range of created values. The proposed query is along the lines of:
SELECT * FROM foo WHERE foo_id IN (...) AND created > 1234 AND created <= 6789
The foo table is easily the largest in our application, but even then it's quite likely to have fewer than 50,000 rows even in the largest deployment, and there are rarely more than a dozen or so rows with the same foo_id.
My question is: should an index be added on the created column, considering that foo_id is part of the PK? If so, does it make sense to index only the created column, or to index on (foo_id, created)?
My EXPLAIN on the above statement shows that the PK is being used, and then a FILTER operation is being applied. Using test data, the performance seems fine. My concern is performance if the tables grow to a massive size.
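For reference, the plan was obtained with something along these lines (the values in the IN list here are just placeholders, not our real ids):

EXPLAIN ANALYZE
SELECT * FROM foo
WHERE foo_id IN ('a', 'b', 'c')
  AND created > 1234 AND created <= 6789;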
Thanks!
Solution
If you have an index on created then the planner will need to choose between using that index, the PK index, or a full table scan - it will not benefit from both indexes at the same time.
--EDIT
As pointed out by @jug in the comments below, this is not accurate at least since 8.1: the planner may choose to build two in-memory bitmaps and combine them to get the result set. This gets more expensive as the tables get bigger, so the planner may choose not to do this depending on the size of the table and the estimated cost of using one index and then filtering.
--END EDIT
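One way to check whether the planner does this on your data (a sketch; the index name is illustrative, and the IN list values are placeholders) is to create the index and inspect the plan:

CREATE INDEX foo_created_idx ON foo (created);
EXPLAIN SELECT * FROM foo
WHERE foo_id IN ('a', 'b') AND created > 1234 AND created <= 6789;

If the planner is combining the two indexes, the plan will contain a BitmapAnd node over two Bitmap Index Scans; otherwise you'll see a single index scan followed by a Filter step, as you observed.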
The new index will only be helpful if in some cases using it is more efficient than access via the PK. The kind of things that could make this likely include:
- A large number of values in the IN (...) list of SELECT * FROM foo WHERE foo_id IN (...) AND created > 1234 AND created <= 6789
- A small range, eg created > 6780 AND created <= 6790
Unless one or both is likely to happen, you should not create the secondary index. If they might, it would be best to test each scenario with and without the index to see whether any performance benefit is worth the cost (eg increased storage, and overhead on insert and update operations).
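A sketch of such a test (the index name is illustrative, and the IN list values are placeholders) is to compare EXPLAIN ANALYZE timings with and without the index:

EXPLAIN ANALYZE SELECT * FROM foo
WHERE foo_id IN ('a', 'b') AND created > 1234 AND created <= 6789;

CREATE INDEX foo_created_idx ON foo (created);
ANALYZE foo;  -- refresh planner statistics so the new index is costed properly
EXPLAIN ANALYZE SELECT * FROM foo
WHERE foo_id IN ('a', 'b') AND created > 1234 AND created <= 6789;

DROP INDEX foo_created_idx;  -- drop it again if the gain doesn't justify the write overhead

Run the comparison against realistic data volumes, since the planner's choice depends on table size and its cost estimates.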