Indexing to improve performance of range queries
16-10-2019
Question
I've got a relation in a Postgres 8.2 database similar to this:
CREATE TABLE foo (
    foo_id varchar(160) NOT NULL,
    bar_id varchar(160) NOT NULL,
    created bigint NOT NULL,
    PRIMARY KEY (foo_id, bar_id)
);
Let's leave aside for the moment the fact that the PK is composite and uses varchars (not my choice, legacy product, etc. etc.).
Currently, our application never issues queries that contain created in a WHERE clause, so it's not indexed. However, we've got a new requirement that requires us to query on a range of created values. The proposed query is along the lines of:
SELECT * FROM foo WHERE foo_id IN (...) AND created > 1234 AND created <= 6789
The foo table is easily the largest in our application, but even then it's quite likely to have fewer than 50,000 rows even in the largest deployment, and there are rarely more than a dozen or so rows with the same foo_id.
My question is: should an index be added on the created column, considering that foo_id is part of the PK? If so, does it make sense to index only the created column, or to index on (foo_id, created)?
My EXPLAIN on the above statement shows that the PK is being used, and then a FILTER operation is being applied. Using test data, the performance seems fine. My concern is performance if the tables grow to a massive size.
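For reference, the plan was obtained with something along these lines (the values in the IN list here are just placeholders, not our real ids):

EXPLAIN ANALYZE
SELECT * FROM foo
WHERE foo_id IN ('a', 'b', 'c')
  AND created > 1234 AND created <= 6789;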
Thanks!
Solution
If you have an index on created then the planner will need to choose between using that index, the PK index, or a full table scan - it will not benefit from both indexes at the same time.
--EDIT
As pointed out by @jug in the comments below, this is not accurate at least since 8.1: the planner may choose to build two in-memory bitmaps and combine them to get the result set. This gets more expensive as the tables get bigger, so the planner may choose not to do this depending on the size of the table and the estimated cost of using one index and then filtering.
--END EDIT
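One way to check whether the planner does this on your data (a sketch; the index name is illustrative, and the IN list values are placeholders) is to create the index and inspect the plan:

CREATE INDEX foo_created_idx ON foo (created);
EXPLAIN SELECT * FROM foo
WHERE foo_id IN ('a', 'b') AND created > 1234 AND created <= 6789;

If the planner is combining the two indexes, the plan will contain a BitmapAnd node over two Bitmap Index Scans; otherwise you'll see a single index scan followed by a Filter step, as you observed.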
The new index will only be helpful if in some cases using it is more efficient than access via the PK. The kind of things that could make this likely include:
- A large number of values in the IN (...) list of SELECT * FROM foo WHERE foo_id IN (...) AND created > 1234 AND created <= 6789
- A small range, eg created > 6780 AND created <= 6790
Unless one or both is likely to happen, you should not create the secondary index. If they might, it would be best to test each scenario with and without the index to see whether any performance benefit is worth the cost (eg increased storage, and overhead on insert and update operations).
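A sketch of such a test (the index name is illustrative, and the IN list values are placeholders) is to compare EXPLAIN ANALYZE timings with and without the index:

EXPLAIN ANALYZE SELECT * FROM foo
WHERE foo_id IN ('a', 'b') AND created > 1234 AND created <= 6789;

CREATE INDEX foo_created_idx ON foo (created);
ANALYZE foo;  -- refresh planner statistics so the new index is costed properly
EXPLAIN ANALYZE SELECT * FROM foo
WHERE foo_id IN ('a', 'b') AND created > 1234 AND created <= 6789;

DROP INDEX foo_created_idx;  -- drop it again if the gain doesn't justify the write overhead

Run the comparison against realistic data volumes, since the planner's choice depends on table size and its cost estimates.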