Question

table:

CREATE TABLE msp_adm_munic_complet_g_01
(
  nom_tri character varying(64),
  ogc_fid serial NOT NULL
)

index:

CREATE INDEX idx_gist_msp_adm_munic_complet_g_nom_tri
  ON msp_adm_munic_complet_g_01
  USING gist
  (nom_tri COLLATE pg_catalog."default" gist_trgm_ops);

Query:

select * from msp_adm_munic_complet_g_01
ORDER BY 'potato'<->nom_tri
LIMIT 25;

The question:

Why does the query go through the index with the combination of ORDER BY + LIMIT, but not when the query contains only the ORDER BY?

Of course, with the index it increases the speed of the query

The only explanation I found was here: http://www.postgresql.org/docs/9.1/static/indexes-ordering.html

but it lacks details

EDIT #1 :

query plan with LIMIT :

Limit  (cost=0.00..19.27 rows=25 width=590)
  ->  Index Scan using idx_gist_msp_adm_munic_complet_g_nom_tri on
msp_adm_munic_complet_g_01  (cost=0.00..2784.49 rows=3612 width=590)
      Order By: ((nom_tri)::text <-> 'potato'::text)

query plan with NO LIMIT :

Sort  (cost=1847.59..1856.62 rows=3612 width=590)
  Sort Key: (('potato'::text <-> (nom_tri)::text))
  ->  Seq Scan on msp_adm_munic_complet_g_01  (cost=0.00..682.15 rows=3612 width=590)

Solution

Of course, with the index it increases the speed of the query

I think this is the nub of the problem. There is no "of course" about it.

Imagine you have a large book. The book has an index in the back, listing the different terms and the page numbers they appear on.

Your boss comes to you and says "I want you to make a list of the first 10 terms in the book, alphabetically, and write down everything about them". You might start with the index, then go to each of the pages listed against the first 10 terms you found. It wouldn't take very long. Especially not compared to the alternative of reading the whole book and trying to sort it in your head, then find the first 10.

Next your boss comes to you and says he wants you to list all the terms in the book with their definitions, in alphabetical order. Naively, you decide to use the same approach. You would be constantly flipping back and forth through the book, revisiting each page many times. It would take forever.

By the time you had finished, you would have read the entire index AND visited each page in the book many times. It would have been quicker just to read through the book, cover to cover, sorting the contents as you went (especially if you were a database, which has a much larger short term memory than a human and can easily sort large lists in its memory).

This is exactly what happens in a database. It is more efficient for a computer to read disk files sequentially, since it doesn't have to seek the disk heads back and forth so much, and it reads whole pages at a time. It has some advantages over us humans there as well - the enormous short term memory means it can keep thousands of pages in its memory at once. But a large table and/or a heavy workload will defeat that.

So the database analyses each query before it executes it. It estimates what proportion of the table will be returned, and weighs that against what it knows about the cost of accessing pages randomly vs. sequentially, along with table statistics about the distribution of values. There is a point at which it decides it would be more efficient to scan the whole table and forget the index.
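You can see that tipping point in the two plans you posted. The planner prorates the index scan's total cost by the fraction of rows the LIMIT needs - a rough back-of-the-envelope check (the planner's actual cost model is more involved):

-- Index scan total cost: 2784.49 for all 3612 rows.
-- With LIMIT 25, only ~25/3612 of that cost is expected to be paid:
--   2784.49 * (25 / 3612) ≈ 19.27   (the Limit node's cost)
-- With no LIMIT, the planner compares 2784.49 (full index scan)
-- against 1856.62 (seq scan at 682.15 plus an in-memory sort)
-- and picks the cheaper seq scan + sort.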

You might be thinking that this simplistic analogy does not apply to trigram indices, but it does. The index is not alphabetic, but the mechanism for building a sorted list is the same - except that not all index types are suitable for returning sorted rows in any case. Many index types allow you to find something quickly, but do not maintain the order of the keys. Of the built-in index types, only b-tree is suitable for returning data sorted by the column value. I am actually mildly surprised that the trigram index would be usable for this. But it depends on the ORDER BY expression - I guess this index does return data in <-> order.
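To make the distinction concrete, here is a sketch (the b-tree index name is illustrative) of which ORDER BY each index type can serve:

-- A b-tree index can return rows in plain alphabetical order:
CREATE INDEX idx_btree_nom_tri ON msp_adm_munic_complet_g_01 (nom_tri);
-- can serve:  ORDER BY nom_tri

-- Your GiST trigram index cannot serve ORDER BY nom_tri, but it can
-- serve ordering by the <-> distance operator (a nearest-neighbour
-- search, available for GiST since PostgreSQL 9.1):
-- can serve:  ORDER BY nom_tri <-> 'potato'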

If iterating through the rows in sorted order is a common operation on this table, there are things you can do to make it go faster.

If you were using PostgreSQL 9.2, you might be able to make use of index-only scans. In your query you are selecting all columns, which means it could not use an index-only scan, and in any case I don't think you would be able to use index-only scans with a trigram index.

You can use the CLUSTER command to arrange the table in the same sequence as the index (although it won't stay that way when data is inserted or updated, so it needs to be done periodically on a table that is updated often).
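For example (note that CLUSTER takes an exclusive lock on the table while it rewrites it, and I believe GiST indexes are clusterable, though b-tree is the usual choice):

CLUSTER msp_adm_munic_complet_g_01
  USING idx_gist_msp_adm_munic_complet_g_nom_tri;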

You might find that the table would benefit from fine tuning the statistics that are being kept on it. More statistics might get it to use the index more frequently.
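One way to do that (the target value 1000 here is just an illustrative starting point; the default is typically 100):

ALTER TABLE msp_adm_munic_complet_g_01
  ALTER COLUMN nom_tri SET STATISTICS 1000;
ANALYZE msp_adm_munic_complet_g_01;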

You can tune the parameters that the planner uses to estimate the relative cost of sequentially reading data compared to randomly accessing it. And you could switch to using solid state disks instead of old-school rotating disks.
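The main knob here is random_page_cost, which defaults to 4.0 and models rotating disks. On SSDs, or when the table mostly fits in cache, a lower value makes random (index) access look relatively cheaper to the planner - the value 1.5 below is a common starting point, not a universal recommendation:

SET random_page_cost = 1.5;  -- try per-session first
-- then, if it helps, make it persistent in postgresql.conf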

And of course more RAM never hurt a database.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow