Question

Let's say I have a table with 1 000 000 records. Structure of table is:

create table individual (     
  id serial primary key,
  surname varchar(128),
  "name" varchar(128),
  patronymic varchar(128),
  birth_dt date
)

I create a composite index.

INDEX 1
create index on individual using btree (upper(surname) varchar_pattern_ops
  , upper("name") varchar_pattern_ops, patronymic varchar_pattern_ops, birth_dt);

Docs state that varchar_pattern_ops should be applied when using LIKE or pattern match in query. Conclusion: this index will not be used in query below, even it gets only 10 row from 1 000 000.

QUERY 1
select * from individual
order by upper(surname), upper("name"), upper(patronymic), birth_dt limit 10;

and even more, docs recommends to create an index without varchar_pattern_ops as well.

INDEX 2
create index on individual using btree
(upper(surname), upper("name"), upper(patronimyc), birth_dt);

Then a query using LIMIT will use this index.

I found a cheat to force Postgres to use first index on Postgres users forum. It is operator ~<~.

QUERY 2
select * from individual
order by upper(surname) using ~<~
       , upper("name") using ~<~
       , upper(patronymic) using ~<~
       , birth_dt limit 100;

In this case INDEX 1 will be used even if INDEX 2 doesn't exists. I tried to investigate to discover why it happens, but failed.

There are some system tables like pg_operator, which (I think) link operator ~<~ to some functions that most probably uses LIKE or regular expressions.

I ran QUERY 1 and QUERY 2 a few times and compared result manually. It looks like operator ~<~ gives correct result, but I didn't risk anything and just create a normal index anyway.

I am still interested how the Postgres planner decides which index to use index where it meets the operator ~<~ in QUERY 1.

Was it helpful?

Solution

This revolves around COLLATION rules (which are part of your locale). Per documentation:

The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values.

Bold emphasis mine. These internal operators:

~<~, ~<=~, ~>=~, ~>~

are members of the operator classes text_pattern_ops, varchar_pattern_ops and bpchar_pattern_ops which ignore current collation rules and default to "traditional C" behavior. Per documentation:

The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules.

That's why they can use your first index, while their locale-aware equivalents

<, <=, >=, >

would need the second (using the default operator class).

ORDER BY defaults to the current collation rules. Unless you happen to work with the "C" collation the second set of operators is used for sorting, which can only use indexes with a matching collation (default btree indexes).

Notably, pattern-matching generally relies on "traditional C" behavior, that's why a left anchored LIKE match can use a _patern_ops index with a non-C locale, but not a default index. Compare (last chapter):

In Postgres 9.1+ you can force a selected sort order with COLLATE "locale_name" (the locale has to be available in your system!). So if you run your query with COLLATE "C" (which is a much better option than the evil hack using ~<~) , it can use the _pattern_ops index:

SELECT * FROM individual
ORDER  BY upper(surname) COLLATE "C"
        , upper("name") COLLATE "C"
        , upper(patronymic) COLLATE "C"
        , birth_dt
LIMIT  100;

Results are different. Try with non-ASCII letters like ö, é, etc. Those typically sort depending on your current collation rules, while letters a-z typically sort the same in all locales.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top