
I have an animal table with a name varchar(255), and I've added rows with the following values:

__Starts With 2
Starts With 1
_Starts With 1
_Starts With 1

When I run this query:

zoology=# SELECT name FROM animal ORDER BY name;
_Starts With 1
_Starts With 1
Starts With 1
__Starts With 2
(8 rows)

Notice how the rows are sorted in an order that implies the leading _ is used to place the _Starts With 1 rows before the Starts row, but the __ in the __Starts With 2 seems to ignore this fact, as if the 2 at the end is more important than the first two characters.

Why is this?

If I sort with Python, the result is:

In  [2]: for animal in sorted(animals):
   ....:     print animal
Starts With 1
_Starts With 1
_Starts With 1
__Starts With 2

Furthermore, the Python ordering suggests that underscores come after letters, which would mean that Postgres's sorting of the first two _Starts rows before the Starts row is incorrect.

Note: I'm using Postgres 9.1.15

Here are my attempts at finding the collation:

zoology=# select datname, datcollate from pg_database;
  datname  | datcollate  
-----------+-------------
 template0 | en_US.UTF-8
 postgres  | en_US.UTF-8
 template1 | en_US.UTF-8
 zoology   | en_US.UTF-8
(4 rows)


zoology=# select table_schema,
    table_name,
    column_name,
    collation_name
from information_schema.columns
where collation_name is not null
order by table_schema;
 table_schema | table_name | column_name | collation_name 
--------------+------------+-------------+----------------
(0 rows)


Since you haven't defined a different collation for the column in question, it uses the database-wide one, which is en_US.UTF-8, just like on my test box. I observe exactly the same behaviour; take it as a consolation :)

What we see is apparently a case of variable collation elements. Depending on the character and the collation, a number of different behaviours are possible. Here the underscore (and the hyphen and some others, too) is used only for breaking ties: 'a' and '_a' are equivalent in the first round of comparison, and the tie between them is then resolved by taking the underscore into account. That is why __Starts With 2 sorts last: in the first round it compares as Starts With 2, which comes after Starts With 1.
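To make the two-round idea concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not what the en_US.UTF-8 collation actually does internally: the set of ignored characters and the helper name two_round_key are made up for illustration, and the tie-break simply falls back to code point order.

```python
# Assumption: these characters are ignored in the first comparison round.
IGNORED_AT_FIRST_LEVEL = "_-?!"
_STRIP = {ord(c): None for c in IGNORED_AT_FIRST_LEVEL}


def two_round_key(s):
    """Toy sort key: (string without ignored characters, full string).

    The first tuple element models the first round, where ignored
    characters do not count. The second element only breaks ties.
    """
    return (s.translate(_STRIP), s)


values = ["_b", "a", "_a", "b"]

# Plain code point order keeps all underscore-prefixed values together.
by_code_point = sorted(values)

# The two-round key interleaves them with their underscore-free twins.
by_two_rounds = sorted(values, key=two_round_key)

print(by_code_point)  # ['_a', '_b', 'a', 'b']
print(by_two_rounds)  # ['_a', 'a', '_b', 'b']
```

In the toy model '_b' lands after 'a' even though it starts with an underscore, which is the same effect as __Starts With 2 landing after Starts With 1 in the question.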

If you want to sort while ignoring underscores (and hyphens, question marks and exclamation marks in my example), you can define an ordering on an expression:

SELECT val
FROM (VALUES ('a'), ('_a')  -- illustrative sample values
     ) t (val) 
ORDER BY translate(val, '_-?!', '');

In my experiments, adding a new value to the list often changes the order between otherwise equal items, which shows that they really are treated as equal.


Python's sorted function compares strings element-wise by their Unicode code point numbers, without considering the collation rules of the locale that is active in your environment.
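If you do want Python to follow a locale's collation rules, the standard library's locale.strxfrm can be used as a sort key. The following is a minimal sketch; the helper name sorted_with_collation is hypothetical, and whether a locale such as "en_US.UTF-8" is available depends on what is installed on your system, so only the always-present "C" locale is exercised here.

```python
import locale


def sorted_with_collation(strings, locale_name):
    """Sort strings using the LC_COLLATE rules of the given locale.

    The previous LC_COLLATE setting is restored afterwards. Raises
    locale.Error if the requested locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(strings, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


names = ["__Starts With 2", "Starts With 1", "_Starts With 1", "_Starts With 1"]

# The "C" locale gives the same result as the plain sorted(names) call.
c_order = sorted_with_collation(names, "C")
print(c_order)
```

Calling sorted_with_collation(names, "en_US.UTF-8") on a machine where that locale is installed should let you compare the result against what Postgres returns, although the two are only guaranteed to agree when both rely on the same C library collation data.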

Note that the Unicode code point numbers of the ASCII characters equal the ASCII code numbers. And in ASCII the characters A-Z are ordered before _ which is ordered before a-z; while digits 0-9 are ordered before A-Z.

In other words, when dealing with ASCII strings, the Python string ordering equals the byte-wise lexicographic ordering.
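A quick way to check this yourself is the short sketch below; the sample characters are arbitrary, and the list of names is the one from the question.

```python
# A few ASCII characters from each group: digits, upper case, underscore, lower case.
samples = ["a", "_", "A", "0", "Z", "z", "9"]
ordered = sorted(samples)
code_points = [ord(c) for c in ordered]

print(ordered)      # ['0', '9', 'A', 'Z', '_', 'a', 'z']
print(code_points)  # [48, 57, 65, 90, 95, 97, 122]

animals = ["__Starts With 2", "Starts With 1", "_Starts With 1", "_Starts With 1"]

# Sorting the str objects and sorting their encoded bytes agree for ASCII input.
by_code_point = sorted(animals)
by_bytes = sorted(animals, key=lambda s: s.encode("ascii"))

print(by_code_point == by_bytes)  # True
```

Because 'S' (83) sorts before '_' (95), Starts With 1 comes first in this ordering, which matches the Python output shown in the question.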

You get the same ordering in Postgres by specifying the C locale collation rules with a collation clause like this:

SELECT name FROM animal ORDER BY name COLLATE "C";

Note that the collation rules of other locales can be quite non-intuitive and complicated, e.g. because they may treat a sequence of several characters as a single unit during comparison.

Licensed under: CC-BY-SA with attribution