Does PostgreSQL support ICU collation's options and settings?

https://dba.stackexchange.com/questions/191905

10-10-2020
|

Question

ICU Specifies different LDML Collation Settings. Some of them seem pretty interesting, especially the ones on case and accent,

“Ignore accents”: strength=primary

“Ignore accents” but take case into account: strength=primary caseLevel=on

“Ignore case”: strength=secondary

“Ignore punctuation” (completely): strength=tertiary alternate=shifted

“Ignore punctuation” but distinguish among punctuation marks: strength=quaternary alternate=shifted potentially a better method of doing what

You can also see these documented here. Are these ICU options and settings possible with PostgreSQL 10 ICU collation support?

CREATE COLLATION special (provider = icu, locale = 'en@strength=primary');
SELECT 'Å' LIKE 'A' COLLATE "special"; # returns false

I've also tried the CLDR BCP47

Starting with ICU 54, collation attributes can be specified via locale keywords as well, in the old locale extension syntax ("el@colCaseFirst=upper") or in language tag syntax ("el-u-kf-upper"). Keywords and values are case-insensitive. See the LDML Collation spec, Collation Settings, and the data file listing the valid collation keywords and their values. (The deprecated attributes kh/colHiraganaQuaternary and vt/variableTop are not supported.)

For that, this looked right

CREATE COLLATION special (provider = icu, locale = 'en-ks-level1');
SELECT 'Å' LIKE 'A' COLLATE "special"; # returns false

Also tried en-u-ks-level1 That method seems to be what the docs go for,

CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');

Solution

Case insensitive or accent-insensitive collations cannot be used prior to PostgreSQL 12, because internally PostgreSQL considers that strings with a different binary representation are not equal. When the collation-aware comparator says they are equal, it uses the non-collation-aware strcmp() function as a tie-breaker, to obtain what Unicode calls a "deterministic" comparison without normalization.

Starting with PostgreSQL 12, collations have a deterministic property, which must be set to false to benefit from equality of non-binary equal strings. From CREATE COLLATION:

DETERMINISTIC

Specifies whether the collation should use deterministic comparisons. The default is true. A deterministic comparison considers strings that are not byte-wise equal to be unequal even if they are considered logically equal by the comparison. PostgreSQL breaks ties using a byte-wise comparison. Comparison that is not deterministic can make the collation be, say, case- or accent-insensitive. For that, you need to choose an appropriate LC_COLLATE setting and set the collation to not deterministic here.

Nondeterministic collations are only supported with the ICU provider.

The exact query in the question is not possible with nondeterministic collations because they don't support LIKE or any form of pattern matching (as of PostgreSQL 12). Also the locale should have a -u- before the collation subtags, otherwise they're going to be silently ignored by the collator.

What does work as intended:

=# CREATE COLLATION special (
     provider = icu, locale = 'en-u-ks-level1', deterministic=false
   );

=# SELECT 'Å' = 'A' COLLATE "special";
 ?column? 
----------
 t

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange