How does PostgreSQL's citext type affect string comparisons made using LOWER()?

https://dba.stackexchange.com/questions/192204

10-10-2020

Question

The impetus for my question is that I had hoped that PostgreSQL would behave consistently when selecting from citext columns, regardless of whether or not the string to be matched is wrapped in one or more instances of lower() (any such wrapping is beyond my control). That appears not to be the case. (Of course, it is entirely possible that my tests are invalid or I am misunderstanding fundamental concepts.)

Steps to Reproduce Testing Scenario

CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users (id int, email citext);
INSERT INTO users(id, email) VALUES
  (1, 'USER@example.com');

Tests

As expected when using the citext type, the lowercase variant yields a result:

# select * from users where email = 'user@example.com';
 id |      email
----+------------------
  1 | USER@example.com
(1 row)

Changing the = operator to like yields a result:

select * from users where email like lower('user@example.com');
 id |      email
----+------------------
  1 | USER@example.com
(1 row)

As does the "inverse":

# select * from users where lower(email) = 'user@example.com';
 id |      email
----+------------------
  1 | USER@example.com
(1 row)

As does wrapping both values in lower():

# select * from users where lower(email) = lower('user@example.com');
 id |      email
----+------------------
  1 | USER@example.com
(1 row)

My Question

Why then does the following query not return a result in this instance?

# select * from users where email = lower('user@example.com');
 id | email
----+-------
(0 rows)

The manual says of the citext type:

Essentially, it internally calls lower when comparing values.

The operative word seems to be "essentially"; this statement implies the following, which does yield a result:

select * from users where lower(email) = lower(lower('user@example.com'));
 id |      email
----+------------------
  1 | USER@example.com
(1 row)

Might this be related to the following caveat in the Limitations section of the above-cited document?

citext's case-folding behavior depends on the LC_CTYPE setting of your database.

# SHOW LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

Any explanation in this regard is much appreciated.

Solution

TL;DR: when comparing case-insensitive and case-sensitive values for equality, you have to be explicit. text is explicitly case-sensitive; citext is explicitly case-insensitive. Provide explicit casts so that both sides of the comparison have the type you intend.

A few things about lower()

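lower() is typed.
When its argument is text, it always returns text. There is no citext variant of lower(), so a citext argument is implicitly cast to text first and the result is still text.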

A few other points

When you compare against a string literal, the literal's type isn't known at first (internally it is explicitly typed as unknown).
Operators are functions.
PostgreSQL resolves the operand types, applying implicit casts where needed, when it parses the query.
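
As a rough check of the points above, pg_typeof() can report the type PostgreSQL assigns to each expression. This is only a sketch that assumes the users table from the question; the column aliases are made up for illustration.

```sql
-- Inspect the resolved type of each expression used in the comparisons.
SELECT pg_typeof(email)                     AS col_type,         -- citext
       pg_typeof(lower(email))              AS lowered_col,      -- text
       pg_typeof('user@example.com')        AS bare_literal,     -- unknown
       pg_typeof(lower('user@example.com')) AS lowered_literal   -- text
FROM users;
```

Anything that passes through lower() comes out as text, which is why those comparisons no longer involve citext at all.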

In this case, the types on each side of the comparison are as follows, with a description of each:

-- text = unknown
-- unknown promoted to text, this has nothing to do with citext
lower(email) = 'user@example.com';

-- text = text
-- this has nothing to do with citext
lower(email) = lower('user@example.com');

-- text = text
-- this has nothing to do with citext    
lower(email) = lower(lower('user@example.com'));

-- citext LIKE text
-- citext defines its own `operator ~~(citext,text)`, implemented by `texticlike`
-- WORKS
email like lower('user@example.com');

-- citext = unknown
-- unknown promoted to citext, there is an `operator =(citext,citext)`
-- WORKS
email = 'user@example.com';

-- citext = text
-- there is no `operator =(citext,text)`, so citext is implicitly cast to text
-- and the case-sensitive `operator =(text,text)` is used
-- RETURNS NO ROWS
email = lower('user@example.com');
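
If you want to see which operators the citext extension actually provides, one way is to query the pg_operator catalog. This is a sketch rather than a definitive listing: it assumes the extension is installed in a schema on your search_path, and the exact rows may vary between PostgreSQL versions.

```sql
-- List the = and ~~ (LIKE) operators whose left operand is citext.
SELECT oprname,
       oprleft::regtype  AS left_type,
       oprright::regtype AS right_type,
       oprcode           AS implementing_function
FROM pg_operator
WHERE oprname IN ('=', '~~')
  AND oprleft = 'citext'::regtype
ORDER BY oprname, right_type;
```

You should find a ~~ operator that accepts text on the right but no = operator that does, which matches the behaviour described above.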

In summary, there is an operator =(citext,citext), so you can cast the right-hand side to citext:

email = lower('user@example.com')::citext;

Alternatively, you can define your own operator that sends = down the case-insensitive route rather than the case-sensitive one. I find that to be horrible practice, though; I always cast.
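
To compare the two forms side by side, here is a small sketch that assumes the users table and the single row from the question; the column aliases are only for illustration.

```sql
-- Count matches with and without the explicit cast to citext.
SELECT
  (SELECT count(*) FROM users
    WHERE email = lower('user@example.com'))         AS without_cast,  -- text = text
  (SELECT count(*) FROM users
    WHERE email = lower('user@example.com')::citext) AS with_cast;     -- citext = citext
```

With the sample data, without_cast should be 0 and with_cast should be 1.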

Licensed under: CC-BY-SA with attribution
