How does PostgreSQL's citext type affect string comparisons made using LOWER()?
10-10-2020
Question
The impetus for my question is that I had hoped PostgreSQL would behave consistently when selecting from citext columns, regardless of whether the string to be matched is wrapped in one or more instances of lower() (any such wrapping is beyond my control). That appears not to be the case. (Of course, it is entirely possible that my tests are invalid or that I am misunderstanding fundamental concepts.)
Steps to Reproduce Testing Scenario
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users (id int, email citext);
INSERT INTO users(id, email) VALUES
(1, 'USER@example.com');
Tests
As expected when using the citext type, the lowercase variant yields a result:
# select * from users where email = 'user@example.com';
id | email
----+------------------
1 | USER@example.com
(1 row)
Changing the = operator to like, with the string wrapped in lower(), also yields a result:
select * from users where email like lower('user@example.com');
id | email
----+------------------
1 | USER@example.com
(1 row)
As does the "inverse":
# select * from users where lower(email) = 'user@example.com';
id | email
----+------------------
1 | USER@example.com
(1 row)
As does wrapping both values in lower():
# select * from users where lower(email) = lower('user@example.com');
id | email
----+------------------
1 | USER@example.com
(1 row)
My Question
Why then does the following query not return a result in this instance?
# select * from users where email = lower('user@example.com');
id | email
----+-------
(0 rows)
The manual says of the citext type:
Essentially, it internally calls lower when comparing values.
The operative word seems to be "essentially"; this statement implies the following, which does yield a result:
select * from users where lower(email) = lower(lower('user@example.com'));
id | email
----+------------------
1 | USER@example.com
(1 row)
Might this be related to the following caveat in the Limitations section of the above-cited document?
citext's case-folding behavior depends on the LC_CTYPE setting of your database.
# SHOW LC_CTYPE;
lc_ctype
-------------
en_US.UTF-8
(1 row)
Any explanation in this regard is much appreciated.
Solution
tl;dr: when comparing case-insensitive and case-sensitive values for equality, you have to be explicit. text is explicitly case-sensitive; citext is explicitly case-insensitive. You should be explicit and provide a cast so that both sides have the same type.
A few things about lower()
- lower() is typed.
- When its argument is text, it always returns text.
A few other points
- When you do a comparison with a literal, the type isn't known (it's explicitly unknown internally).
- Operators are functions.
- PostgreSQL coerces argument types as needed when it resolves which function (and therefore which operator) to call.
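One way to check these points for yourself is pg_typeof(), a built-in function that reports the type PostgreSQL resolved for an expression. This is only a sketch: it assumes the users table from the question, and the column aliases are illustrative names of my own.

```sql
-- Inspect the type PostgreSQL assigns to each expression.
SELECT pg_typeof('user@example.com')        AS bare_literal,    -- untyped string literal
       pg_typeof(lower('user@example.com')) AS lowered_literal, -- result of lower() on a literal
       pg_typeof(email)                     AS email_column,    -- the citext column itself
       pg_typeof(lower(email))              AS lowered_column   -- result of lower() on the column
FROM users;
```

If the points above hold, the bare literal should be reported as unknown, the column as citext, and both lower() results as text.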
In this case, the types are as follows, with a description of each:
-- text = unknown
-- unknown promoted to text, this has nothing to do with citext
lower(email) = 'user@example.com';
-- text = text
-- this has nothing to do with citext
lower(email) = lower('user@example.com');
-- text = text
-- this has nothing to do with citext
lower(email) = lower(lower('user@example.com'));
-- citext LIKE text
-- LIKE is smart `operator ~~(citext,text)` via `texticlike`
-- WORKS
email like lower('user@example.com');
-- citext = unknown
-- unknown promoted to citext, there is an `operator =(citext,citext)`
-- WORKS
email = 'user@example.com';
-- citext = text
-- citext promoted to text, there is no `operator =(citext,text)`
-- FAILS
email = lower('user@example.com');
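To see which operators actually exist, you can look in the pg_operator system catalog, and EXPLAIN will show how a given predicate was rewritten. This is a rough sketch, assuming the citext extension is installed in a schema on your search_path and the users table from the question exists; the column aliases are mine.

```sql
-- List the = and ~~ (LIKE) operators that take citext on either side.
SELECT oprname,
       oprleft::regtype  AS left_type,
       oprright::regtype AS right_type,
       oprcode           AS implementing_function
FROM pg_operator
WHERE oprname IN ('=', '~~')
  AND (oprleft = 'citext'::regtype OR oprright = 'citext'::regtype)
ORDER BY oprname;

-- Show the plan for the failing predicate, including any casts that were added.
EXPLAIN SELECT * FROM users WHERE email = lower('user@example.com');
```

The Filter line of the plan is the place to look: it should make visible which side of the comparison was cast, and to what type.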
In summary, there is an operator =(citext,citext), so you can cast the result of lower() back to citext:
email = lower('user@example.com')::citext;
Alternatively, you can define your own operator that sends = down the case-insensitive route rather than the case-sensitive route. I find that to be horrible practice, though; I'll always cast.
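Written out as a full query against the table from the question, the cast looks like the sketch below. Both spellings use standard PostgreSQL cast syntax; which one you pick is a matter of taste.

```sql
-- Cast the text result of lower() back to citext so that =(citext,citext) is chosen.
SELECT * FROM users WHERE email = lower('user@example.com')::citext;

-- Equivalent, using the SQL-standard CAST syntax.
SELECT * FROM users WHERE email = CAST(lower('user@example.com') AS citext);
```

With the single row inserted in the question, both queries should return the USER@example.com row.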