Question

I have a Postgres 11 table on RDS containing a column email; some of the values in this column [and only that column] are clearly de facto duplicates that differ only in case, i.e., capitalization, such as:

foo@****.com
Foo@****.com

To be clear, none of the present rows are true duplicates, nor share precisely the same values in that column. My objective is to identify these records [and, once found, eliminate/merge the de-facto duplicates].

My initial inclination was to use a self-join, e.g.:

SELECT c.email 
FROM schema.table c 
INNER JOIN schema.table d ON lower(c.email) = lower(d.email)
ORDER BY c.email;

However, this returns all the email records rather than only the de facto duplicates.

Using a subquery such as the following produces a similar [i.e., too-inclusive] result:

SELECT c.email, alias.email 
FROM schema.table c 
  JOIN (SELECT email FROM schema.table) alias ON lower(c.email) = lower(alias.email);

Since I’m not looking for an aggregate, but rather a case-insensitive comparison, it seems to me that a window function is not the correct approach.

I think that this should be a straightforward query, but I’m having a difficult time seeing it clearly and am sure there is an error in the way I’m conceiving of the problem; it’s pretty frustrating.

In addition to searching here and on SO, I consulted Molinaro’s SQL Cookbook, but to no avail.

What is the correct way to structure the query so that it returns only those records whose email values are the same, disregarding case?

edit note: my initial formulation of the question expressed a misguided inclination to use ILIKE for case-insensitive matching; the use of lower() as suggested in the answers below is far more sensible


Solution

Another case for EXISTS:

SELECT *
FROM   schema.table t
WHERE  EXISTS (
   SELECT FROM schema.table t1
   WHERE  lower(t.email) = lower(t1.email)
   AND    t.ctid <> t1.ctid
   )
ORDER   BY lower(email), email;

If you have a primary key, use it instead of ctid.
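As a sketch, assuming a hypothetical primary key column named id, the same query would become:

```sql
SELECT *
FROM   schema.table t
WHERE  EXISTS (
   SELECT FROM schema.table t1
   WHERE  lower(t.email) = lower(t1.email)
   AND    t.id <> t1.id   -- PK instead of the system column ctid
   )
ORDER  BY lower(email), email;
```

A PK is preferable because ctid is a physical row locator that can change after UPDATE or VACUUM FULL, while a PK value is stable.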

This returns every qualifying row once. The added ORDER BY keeps dupes together and in a deterministic sort order (unless your locale is case-insensitive).

Why not use a simple join?

If you have, say, 10 case variants of the same email, a simple join would give you 10 · 9 = 90 rows, repeating every combination with the roles reversed. Basically a limited Cartesian product of all dupes within each set of dupes.
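To illustrate the blow-up, here is a sketch of the plain self-join (self-matches excluded) that produces every ordered pair:

```sql
-- With 10 case variants of one address, this returns 10 * 9 = 90 rows,
-- each pair appearing twice with the roles reversed:
SELECT c.email AS email_a, d.email AS email_b
FROM   schema.table c
JOIN   schema.table d ON lower(c.email) = lower(d.email)
                     AND c.ctid <> d.ctid;
```

The EXISTS form above avoids this because it only tests whether at least one other matching row exists, rather than enumerating all pairings.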

A trigram index should greatly help performance with tables of non-trivial size.

Also note that lower(t.email) = lower(t1.email) is subtly different from t.email ILIKE t1.email. The latter treats the right side as a pattern, where some characters have special meaning unless you escape them.
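For example, the underscore is a pattern wildcard that matches any single character, so two genuinely different addresses can "match" under ILIKE (addresses here are made up for illustration):

```sql
-- '_' in the pattern (right side) matches any single character:
SELECT 'foo.bar@example.com' ILIKE 'foo_bar@example.com';          -- true
-- lower() equality compares the strings literally:
SELECT lower('foo.bar@example.com') = lower('foo_bar@example.com'); -- false
```

This is why lower() equality is the safer tool for detecting case-only duplicates.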

OTHER TIPS

You need to exclude the rows from each other:

SELECT c.email
  FROM schema.table c,
       schema.table d
 WHERE lower(c.email) = lower(d.email)
   AND c.email <> d.email;
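Note that this query lists each duplicate pair twice, once in each order. A common refinement, sketched here, is to replace <> with < so every pair appears only once:

```sql
SELECT c.email, d.email
  FROM schema.table c,
       schema.table d
 WHERE lower(c.email) = lower(d.email)
   AND c.email < d.email;  -- each case-variant pair appears exactly once
```

This relies on the stated premise that no two rows share the exact same email value.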

One question you will need to answer (and I will update my answer if relevant) -- are you using ILIKE because you want partial matches too? That is, are bert@foo.com and RObert@foo.com supposed to be considered duplicates? (I would assume not, but when people use ILIKE, it's usually to match parts of strings, as in the bert/robert example)

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange