SQL: Detect duplicate customers

https://stackoverflow.com/questions/9378630

28-10-2019

Question

I'm trying to write a SQL query that will detect (possible) duplicate customers in my database.

I have two tables:

Customer with the columns cid, firstname, lastname, zip. Note that cid is the unique customer id and the primary key of this table.
IgnoreForDuplicateCustomer with the columns cid1, cid2. Both columns are foreign keys that reference Customer(cid). This table is used to record that the customer with cid1 is not the same as the customer with cid2.

For example, suppose I have

a Customer entry with cid = 1, firstname = "foo", lastname = "anonymous" and zip = "11231"
and another Customer entry with cid = 2, firstname = "foo", lastname = "anonymous" and zip = "11231".

My SQL query should search for customers that have the same firstname, lastname and zip, and then detect that the customer with cid = 1 is the same as the customer with cid = 2.

However, it should be possible to state that the customers with cid = 1 and cid = 2 are not the same, by storing a new entry in the IgnoreForDuplicateCustomer table with cid1 = 1 and cid2 = 2.

Detecting the duplicate customers works well with this SQL query:

SELECT cid, firstname, lastname, zip, COUNT(*) AS NumOccurrences
  FROM Customer
 GROUP BY firstname, lastname, zip
HAVING COUNT(*) > 1

My problem is that I am not able to integrate the IgnoreForDuplicateCustomer table so that, as in my previous example, the customers with cid = 1 and cid = 2 are not reported as the same, since there is an entry/rule for them in the IgnoreForDuplicateCustomer table.

So I tried to extend my previous query by adding a WHERE clause:

    SELECT cid, firstname, lastname, COUNT(*) AS NumOccurrences
      FROM Customer
     WHERE cid NOT IN (
               SELECT cid1 FROM IgnoreForDuplicateCustomer WHERE cid2 = cid
               UNION
               SELECT cid2 FROM IgnoreForDuplicateCustomer WHERE cid1 = cid
           )
     GROUP BY firstname, lastname, zip
    HAVING COUNT(*) > 1

Unfortunately this additional WHERE clause has absolutely no impact on my result. Any suggestions?
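
A likely reason the filter changes nothing: inside the subquery, the unqualified cid refers to the current Customer row, so the subquery returns the id of that customer's ignore partner and never the customer's own id. The condition cid NOT IN (...) is therefore true for every row. The sketch below checks this with Python's sqlite3 module; the use of SQLite and the sample rows (taken from the example above) are assumptions, not the original setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, zip TEXT);
    CREATE TABLE IgnoreForDuplicateCustomer (cid1 INTEGER, cid2 INTEGER);
    INSERT INTO Customer VALUES (1, 'foo', 'anonymous', '11231'),
                                (2, 'foo', 'anonymous', '11231');
    INSERT INTO IgnoreForDuplicateCustomer VALUES (1, 2);
""")

# Same WHERE clause as in the question, without the GROUP BY,
# to see which customer rows survive the filter.
surviving = [row[0] for row in conn.execute("""
    SELECT cid FROM Customer
    WHERE cid NOT IN (
        SELECT cid1 FROM IgnoreForDuplicateCustomer WHERE cid2 = cid
        UNION
        SELECT cid2 FROM IgnoreForDuplicateCustomer WHERE cid1 = cid
    )
    ORDER BY cid
""")]

# For cid = 1 the subquery yields {2}; for cid = 2 it yields {1}.
# Neither set contains the row's own cid, so both rows are kept.
print(surviving)
```

Since both rows pass the filter, the GROUP BY still sees the same two customers as before.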

Solution

Here you are:

SELECT a.*
FROM (
  SELECT c1.cid AS CID1, c2.cid AS CID2
  FROM Customer c1
  JOIN Customer c2 ON c1.firstname = c2.firstname
    AND c1.lastname = c2.lastname AND c1.zip = c2.zip
    AND c1.cid < c2.cid) a
LEFT JOIN (
  SELECT cid1 AS CID1, cid2 AS CID2
  FROM ignoreforduplicatecustomer one
  UNION
  SELECT cid2 AS CID1, cid1 AS CID2
  FROM ignoreforduplicatecustomer two) b ON a.cid1 = b.cid1 AND a.cid2 = b.cid2
WHERE b.cid1 IS NULL

This returns the ID pairs of duplicate records in the customer table that are not listed in the ignoreforduplicatecustomer table.

Tested with:

CREATE TABLE IF NOT EXISTS `customer` (
 `CID` int(11) NOT NULL AUTO_INCREMENT,
 `Firstname` varchar(50) NOT NULL,
 `Lastname` varchar(50) NOT NULL,
 `ZIP` varchar(10) NOT NULL,
 PRIMARY KEY (`CID`)) 
ENGINE=InnoDB  DEFAULT CHARSET=latin1 AUTO_INCREMENT=100 ;

INSERT INTO `customer` (`CID`, `Firstname`, `Lastname`, `ZIP`) VALUES
(1, 'John', 'Smith', '1234'),
(2, 'John', 'Smith', '1234'),
(3, 'John', 'Smith', '1234'),
(4, 'Jane', 'Doe', '1234');

And:

CREATE TABLE IF NOT EXISTS `ignoreforduplicatecustomer` (
 `CID1` int(11) NOT NULL,
 `CID2` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;


INSERT INTO `ignoreforduplicatecustomer` (`CID1`, `CID2`) VALUES
(1, 2);

Results for my test setup are:

CID1  CID2
 1     3
 2     3
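
The same anti-join can also be written with NOT EXISTS instead of LEFT JOIN ... IS NULL. The following is a minimal sketch of that variant, not the query tested above. It runs against an in-memory SQLite database through Python's sqlite3 module (an assumption, since the original test used MySQL) with the same test rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        CID INTEGER PRIMARY KEY, Firstname TEXT NOT NULL,
        Lastname TEXT NOT NULL, ZIP TEXT NOT NULL);
    CREATE TABLE ignoreforduplicatecustomer (
        CID1 INTEGER NOT NULL, CID2 INTEGER NOT NULL);
    INSERT INTO customer VALUES
        (1, 'John', 'Smith', '1234'), (2, 'John', 'Smith', '1234'),
        (3, 'John', 'Smith', '1234'), (4, 'Jane', 'Doe', '1234');
    INSERT INTO ignoreforduplicatecustomer VALUES (1, 2);
""")

# Self-join to build candidate pairs (c1.cid < c2.cid keeps each pair once),
# then drop pairs that appear in the ignore table in either orientation.
pairs = conn.execute("""
    SELECT c1.cid AS cid1, c2.cid AS cid2
    FROM customer c1
    JOIN customer c2 ON c1.firstname = c2.firstname
                    AND c1.lastname = c2.lastname
                    AND c1.zip = c2.zip
                    AND c1.cid < c2.cid
    WHERE NOT EXISTS (
        SELECT 1 FROM ignoreforduplicatecustomer i
        WHERE (i.cid1 = c1.cid AND i.cid2 = c2.cid)
           OR (i.cid1 = c2.cid AND i.cid2 = c1.cid))
    ORDER BY c1.cid, c2.cid
""").fetchall()

print(pairs)  # [(1, 3), (2, 3)]
```

Checking both orientations inside NOT EXISTS plays the role of the UNION in the query above.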

OTHER TIPS

Edit as per TPete's comment (didn't try it):

SELECT 
    C1.cid, C1.firstname, C1.lastname
FROM 
    Customer C1,
    Customer C2
WHERE
    C1.cid < C2.cid AND 
    C1.firstname = C2.firstname AND 
    C1.lastname = C2.lastname AND 
    C1.zip = C2.zip AND 
    -- NOT EXISTS rather than "<> (subquery)": a scalar subquery that finds
    -- no row yields NULL, and "x <> NULL" would filter out every pair.
    NOT EXISTS
       (SELECT 1 FROM IgnoreForDuplicateCustomer I
         WHERE (I.cid1 = C1.cid AND I.cid2 = C2.cid)
            OR (I.cid1 = C2.cid AND I.cid2 = C1.cid));

Initially I thought that IgnoreForDuplicateCustomer was a field in the customer table.

Crazy, but I think it works :)

First I join the Customer table with itself on the names to get the duplicates, then I exclude the keys found in the IgnoreForDuplicateCustomer table (the UNION is needed because the first query returns both cid1, cid2 and cid2, cid1).

Each pair will appear twice in the result, but I think you can get the info you need:

select c1.cid, c2.cid
from Customer c1 
     join Customer c2 on c1.firstname=c2.firstname 
     and c1.lastname=c2.lastname and c1.zip=c2.zip
     and c1.cid!=c2.cid
except 
(
    select cid1,cid2 from IgnoreForDuplicateCustomer
    UNION
    select cid2,cid1 from IgnoreForDuplicateCustomer
)
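
To try the EXCEPT approach without a SQL Server instance, here is a sketch for SQLite via Python's sqlite3 module. Two things are assumptions: the test rows are borrowed from the accepted answer, and the right-hand side is wrapped in a derived table because SQLite does not accept parentheses around a compound SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, zip TEXT);
    CREATE TABLE IgnoreForDuplicateCustomer (cid1 INTEGER, cid2 INTEGER);
    INSERT INTO Customer VALUES
        (1, 'John', 'Smith', '1234'), (2, 'John', 'Smith', '1234'),
        (3, 'John', 'Smith', '1234'), (4, 'Jane', 'Doe', '1234');
    INSERT INTO IgnoreForDuplicateCustomer VALUES (1, 2);
""")

# The self-join uses != so every duplicate pair shows up in both orders;
# the ignore list is therefore expanded to both orders before EXCEPT.
pairs = conn.execute("""
    SELECT c1.cid, c2.cid
    FROM Customer c1
    JOIN Customer c2 ON c1.firstname = c2.firstname
                    AND c1.lastname = c2.lastname
                    AND c1.zip = c2.zip
                    AND c1.cid != c2.cid
    EXCEPT
    SELECT cid1, cid2 FROM (
        SELECT cid1, cid2 FROM IgnoreForDuplicateCustomer
        UNION
        SELECT cid2, cid1 FROM IgnoreForDuplicateCustomer)
    ORDER BY 1, 2
""").fetchall()

print(pairs)  # [(1, 3), (2, 3), (3, 1), (3, 2)]
```

Replacing != with < in the join would keep each pair only once.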

Second shot:

select firstname,lastname,zip from Customer 
group by firstname,lastname,zip 
having (count(*)>1)
except
select c1.firstname, c1.lastname, c1.zip
from Customer c1 join IgnoreForDuplicateCustomer IG on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid

Third:

select firstname,lastname,zip from (
    select firstname,lastname,zip from Customer 
    group by firstname,lastname,zip 
    having (count(*)>1)
) X
where firstname not in (
select c1.firstname
from Customer c1 join IgnoreForDuplicateCustomer IG on c1.cid=ig.cid1 join Customer c2 on ig.cid2=c2.cid
)
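
One thing to keep in mind: the second and third queries work on whole (firstname, lastname, zip) groups rather than on pairs of ids, so a single ignore entry hides the entire group. The sketch below illustrates this in SQLite through Python's sqlite3 module, using the accepted answer's test rows (both are assumptions). Customers 1, 2 and 3 share name and zip and only the pair (1, 2) is ignored, yet neither query reports the group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, zip TEXT);
    CREATE TABLE IgnoreForDuplicateCustomer (cid1 INTEGER, cid2 INTEGER);
    INSERT INTO Customer VALUES
        (1, 'John', 'Smith', '1234'), (2, 'John', 'Smith', '1234'),
        (3, 'John', 'Smith', '1234'), (4, 'Jane', 'Doe', '1234');
    INSERT INTO IgnoreForDuplicateCustomer VALUES (1, 2);
""")

# "Second shot": duplicate groups minus groups that contain an ignored customer.
second = conn.execute("""
    SELECT firstname, lastname, zip FROM Customer
    GROUP BY firstname, lastname, zip
    HAVING COUNT(*) > 1
    EXCEPT
    SELECT c1.firstname, c1.lastname, c1.zip
    FROM Customer c1
    JOIN IgnoreForDuplicateCustomer ig ON c1.cid = ig.cid1
    JOIN Customer c2 ON ig.cid2 = c2.cid
""").fetchall()

# "Third": same idea, but filtering on firstname only.
third = conn.execute("""
    SELECT firstname, lastname, zip FROM (
        SELECT firstname, lastname, zip FROM Customer
        GROUP BY firstname, lastname, zip
        HAVING COUNT(*) > 1
    ) x
    WHERE firstname NOT IN (
        SELECT c1.firstname
        FROM Customer c1
        JOIN IgnoreForDuplicateCustomer ig ON c1.cid = ig.cid1
        JOIN Customer c2 ON ig.cid2 = c2.cid)
""").fetchall()

print(second, third)  # [] []
```

If the pairs (1, 3) and (2, 3) should still be reported, a pair-based query is needed.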

Licensed under: CC-BY-SA with attribution
