SQL Removing duplicates one row at a time
19-09-2019
Question
I have a table where I save all row-changes that have ever occurred. The problem is that in the beginning of the application there was a bug that made a bunch of copies of every row.
The table looks something like this:
copies
|ID |CID |DATA
| 1 | 1 | DA
| 2 | 2 | DO
| 2 | 3 | DO (copy of CID 2)
| 1 | 4 | DA (copy of CID 1)
| 2 | 5 | DA
| 1 | 6 | DA (copy of CID 1)
| 2 | 7 | DO
CID is UNIQUE in table copies.
What I want is to remove every row whose DATA duplicates the previous row with the same ID, when the rows are sorted by CID.
As you can see in the table, CID 2 and CID 3 hold the same data and follow one another, so I want to remove CID 3. The same goes for CID 4 and CID 6: there is no other ID 1 row between them, and both are copies of CID 1.
After duplicates removal, I would like the table to look like this:
copies
|ID |CID |DATA
| 1 | 1 | DA
| 2 | 2 | DO
| 2 | 5 | DA
| 2 | 7 | DO
Any suggestions? :)
I think my question was badly asked because the answer everybody seems to think is the best gives this result:
ID   | DATA | DATA | DATA | DATA | DATA | DATA     | CID (Expected) | CID (Quassnoi)
1809 | 1    | 0    | 1    | 0    | 0    | NULL     | 252227         | 252227
1809 | 1    | 0    | 1    | 1    | 0    | NULL     | 381530         | 381530
1809 | 1    | 0    | 1    | 0    | 0    | NULL     | 438158         | (missing)
1809 | 1    | 0    | 1    | 0    | 1535 | 20090113 | 581418         | 581418
1809 | 1    | 1    | 1    | 0    | 1535 | 20090113 | 581421         | 581421
CID 252227 and CID 438158 are duplicates, but because CID 381530 comes between them, I want to keep CID 438158. Only duplicates that directly follow one another when ordering by ID and CID should be removed.
Solution
DELETE c.*
FROM copies c
JOIN (
    -- lowest CID for each (id, data) combination
    SELECT id, data, MIN(cid) AS minc
    FROM copies
    GROUP BY
        id, data
) q
    ON c.id = q.id
    AND c.data = q.data
    AND c.cid <> q.minc
Update:
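DELETE c.*
FROM (
    SELECT cid
    FROM (
        -- data1, data2 stand in for the table's data columns
        SELECT cid,
               -- dup is true when the row repeats the previous row of the same id
               COALESCE(id = @id AND data1 = @data1 AND data2 = @data2, FALSE) AS dup,
               @id := id,
               @data1 := data1,
               @data2 := data2
        FROM (
            SELECT @id := NULL,
                   @data1 := NULL,
                   @data2 := NULL
        ) vars, copies ci
        ORDER BY
            id, cid
    ) qi
    WHERE dup
) q
JOIN copies c
    ON c.cid = q.cid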
This solution employs MySQL session variables.
There is a pure ANSI solution that would use NOT EXISTS; however, it would be slow due to the way the MySQL optimizer works (it won't use the range access method in a correlated subquery).
See this article in my blog for performance details for quite a close task:
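For reference, the NOT EXISTS idea mentioned above can be sketched as follows. This is only an illustration, not the code from the answer: it runs against an in-memory SQLite database through Python's sqlite3 module rather than MySQL, uses the single data column from the question's sample table, and the names find_dups, dup_cids and remaining are made up for the example. A row counts as a duplicate when its immediate predecessor (same id, no row in between) holds the same data.

```python
import sqlite3

# In-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copies (id INTEGER, cid INTEGER PRIMARY KEY, data TEXT)")
conn.executemany(
    "INSERT INTO copies (id, cid, data) VALUES (?, ?, ?)",
    [(1, 1, "DA"), (2, 2, "DO"), (2, 3, "DO"), (1, 4, "DA"),
     (2, 5, "DA"), (1, 6, "DA"), (2, 7, "DO")],
)

# p is the immediate predecessor of c: same id, lower cid, and NOT EXISTS
# any row of that id between them. c is a duplicate if p has the same data.
# Note: "=" never matches NULL data; a NULL-safe comparison would be needed there.
find_dups = """
SELECT c.cid
FROM copies c
JOIN copies p
  ON p.id = c.id AND p.cid < c.cid AND p.data = c.data
WHERE NOT EXISTS (
    SELECT 1
    FROM copies m
    WHERE m.id = c.id AND m.cid > p.cid AND m.cid < c.cid
)
"""

# Collect the duplicate CIDs first, then delete them one by one.
dup_cids = sorted(row[0] for row in conn.execute(find_dups))
conn.executemany("DELETE FROM copies WHERE cid = ?", [(cid,) for cid in dup_cids])

remaining = conn.execute("SELECT id, cid, data FROM copies ORDER BY cid").fetchall()
print(dup_cids)
print(remaining)
```

Collecting the CIDs before deleting keeps the delete from reading the table it is modifying, which is also the part MySQL tends to object to.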
OTHER TIPS
You can use a count in a subquery for this:
delete from copies
where
    (select count(*)
     from copies s
     where s.id = copies.id
       and s.data = copies.data
       and s.cid > copies.cid) > 0
// EDITED for @Jonathan Leffler comment
//$sql = "SELECT ID,CID,DATA FROM copies ORDER BY CID, ID";
// $link is assumed to be an open mysqli connection
$sql = "SELECT ID,CID,DATA FROM copies ORDER BY ID, CID";
$result = mysqli_query($link, $sql);
$data = "";
$id = "";
while ($row = mysqli_fetch_row($result)){
    // delete the row when it repeats the previous row's ID and DATA
    if (($id != "") && ($row[0] == $id) && ($row[2] == $data)){
        $sql2 = "DELETE FROM copies WHERE CID=".(int)$row[1];
        $res = mysqli_query($link, $sql2);
    }
    // remember this row for the next comparison
    $id = $row[0];
    $data = $row[2];
}
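The same row-by-row idea can be tried without a database. Below is a minimal Python sketch, assuming the rows are available as (id, cid, data) tuples; the function name consecutive_duplicate_cids and the variable names are invented for illustration. It walks the rows ordered by ID and then CID, and reports every CID whose ID and DATA repeat the row just before it.

```python
def consecutive_duplicate_cids(rows):
    """Return the CIDs that repeat the previous row of the same ID.

    rows is an iterable of (id, cid, data) tuples; data may be any
    comparable value, e.g. a tuple holding several data columns.
    """
    to_delete = []
    previous = None  # (id, data) of the last row visited
    for row_id, cid, data in sorted(rows, key=lambda r: (r[0], r[1])):
        if previous == (row_id, data):
            to_delete.append(cid)
        previous = (row_id, data)
    return to_delete


# Sample table from the question.
sample = [(1, 1, "DA"), (2, 2, "DO"), (2, 3, "DO"), (1, 4, "DA"),
          (2, 5, "DA"), (1, 6, "DA"), (2, 7, "DO")]
print(consecutive_duplicate_cids(sample))  # [4, 6, 3]

# Rows from the question's second example: nothing is adjacent and equal,
# so CID 438158 is kept even though it matches CID 252227.
second = [
    (1809, 252227, (1, 0, 1, 0, 0, None)),
    (1809, 381530, (1, 0, 1, 1, 0, None)),
    (1809, 438158, (1, 0, 1, 0, 0, None)),
    (1809, 581418, (1, 0, 1, 0, 1535, 20090113)),
    (1809, 581421, (1, 1, 1, 0, 1535, 20090113)),
]
print(consecutive_duplicate_cids(second))  # []
```

Unlike SQL's =, Python treats None as equal to None, so rows whose NULL columns match are still recognised as duplicates here.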
delete from copies
where cid in (
    select max_cid
    from (select max(cid) as max_cid from copies group by id, data having count(*) > 1) t
)