SQL Removing duplicates one row at a time

https://stackoverflow.com/questions/1322032

19-09-2019

Question

I have a table where I save all row-changes that have ever occurred. The problem is that in the beginning of the application there was a bug that made a bunch of copies of every row.

The table looks something like this:

copies
|ID |CID |DATA
| 1 | 1  |  DA
| 2 | 2  |  DO
| 2 | 3  |  DO (copy of CID 2)
| 1 | 4  |  DA (copy of CID 1)
| 2 | 5  |  DA
| 1 | 6  |  DA (copy of CID 1)
| 2 | 7  |  DO

CID is UNIQUE in table copies.

What I want is to remove, for each ID, every row whose DATA is identical to the DATA of the previous row with the same ID, with rows ordered by CID.

As you can see in the table, CID 2 and CID 3 have the same ID and DATA and follow one another, so I want to remove CID 3. The same goes for CID 4 and CID 6: no other row with ID 1 comes between them and CID 1, so they are copies of CID 1.

After duplicates removal, I would like the table to look like this:

copies
|ID |CID |DATA
| 1 | 1  |  DA
| 2 | 2  |  DO
| 2 | 5  |  DA
| 2 | 7  |  DO

Any suggestions? :)

I think my question was badly asked, because the answer everybody seems to think is the best gives this result:

ID   | DATA | DATA | DATA | DATA | DATA |     DATA | CID (expected) | CID (Quassnoi) |
1809 |    1 |    0 |    1 |    0 |    0 |     NULL |         252227 |         252227 |
1809 |    1 |    0 |    1 |    1 |    0 |     NULL |         381530 |         381530 |
1809 |    1 |    0 |    1 |    0 |    0 |     NULL |         438158 |      (missing) |
1809 |    1 |    0 |    1 |    0 | 1535 | 20090113 |         581418 |         581418 |
1809 |    1 |    1 |    1 |    0 | 1535 | 20090113 |         581421 |         581421 |

CID 252227 and CID 438158 are duplicates, but because CID 381530 comes between them, I want to keep CID 438158. Only duplicates that directly follow one another when ordered by ID and CID should be removed.
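
To make the rule concrete, here is a minimal sketch in plain Python, assuming the rows are available as (ID, CID, DATA) tuples. The function name and the tuple layout are hypothetical and only serve to illustrate which CIDs should be removed from the sample table above.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, int, str]  # (ID, CID, DATA); layout assumed for illustration


def consecutive_duplicate_cids(rows: List[Row]) -> List[int]:
    """Return the CIDs whose DATA repeats the previous row with the same ID."""
    last_data: Dict[int, str] = {}   # most recent DATA seen for each ID
    duplicates: List[int] = []
    # Walk the rows in CID order, as the question describes.
    for row_id, cid, data in sorted(rows, key=lambda r: r[1]):
        if row_id in last_data and last_data[row_id] == data:
            duplicates.append(cid)
        # Remember this row's DATA either way; a removed copy carries
        # the same DATA as the row it duplicates.
        last_data[row_id] = data
    return duplicates


SAMPLE: List[Row] = [
    (1, 1, "DA"), (2, 2, "DO"), (2, 3, "DO"), (1, 4, "DA"),
    (2, 5, "DA"), (1, 6, "DA"), (2, 7, "DO"),
]

print(consecutive_duplicate_cids(SAMPLE))  # [3, 4, 6]
```

CID 7 is kept even though CID 2 has the same ID and DATA, because CID 5 (same ID, different DATA) comes between them.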

Solution

DELETE   c.*
FROM     copies c
JOIN     (
         SELECT  id, data, MIN(cid) AS minc
         FROM    copies
         GROUP BY
                 id, data
         ) q
ON       c.id = q.id
         AND c.data = q.data
         AND c.cid <> q.minc

Update:

DELETE  c.*
FROM    (
        SELECT  cid
        FROM    (
                SELECT  cid,
                        COALESCE(data1 = @data1 AND data2 = @data2, FALSE) AS dup,
                        @data1 := data1,
                        @data2 := data2
                FROM    (
                        SELECT  @data1 := NULL,
                                @data2 := NULL
                        ) vars, copies ci
                ORDER BY
                        id, cid
                ) qi
        WHERE   dup
        ) q
JOIN    copies c
ON      c.cid = q.cid

This solution employs MySQL session variables.

There is a pure ANSI solution that would use NOT EXISTS; however, it would be slow because of the way the MySQL optimizer works (it won't use the range access method in a correlated subquery).

See this article on my blog for performance details on a closely related task:

MySQL: difference between sets
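
The NOT EXISTS approach mentioned above could look roughly like the following sketch. It is not the answer author's query: it is an illustration run against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained, and the helper names are hypothetical. A row counts as a duplicate when an earlier row with the same id and data exists and no other row with that id lies between the two. The CIDs are selected first and deleted by key afterwards, which avoids deleting from a table that the same statement is still reading.

```python
import sqlite3

# Sample rows from the question: (id, cid, data)
ROWS = [
    (1, 1, "DA"), (2, 2, "DO"), (2, 3, "DO"), (1, 4, "DA"),
    (2, 5, "DA"), (1, 6, "DA"), (2, 7, "DO"),
]

# p is the immediately preceding row with the same id: the NOT EXISTS
# clause rules out any row b with that id between p and c.
# Note: "p.data = c.data" never matches NULL values.
FIND_DUPLICATES = """
SELECT c.cid
FROM   copies c
JOIN   copies p
  ON   p.id = c.id AND p.cid < c.cid AND p.data = c.data
WHERE  NOT EXISTS (
         SELECT 1
         FROM   copies b
         WHERE  b.id = c.id AND b.cid > p.cid AND b.cid < c.cid
       )
ORDER BY c.cid
"""


def remove_consecutive_duplicates(conn: sqlite3.Connection) -> list:
    """Delete consecutive duplicates and return the removed CIDs."""
    dup_cids = [row[0] for row in conn.execute(FIND_DUPLICATES)]
    conn.executemany("DELETE FROM copies WHERE cid = ?",
                     [(cid,) for cid in dup_cids])
    conn.commit()
    return dup_cids


def demo():
    """Build the sample table, clean it, and report what happened."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE copies (id INTEGER, cid INTEGER PRIMARY KEY, data TEXT)")
    conn.executemany(
        "INSERT INTO copies (id, cid, data) VALUES (?, ?, ?)", ROWS)
    removed = remove_consecutive_duplicates(conn)
    remaining = [r[0] for r in
                 conn.execute("SELECT cid FROM copies ORDER BY cid")]
    conn.close()
    return removed, remaining
```

On the sample table, demo() should report CIDs 3, 4 and 6 as removed and leave 1, 2, 5 and 7, matching the expected result in the question.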

OTHER TIPS

You can use a count in a subquery for this:

delete from copies
where
    (select count(*) from copies s where s.id = copies.id 
                                   and s.data = copies.data 
                                   and s.cid > copies.cid) > 0

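// EDITED for @Jonathan Leffler comment
//$sql = "SELECT ID,CID,DATA FROM copies ORDER BY CID, ID";
// Order by ID first so that rows with the same ID are adjacent, then by CID.
$sql = "SELECT ID,CID,DATA FROM copies ORDER BY ID, CID";
$result = mysql_query($sql, $link);
$data = null;
$id = null;
while ($row = mysql_fetch_row($result)) {
    // A row is a duplicate when it has the same ID and DATA as the previous row.
    if ($id !== null && $row[0] == $id && $row[2] == $data) {
        // CID comes from the database; cast it anyway before building the query.
        $sql2 = "DELETE FROM copies WHERE CID=" . (int)$row[1];
        $res = mysql_query($sql2, $link);
    }
    $id = $row[0];
    $data = $row[2];
}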

delete from copies where cid in (select max_cid from (select max(cid) as max_cid from copies group by id, data having count(*) > 1) t)

Licensed under: CC-BY-SA with attribution
