How can you find the rows with equal columns?

https://stackoverflow.com/questions/1436747

08-07-2019
|

Question

If I have a table with important 2 columns,

CREATE TABLE foo (id INT, a INT, b INT, KEY a, KEY b);

How can I find all the rows that have both a and b being the same in both rows? For example, in this data set

id | a | b
----------
1  | 1 | 2
2  | 5 | 42
3  | 1 | 42
4  | 1 | 2 
5  | 1 | 2
6  | 1 | 42

I want to get back all rows except for id=2 since it is unique in (a,b). Basically, I want to find all offending rows that would stop a

ALTER TABLE foo ADD UNIQUE (a, b);

Something better than an n^2 for loop would be nice since my table has 10M rows.

For bonus points : How do I removed all but one of the rows (I don't care which ones, as long as one is left)

Solution

SELECT * 
FROM foo first
JOIN foo second
  ON ( first.a = second.a
       AND first.b = second.b ) 
  AND (first.id <> second.id )

Should come up with all the rows where more that one row has the same combination of a and b.

Just hope you have an index on columns a and b.

OTHER TIPS

select * from foo where a = b

Or am I missing something?

===

Update for clarity:

select * from 
foo as a
inner join foo as b
on a.a = b.a AND b.a = b.b
and a.id != b.id

++++++++++ After 3rd clarity edit:

select f1.id
FROM foo as f1
INNER JOIN foo as f2
ON f1.a = f2.a AND f1.b=f2.b AND f1.id != f2.id

But I'm shot, so check it yourself.

Try this:

    With s as (Select a,b from foo group by a,b having Count(1)>1)
Select foo.* from foo,s where foo.a=s.a and foo.b=s.b

This query should show duplicate rows in the table foo.

Could you please clarify what you need to do ultimately? The best solution may depend on that (e.g., do you simply want to delete all dupliucate-key rows?)

One way is to handle this table (not sure if mySQL supports it, it's from SYBASE) if all you want is unique-keyed rows:

SELECT MIN(id), A, B FROM FOO GROUP BY A, B HAVING COUNT(*)>1

Your exact question (although I'm a bit at a loss as to why you'd need all rows except id=2) is:

SELECT F1.*  
FROM FOO F1 , 
     (SELECT A, B FROM FOO GROUP BY A, B HAVING COUNT(*)>1) F2
WHERE F1.A=F2.A and F1.B=F2.B

To delete all the duplicates, you can for example do

DELETE FOO WHERE NOT EXISTS
(SELECT 1 from
    (SELECT MIN(id) 'min_id' FROM FOO GROUP BY A, B HAVING COUNT(*)>1) UINIQUE_IDS 
 WHERE id = min_id)

As an alternative, you can do

  SELECT MIN(id) 'id', A, B INTO TEMPDB..NEW_TABLE 
  FROM FOO GROUP BY A, B HAVING COUNT(*)>1

  TRUNCATE TABLE FOO
  // Drop indices on FOO
  INSERT FOO SELECT * FROM NEW_TABLE
  // Recreate indices on FOO

shouldn't this work?

SELECT * FROM foo WHERE a = b

=== edit ===

the how about

SELECT a, b FROM foo GROUP BY a, b HAVING COUNT(*) > 1

=== final re-edit before i give up on this question ===

SELECT foo.* FROM foo, (
   SELECT a, b FROM foo GROUP BY a, b HAVING COUNT(*) > 1
) foo2
WHERE foo.a = foo2.a AND foo.b = foo2.b

here's another approach

select * from foo f1 where exists(
  select * from foo f2 where
    f1.id != f2.id and
    f1.a = f2.a and
    f1.b = f2.b )

anyway, even though I find it a bit more readable, if you have such a huge table, you should check the execution plan, subqueries have a bad reputation involving performance...

you should also consider creating the index (without the unique clause, obviously) to speed up the query... for huge operations, sometimes it's better to spend the time creating the index, perform the update and then drop the index... in this case, I guess an index on (a, b) should certainly help a lot...

Your stated goal is to remove all duplicate combination of (a,b). For that, you can use a multi-table DELETE:

DELETE t1
  FROM foo t1
  JOIN foo t2 USING (a, b)
 WHERE t2.id > t1.id

Before you run it, you can check which rows will be removed with:

SELECT DISTINCT t1.id
  FROM foo t1
  JOIN foo t2 USING (a, b)
 WHERE t2.id > t1.id

The WHERE clause being t2.id > t1.id it will remove all but the one with the highest value for id. In your case, only the rows with id equal to 2, 5 or 6 would remain.

If the id value doesn't matter at all in the final product, that is, if you could renumber them all and it would be fine, and if id is a serial column, then just "select distinct" on the two columns into a new table, delete all the data from the old table, and then copy the temporary values back in.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow