SQL deduplicate from two columns
-
06-06-2021 - |
Question
I've been struggling with this for quite some time but i just can't figure it out.
I have a table with 3 columns. 2 columns containing names and the third one containing the Damerau Levensthein distance ( http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance ) between those names.
each column has includes every single name, that means all the names present in the autor1 column are present in the autor2 column as well. as a result i have twice the rows needed, just with the autor1 and autor2 columns swapped.
as an example, row 3 equals to row 1 just with the autor columns swapped, the same goes for 2-4. how would i formulate a query that omits those "duplicates"?
id-|------autor1----|------autor2-----| dld
1 -| Abel, Gustav -| Abel, Gustave | 1
2 -| Abel, Gustav -| Abele, Gustav | 1
3 -| Abel, Gustave| Abel, Gustav --| 1
4 -| Abele, Gustav | Abel, Gustav -| 1
to
|------autor1----|------autor2-----| dld
| Abel, Gustav -| Abel, Gustave | 1
| Abel, Gustav -| Abele, Gustav | 1
Solution
Using NOT EXISTS
works on all DBMS's I know of. The complexity of this is to not forget to include a clause on id
. Without it, nothing would be returned.
SELECT *
FROM YourTable yto
WHERE NOT EXISTS (
SELECT *
FROM YourTable yti
WHERE yti.autor2 = yto.autor1
AND yti.id > yto.id
)
Edit
step by step, following is a breakdown of the logic behind the statement
- Get the first record (
ID = 1
) - Is there a record where
ID > 1
andautor1 = autor2
(Yes, ID 3) -> Ignore - Get the next record (
ID = 2
) - Is there a record where
ID > 2
andautor1 = autor2
(Yes, ID 4) -> Ignore - Get the next record (
ID = 3
) - Is there a record where
ID > 3
andautor1 = autor2
(No) -> Include - Get the next record (
ID = 4
) - Is there a record where
ID > 4
andautor1 = autor2
(No) -> Include