Comparing SQL Table to itself (Self-join)

https://stackoverflow.com/questions/1890411

19-09-2019
|

Question

I'm trying to find duplicate rows based on mixed columns. This is an example of what I have:

CREATE TABLE Test
(
   id INT PRIMARY KEY,
   test1 varchar(124),
   test2 varchar(124)
)

INSERT INTO TEST ( id, test1, test2 ) VALUES ( 1, 'A', 'B' )
INSERT INTO TEST ( id, test1, test2 ) VALUES ( 2, 'B', 'C' )

Now if I run this query:

SELECT [LEFT].[ID] 
FROM [TEST] AS [LEFT] 
   INNER JOIN [TEST] AS [RIGHT] 
   ON [LEFT].[ID] != [RIGHT].[ID] 
WHERE [LEFT].[TEST1] = [RIGHT].[TEST2]

I would expect to get back both id's. (1 and 2), however I only ever get back the one row.

My thoughts would be that it should compare each row, but I guess this is not correct? To fix this I had changed my query to be:

SELECT [LEFT].[ID] 
FROM [TEST] AS [LEFT] 
   INNER JOIN [TEST] AS [RIGHT] 
   ON [LEFT].[ID] != [RIGHT].[ID] 
WHERE [LEFT].[TEST1] = [RIGHT].[TEST2] 
OR [LEFT].[TEST2] = [RIGHT].[TEST1]

Which gives me both rows, but the performance degrades extremely quickly based on the number of rows.

The final solution I came up for for performance and results was to use a union:

SELECT [LEFT].[ID] 
FROM [TEST] AS [LEFT] 
   INNER JOIN [TEST] AS [RIGHT] 
   ON [LEFT].[ID] != [RIGHT].[ID] 
WHERE [LEFT].[TEST1] = [RIGHT].[TEST2] 
UNION
SELECT [LEFT].[ID] 
FROM [TEST] AS [LEFT] 
   INNER JOIN [TEST] AS [RIGHT] 
   ON [LEFT].[ID] != [RIGHT].[ID] 
WHERE [LEFT].[TEST2] = [RIGHT].[TEST1]

But overall, I'm obviously missing an understanding of why this is not working which means that I'm probably doing something wrong. Could someone point me in the proper direction?

Solution

Do not JOIN on an inequality; it seems that the JOIN and WHERE conditions are inverted.

SELECT t1.id
FROM Test t1
INNER JOIN Test t2
ON ((t1.test1 = t2.test2) OR (t1.test2 = t2.test1))
WHERE t1.id <> t2.id

Should work fine.

OTHER TIPS

You only get back both id's if you select them:

SELECT [LEFT].[ID], [RIGHT].[ID] 
FROM [TEST] AS [LEFT] 
   INNER JOIN [TEST] AS [RIGHT] 
   ON [LEFT].[ID] != [RIGHT].[ID] 
WHERE [LEFT].[TEST1] = [RIGHT].[TEST2]

The reason that only get one ROW is that only one row (namely row #2) has a TEST1 that is equal to another row's TEST2.

I looks like you're working very quickly toward a Cartiesian Join. Normally if you're looking to return duplicates, you need to run something like:

SELECT [LEFT].*
FROM [TEST]  AS [LEFT]
INNER JOIN [TEST] AS [RIGHT]
    ON [LEFT].[test1] = [RIGHT].[test1]
        AND [LEFT].[test2] = [RIGHT].[test2]
        AND [LEFT].[id] <> [RIGHT].[id]

If you need to mix the columns, then mix the needed conditions, but do something like:

SELECT [LEFT].*
FROM [TEST] AS [LEFT]
INNER JOIN [TEST] AS [RIGHT]
    ON (
        [LEFT].[test1] = [RIGHT].[test2]
            OR [LEFT].[test2] = [RIGHT].[test1]
       )
        AND [LEFT].[id] <> [RIGHT].[id]

Using that, you compare the right to the left and the left to the right in each join, eliminating the need for the WHERE altogether.

However, this style of query grows exponentially in execution time for each row inserted into the table, since you're comparing each row to every row.

This can be done with out inner joins if I am not mistaken. This my first time answering mysql kind of question but I am just answering to get more points here on StackOverflow. The comma is very important so that mysql does not complain.

SELECT [LEFT].[ID] FROM [TEST] AS [LEFT], [TEST] AS [RIGHT] 
WHERE [LEFT].[ID] != [RIGHT].[ID] 
AND [LEFT].[TEST1] = [RIGHT].[TEST2];

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow