Equality of “select … where in” and joins
Question
Suppose I have a table1 like this:
id | itemcode
-------------
1 | c1
2 | c2
...
And a table2 like this:
item | name
-----------
c1 | acme
c2 | foo
...
Would the following two queries return the same result set under every condition?
SELECT id, itemcode
FROM table1
WHERE itemcode IN (SELECT DISTINCT item
                   FROM table2
                   WHERE name [some arbitrary test])

SELECT id, itemcode
FROM table1
JOIN (SELECT DISTINCT item
      FROM table2
      WHERE name [some arbitrary test]) items
ON table1.itemcode = items.item
Unless I'm really missing something stupid, I'd say yes. But I've done two queries which boil down to this form and I am getting different results. There are some nested queries using WHERE IN, but for the last step I've noticed a JOIN is much faster. The nested queries are all entirely isolated so I don't believe they are the problem, so I just want to eliminate the possibility that I've got a misconception regarding the above.
Thanks for any insights.
EDIT
The two original queries:
SELECT imitm, imlitm, imglpt
FROM jdedata.F4101
WHERE imitm IN
  (SELECT DISTINCT ivitm AS itemno
   FROM jdedata.F4104
   WHERE ivcitm IN
     (SELECT DISTINCT ivcitm AS legacycode
      FROM jdedata.F4104
      WHERE ivitm IN
        (SELECT DISTINCT tritm
         FROM trigdata.F4101_TRIG)
     )
  )

SELECT orig.imitm, orig.imlitm, orig.imglpt
FROM jdedata.F4101 orig
JOIN
  (SELECT DISTINCT ivitm AS itemno
   FROM jdedata.F4104
   WHERE ivcitm IN
     (SELECT DISTINCT ivcitm AS legacycode
      FROM jdedata.F4104
      WHERE ivitm IN
        (SELECT DISTINCT tritm
         FROM trigdata.F4101_TRIG))) itemns
ON orig.imitm = itemns.itemno
EDIT 2
Although I still don't understand why the queries returned different results, it would seem our logic was flawed from the beginning since we were using the wrong columns in some parts. Mind that I'm not saying I made a mistake interpreting the queries as written above or had some typo, we just needed to select on some different stuff.
Normally I don't rest until I get to the bottom of things like these, but I'm very tired and am entering my first vacation since January that spans more than one day, so I can't really be bothered searching further right now. I'm sure the tips given here will come in handy later. Upvotes have been distributed for all the help and I've accepted Ypercube's answer, mostly because his comments have led me the furthest. But thanks all round! If I do find out more later, I'll try to remember pinging back in.
Solution
Since table2.item is not nullable, the two versions are equivalent. You can remove the DISTINCT from the IN version; it's not needed. You can check these 3 versions and their execution plans:
SELECT id, itemcode FROM table1 WHERE itemcode IN
  ( SELECT item FROM table2 WHERE name [some arbitrary test] )

SELECT id, itemcode FROM table1 JOIN
  ( SELECT DISTINCT item FROM table2 WHERE name [some arbitrary test] )
  items ON table1.itemcode = items.item

SELECT id, itemcode FROM table1 WHERE EXISTS
  ( SELECT * FROM table2 WHERE table1.itemcode = table2.item
    AND (name [some arbitrary test]) )
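To try the three versions side by side without touching production data, a minimal sketch using Python's built-in sqlite3 module follows. The sample rows and the predicate name LIKE 'a%' (standing in for [some arbitrary test]) are assumptions made up for illustration, and SQLite is not the asker's database, so this only checks the logic, not the execution plans.

```python
import sqlite3

# Hypothetical sample data; "name LIKE 'a%'" stands in for [some arbitrary test].
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, itemcode TEXT);
    CREATE TABLE table2 (item TEXT NOT NULL, name TEXT);
    INSERT INTO table1 VALUES (1, 'c1'), (2, 'c2'), (3, 'c3');
    -- 'c1' appears twice in table2 on purpose, to show why the JOIN keeps DISTINCT.
    INSERT INTO table2 VALUES ('c1', 'acme'), ('c1', 'apex'), ('c2', 'foo');
""")

queries = {
    "in": """
        SELECT id, itemcode FROM table1 WHERE itemcode IN
          (SELECT item FROM table2 WHERE name LIKE 'a%')
    """,
    "join": """
        SELECT id, itemcode FROM table1 JOIN
          (SELECT DISTINCT item FROM table2 WHERE name LIKE 'a%') items
          ON table1.itemcode = items.item
    """,
    "exists": """
        SELECT id, itemcode FROM table1 WHERE EXISTS
          (SELECT * FROM table2 WHERE table1.itemcode = table2.item
           AND (name LIKE 'a%'))
    """,
}

# Run each version and sort the rows so the result sets can be compared directly.
results = {label: sorted(conn.execute(sql).fetchall()) for label, sql in queries.items()}
for label, rows in results.items():
    print(label, rows)
```

With this data all three versions return the single row for id 1, even though two rows in table2 match it.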
OTHER TIPS
Ideally I would want to see the differences between the result sets.
- Are you getting duplication of records
- Is one set always a sub-set of the other
- Does one set have both 'additional' and 'missing' records in comparison to the other?
That said, the logic should be equivalent. My best guess would be that you have some empty string entries in there, because Oracle treats an empty CHAR/VARCHAR string as NULL. This can give very funky results if you're not prepared for it.
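One way to answer the three questions above is to diff the two result sets directly. The sketch below uses Python's sqlite3 module with two hypothetical tables, result_a and result_b, standing in for the saved output of each query; the rows are invented for illustration. SQLite spells the set-difference operator EXCEPT, while Oracle traditionally spells it MINUS.

```python
import sqlite3

# result_a / result_b are hypothetical stand-ins for the saved output of the two queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE result_a (id INTEGER, itemcode TEXT);
    CREATE TABLE result_b (id INTEGER, itemcode TEXT);
    INSERT INTO result_a VALUES (1, 'c1'), (2, 'c2');
    INSERT INTO result_b VALUES (1, 'c1'), (1, 'c1'), (3, 'c3');
""")

# Rows present in A but missing from B, and the other way round.
only_in_a = conn.execute(
    "SELECT id, itemcode FROM result_a EXCEPT SELECT id, itemcode FROM result_b"
).fetchall()
only_in_b = conn.execute(
    "SELECT id, itemcode FROM result_b EXCEPT SELECT id, itemcode FROM result_a"
).fetchall()

# Rows that occur more than once in B (duplication rather than extra keys).
duplicates_in_b = conn.execute(
    "SELECT id, itemcode, COUNT(*) FROM result_b "
    "GROUP BY id, itemcode HAVING COUNT(*) > 1"
).fetchall()

print("only in A:", only_in_a)
print("only in B:", only_in_b)
print("duplicated in B:", duplicates_in_b)
```

If both difference queries come back empty and only the duplicate check reports rows, the two queries agree on which rows match and differ only in how many times each is returned.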
Both queries perform a semijoin, i.e. no attributes from table2 appear in the topmost SELECT (the resultset).
To my eye, your first query is easiest to identify as a semijoin, EXISTS even more so. On the other hand, an optimizer would no doubt see it differently ;)
You can also try a direct join to the second table:
SELECT DISTINCT id, itemcode
FROM table1
INNER JOIN table2 ON table1.itemcode = table2.item
WHERE name [some arbitrary test]
You don't need DISTINCT if item is a primary key or unique.
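The effect of DISTINCT on a direct join can be seen with a small sketch in Python's sqlite3 module. The sample rows and the predicate name LIKE 'a%' are assumptions chosen so that one item matches twice in table2; they are not taken from the asker's data.

```python
import sqlite3

# Hypothetical data: item 'c1' is deliberately not unique in table2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, itemcode TEXT);
    CREATE TABLE table2 (item TEXT NOT NULL, name TEXT);
    INSERT INTO table1 VALUES (1, 'c1'), (2, 'c2');
    INSERT INTO table2 VALUES ('c1', 'acme'), ('c1', 'apex'), ('c2', 'foo');
""")

join_sql = """
    SELECT {distinct} id, itemcode
    FROM table1
    INNER JOIN table2 ON table1.itemcode = table2.item
    WHERE name LIKE 'a%'
"""

# Without DISTINCT, each matching table2 row yields its own output row.
plain_rows = sorted(conn.execute(join_sql.format(distinct="")).fetchall())
# With DISTINCT, repeated (id, itemcode) pairs collapse into one.
distinct_rows = sorted(conn.execute(join_sql.format(distinct="DISTINCT")).fetchall())

print("plain join:   ", plain_rows)
print("distinct join:", distinct_rows)
```

Here the plain join returns the row for id 1 twice, once per matching table2 row, while the DISTINCT version returns it once.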
EXISTS and INNER JOIN should run at about the same speed, while IN is often more expensive; how large the gap is depends on the optimizer, so compare the execution plans.
I'd look for some data type conversion in there.
create table t_vc (val varchar2(6));
create table t_c (val char(6));
insert into t_vc values ('12345');
insert into t_vc values ('12345 ');
insert into t_c values ('12345');
insert into t_c values ('12345');
select t_c.val||':'
from t_c
where val in (select distinct val from t_vc);
select c.val||':'
from t_vc v join (select distinct val from t_c) c on v.val=c.val;