Equality of “select … where in” and joins

https://stackoverflow.com/questions/7174872

11-01-2021
|

Question

Suppose I have a table1 like this:

id | itemcode
-------------
1  | c1
2  | c2
...

And a table2 like this:

item | name
-----------
c1   | acme
c2   | foo
...

Would the following two queries return the same result set under every condition?

SELECT id, itemcode 
FROM table1 
WHERE itemcode IN (SELECT DISTINCT item 
                  FROM table2 
                  WHERE name [some arbitrary test])

SELECT id, itemcode 
FROM table1 
   JOIN (SELECT DISTINCT item 
          FROM table2
          WHERE name [some arbitrary test]) items 
         ON table1.itemcode = items.item

Unless I'm really missing something stupid, I'd say yes. But I've done two queries which boil down to this form and I am getting different results. There are some nested queries using WHERE IN, but for the last step I've noticed a JOIN is much faster. The nested queries are all entirely isolated so I don't believe they are the problem, so I just want to eliminate the possibility that I've got a misconception regarding the above.

Thanks for any insights.

EDIT

The two original queries:

SELECT imitm, imlitm, imglpt 
    FROM jdedata.F4101 
    WHERE imitm IN 
  (SELECT DISTINCT ivitm AS itemno 
       FROM jdedata.F4104 
       WHERE ivcitm IN 
    (SELECT DISTINCT ivcitm AS legacycode 
             FROM jdedata.F4104 
              WHERE ivitm IN 
      (SELECT DISTINCT tritm 
               FROM trigdata.F4101_TRIG)
    )
  )


SELECT orig.imitm, orig.imlitm, orig.imglpt 
    FROM jdedata.F4101 orig 
        JOIN 
        (SELECT DISTINCT ivitm AS itemno 
        FROM jdedata.F4104 
        WHERE ivcitm IN 
              (SELECT DISTINCT ivcitm AS legacycode 
              FROM jdedata.F4104
               WHERE ivitm IN 
                 (SELECT DISTINCT tritm 
                  FROM trigdata.F4101_TRIG))) itemns 
ON orig.imitm = itemns.itemno

EDIT 2

Although I still don't understand why the queries returned different results, it would seem our logic was flawed from the beginning since we were using the wrong columns in some parts. Mind that I'm not saying I made a mistake interpreting the queries as written above or had some typo, we just needed to select on some different stuff.

Normally I don't rest until I get to the bottom of things like these, but I'm very tired and am entering my first vacation since January that spans more than one day, so I can't really be bothered searching further right now. I'm sure the tips given here will come in handy later. Upvotes have been distributed for all the help and I've accepted Ypercube's answer, mostly because his comments have led me the furthest. But thanks all round! If I do find out more later, I'll try to remember pinging back in.

Solution

Since table2.item is not nullable, the 2 versions are equivalent. You can remove the distinct from the IN version, it's not needed. You can check these 3 versions and their execution plans:

SELECT id, itemcode FROM table1 WHERE itemcode IN
  ( SELECT item FROM table2 WHERE name [some arbitrary test] )

SELECT id, itemcode FROM table1 JOIN
  ( SELECT DISTINCT item FROM table2 WHERE name [some arbitrary test] )
  items ON table1.itemcode = items.item

SELECT id, itemcode FROM table1 WHERE EXISTS
  ( SELECT * FROM table2 WHERE table1.itemcode = table2.item 
                           AND (name [some arbitrary test]) )

OTHER TIPS

Ideally I would want to see the differences between the result sets.
- Are you getting duplication of records
- Is one set always a sub-set of the other
- Does one set have both 'additional' and 'missing' records in comparison to the other?

That said, the logic should be equivilent. My best guess would be that you have some empty string entries in there; because Oracle's version of a NULL CHAR/VARCHAR is just an empty string. This can give very funky results if you're not prepared for it.

Both queries perform a semijoin i.e. no attributes from table2 appear in the topmost SELECT (the resultset).

To my eye, your first query is easiest to identify as a semijoin, EXISTS even more so. On the other hand, an optimizer would no doubt see it differently ;)

You can also try to do a direct join to the second table

SELECT DISTINCT id, itemcode 
FROM table1 
INNER JOIN table2 ON table1.itemcode = table2.item   
WHERE name [some arbitrary test] )

You don't need distinct if item is primary key or unique

Exists and Inner Join should have the same execution speed, while IN is more expensive.

I'd look for some data type conversion in there.

create table t_vc (val varchar2(6));
create table t_c (val char(6));

insert into t_vc values ('12345');
insert into t_vc values ('12345 ');

insert into t_c values ('12345');
insert into t_c values ('12345');

select t_c.val||':'
from t_c
where val in (select distinct val from t_vc);

select c.val||':'
from t_vc v join (select distinct val from t_c) c on v.val=c.val;

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow