Performance of SQL “EXISTS” usage variants

https://stackoverflow.com/questions/424212

05-07-2019

Question

Is there any difference in the performance of the following three SQL statements?

SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)

SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)

SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)

They all should work and return the same result set. But does it matter if the inner SELECT selects all fields of tableB, one field, or just a constant?

Is there a best practice when all statements behave the same?

Solution

The truth about the EXISTS clause is that the SELECT list inside it is not evaluated. You could try:

SELECT * 
  FROM tableA 
 WHERE EXISTS (SELECT 1/0 
                 FROM tableB 
                WHERE tableA.x = tableB.y)

...and you would expect a divide-by-zero error, but you won't get one, because the expression is never evaluated. This is why my habit is to specify NULL in an EXISTS, to demonstrate that the SELECT list can be ignored:

SELECT * 
  FROM tableA 
 WHERE EXISTS (SELECT NULL
                 FROM tableB 
                WHERE tableA.x = tableB.y)

All that matters in an EXISTS clause is the FROM clause and everything after it: WHERE, GROUP BY, HAVING, etc.

This question wasn't tagged with a specific database, and it should be, because vendors handle things differently. So test, and check the explain/execution plans to confirm. Behavior may also change between versions.
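A minimal sketch of that check, assuming Python's built-in sqlite3 module and made-up sample rows (neither comes from the question). SQLite evaluates 1/0 to NULL instead of raising an error, so the sketch swaps in a hypothetical counting function, probe(), for the divide-by-zero trick. The printed call count shows whether the SELECT list was evaluated on this engine; treat it as an observation about SQLite only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (x INTEGER);
    CREATE TABLE tableB (y INTEGER, payload TEXT);
    INSERT INTO tableA VALUES (1), (2), (3);
    INSERT INTO tableB VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

probe_calls = []  # one entry is appended each time SQLite invokes probe()

def probe():
    """Hypothetical side-effect function standing in for the 1/0 test."""
    probe_calls.append(1)
    return 1

conn.create_function("probe", 0, probe)

# The same outer query, varying only the SELECT list inside EXISTS.
results = {}
for select_list in ["*", "y", "1", "NULL", "probe()"]:
    sql = (
        "SELECT x FROM tableA WHERE EXISTS "
        f"(SELECT {select_list} FROM tableB WHERE tableA.x = tableB.y) "
        "ORDER BY x"
    )
    results[select_list] = conn.execute(sql).fetchall()

for select_list, rows in results.items():
    print(f"EXISTS (SELECT {select_list} ...): {rows}")
print("probe() calls:", len(probe_calls))
```

Every variant should return the same rows. The probe() count is the engine-specific part, so rerun the idea on your own DBMS instead of relying on this result.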

OTHER TIPS

Definitely #1. It "looks" scary, but realize the optimizer will do the right thing, and SELECT * is expressive of intent. Also, there is a slight typo bonus should one accidentally think EXISTS but type IN. #2 is acceptable but not expressive. The third option stinks, in my not so humble opinion. It's too close to saying "if 'no value' exists" for comfort.

In general, it's important not to be scared to write code that merely looks inefficient if it provides other benefits and does not actually affect performance.

That is, the optimizer will almost always execute the complicated join/select/grouping wizardry you wrote to avoid a simple EXISTS/subquery the same way it would have executed the simple version.

After having given yourself kudos for cleverly rewriting that nasty OR out of a join, you will eventually realize the optimizer still used the same crappy execution plan it uses to resolve the much easier to understand query with the embedded OR.

The moral of the story is: know your platform's optimizer. Try different things and see what is actually being done, because the rampant knee-jerk assumptions regarding 'decorative' query optimization are almost always incorrect and irrelevant, in my experience.

I realize this is an old post, but I thought it important to add clarity about why one might choose one format over another.

First, as others have pointed out, the database engine is supposed to ignore the Select clause. Every version of SQL Server has/does, Oracle does, MySQL does and so on. In many, many moons of database development, I have only ever encountered one DBMS that did not properly ignore the Select clause: Microsoft Access. Specifically, older versions of MS Access (I can't speak to current versions).

Prior to my discovery of this "feature", I used to use Exists( Select *.... However, I discovered that MS Access would stream across every column in the subquery and then discard them (Select 1/0 also would not work). That convinced me to switch to Select 1. If even one DBMS was stupid, another could exist.

Writing Exists( Select 1... is just as clear in conveying intent (it is frankly silly to claim "It's too close to saying 'if no value exists' for comfort") and makes it nearly impossible for a DBMS to do something stupid with the Select list. Select Null would serve the same purpose but is simply more characters to write.

I switched to Exists( Select 1 to make absolutely sure the DBMS couldn't be stupid. However, that was many moons ago, and today I would expect most developers to expect Exists( Select *, which works exactly the same.

That said, I can provide one good reason for avoiding Exists(Select * even if your DBMS evaluates it properly. It is much easier to find and trounce all uses of Select * if you don't have to skip every instance of its use in an Exists clause.
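To illustrate the point, here is a rough sketch of the filter you would need if Exists( Select * is allowed in a codebase. The helper name find_select_star and both regular expressions are hypothetical, and the pattern only handles the simple case where Select * directly follows Exists(. It is not a SQL parser.

```python
import re

SELECT_STAR = re.compile(r"\bSELECT\s*\*", re.IGNORECASE)
# Matches text that ends with "EXISTS (" plus optional whitespace.
EXISTS_PREFIX = re.compile(r"\bEXISTS\s*\(\s*$", re.IGNORECASE)

def find_select_star(sql, skip_exists=True):
    """Return offsets of SELECT *; optionally skip those directly inside EXISTS(...)."""
    hits = []
    for match in SELECT_STAR.finditer(sql):
        inside_exists = EXISTS_PREFIX.search(sql[:match.start()]) is not None
        if skip_exists and inside_exists:
            continue
        hits.append(match.start())
    return hits

sample = (
    "SELECT * FROM tableA WHERE EXISTS "
    "(SELECT * FROM tableB WHERE tableA.x = tableB.y)"
)
print("outside EXISTS:", find_select_star(sample))
print("all occurrences:", find_select_star(sample, skip_exists=False))
```

With Exists( Select 1 everywhere, a plain text search for Select * is enough and no such filter is needed.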

In SQL Server at least,

The smallest amount of data that can be read from disk is a single "page" of disk space. As soon as the processor reads one record that satisfies the subquery predicates, it can stop. The subquery is not executed as though it were standing on its own and then included in the outer query; it is executed as part of the complete query plan for the whole thing. So when used as a subquery, it really doesn't matter what is in the Select clause. Nothing is returned to the outer query anyway, except a boolean to indicate whether a single record was found or not.

All three use the exact same execution plan

I always use [Select * From ... ] as I think it reads better, by not implying that I want something in particular returned from the subquery.

EDIT: From Dave Costa's comment: Oracle also uses the same execution plan for all three options.

This is one of those questions that verges on initiating some kind of holy war.

There's a fairly good discussion about it here.

I think the answer is probably to use the third option, but the speed increase is so infinitesimal it's really not worth worrying about. It's easily the kind of query that SQL Server can optimise internally anyway, so you may find that all options are equivalent.

EXISTS returns a boolean, not actual data. That said, best practice is to use #3.

Execution Plan. Learn it, use it, love it

There is no possible way to guess, really.
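As a sketch of what "look at the plan" can mean in practice, the snippet below prints the plan for each of the three variants from the question. It assumes SQLite through Python's sqlite3 module, and the payload column and the index idx_tableB_y are invented for illustration. Other engines use different commands and output formats, so this shows only the workflow, not what your DBMS will do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (x INTEGER);
    CREATE TABLE tableB (y INTEGER, payload TEXT);
    CREATE INDEX idx_tableB_y ON tableB (y);
""")

def plan_for(select_list):
    """Return the detail text of each EXPLAIN QUERY PLAN row for one variant."""
    sql = (
        "EXPLAIN QUERY PLAN SELECT * FROM tableA WHERE EXISTS "
        f"(SELECT {select_list} FROM tableB WHERE tableA.x = tableB.y)"
    )
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in conn.execute(sql)]

plans = {select_list: plan_for(select_list) for select_list in ("*", "y", "1")}
for select_list, steps in plans.items():
    print(f"EXISTS (SELECT {select_list} ...):")
    for step in steps:
        print("   ", step)
```

Compare the printed plans side by side; if they differ on your engine and version, that is the answer for your platform.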

In addition to what others have said, the practice of using SELECT 1 originated on old Microsoft SQL Server (prior to 2005) - its query optimizer wasn't clever enough to avoid physically fetching fields from the table for SELECT *. No other DBMS, to my knowledge, has this deficiency.

EXISTS tests for the existence of rows, not what's in them, so apart from an optimizer quirk like the one above, it doesn't really matter what's in the SELECT list.

The SELECT * seems to be most usual, but others are acceptable as well.

#3 should be the best one, as you don't need the returned data anyway. Bringing the fields will only add extra overhead.

Licensed under: CC-BY-SA with attribution