advantages in specifying HASH JOIN over just doing a JOIN?

https://stackoverflow.com/questions/10717549

10-06-2021

Question

What are the advantages, if any, of explicitly doing a HASH JOIN over a regular JOIN (wherein SQL Server will decide the best JOIN strategy)? Eg:

select pd.*
from profiledata pd
inner hash join profiledatavalue val on val.profiledataid=pd.id

In the simplistic sample code above, I'm specifying the JOIN strategy, whereas if I leave off the "hash" keyword, SQL Server will do a MERGE JOIN behind the scenes (per the "actual execution plan").

Solution

The optimiser does a good enough job for everyday use. However, finding the perfect plan could in theory take it 3 weeks in an extreme case, so it settles for a plan that is good enough, and there is a chance that the generated plan will not be ideal.

I'd leave it alone unless you have a very complex query or huge amounts of data where it simply can't produce a good plan. Then I'd consider it.

But over time, as data changes/grows or indexes change etc, your JOIN hint will become obsolete and prevent an optimal plan. A JOIN hint can only optimise for that single query at the time of development with the set of data you have then.

Personally, I've never specified a JOIN hint in any production code.

I've normally solved a bad join by changing my query around, adding/changing an index or breaking it up (eg load a temp table first). Or my query was just wrong, or I had an implicit data type conversion, or it highlighted a flaw in my schema etc.
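As a rough illustration of the "load a temp table first" approach, here is a minimal T-SQL sketch using the tables from the question. The filter column `createdon` is a hypothetical name made up for the example, not something from the original schema.

```sql
-- Step 1: stage the small, filtered set of rows.
-- ASSUMPTION: profiledata.createdon is a hypothetical column used only for illustration.
SELECT pd.id
INTO #recent_pd
FROM profiledata AS pd
WHERE pd.createdon >= DATEADD(DAY, -7, SYSDATETIME());

-- Step 2: index the staged rows so the optimiser has an access path on the join column.
CREATE CLUSTERED INDEX ix_recent_pd ON #recent_pd (id);

-- Step 3: join against the staged rows with no join hint at all.
SELECT val.*
FROM #recent_pd AS r
INNER JOIN profiledatavalue AS val
    ON val.profiledataid = r.id;

DROP TABLE #recent_pd;
```

The temp table gets its own statistics, so the optimiser sees the real row count of the filtered set when it picks a join strategy for the second query.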

I've seen other developers use them but only where they had complex views nested upon complex views and they caused later problems when they refactored.

Edit:

I had a conversation today where some colleagues are going to use them to force a bad query plan (with NOLOCK and MAXDOP 1) to "encourage" migration away from legacy complex nested views that one of their downstream systems calls directly.

Other tips

When to try a hash hint, how about:

After checking that adequate indices exist on at least one of the tables.
After having tried to re-arrange the query. Things like converting joins to "in" or "exists", changing join order (which is only really a hint anyway), moving logic from where clause to join condition, etc.
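For instance, since the query in the question only returns columns from `profiledata`, the join could be rewritten with EXISTS along these lines (a sketch, not a drop-in replacement):

```sql
-- Semi-join form: return each profiledata row that has at least one matching value row.
SELECT pd.*
FROM profiledata AS pd
WHERE EXISTS (
    SELECT 1
    FROM profiledatavalue AS val
    WHERE val.profiledataid = pd.id
);
```

Note that the results can differ from the original: the inner join returns one row per matching `profiledatavalue` row, while the EXISTS form returns each `profiledata` row at most once.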

As a basic rule, a hash join is effective when the join condition is not covered by an index on either table and when the table sizes are very different. If you are looking for a technical description, there are some good write-ups out there about how a hash join works.

Why use any join hints (hash/merge/loop with side effect of force order)?

To avoid extremely slow execution (.5 -> 10.0s) of corner cases.
When the optimizer consistently chooses a mediocre plan.
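On the "side effect of force order" point: a hint written in the FROM clause also fixes the join order for the whole query, whereas the query-level form in the OPTION clause only restricts which join algorithms may be used. A short sketch of both forms, using the tables from the question:

```sql
-- Join-level hint: this join uses a hash join, and the join order
-- of the query is enforced as written.
SELECT pd.*
FROM profiledata AS pd
INNER HASH JOIN profiledatavalue AS val
    ON val.profiledataid = pd.id;

-- Query-level hint: every join in the statement must be a hash join,
-- but the optimizer is still free to choose the join order.
SELECT pd.*
FROM profiledata AS pd
INNER JOIN profiledatavalue AS val
    ON val.profiledataid = pd.id
OPTION (HASH JOIN);
```

With only two tables the difference is small, but in a query with many joins the query-level form leaves the optimizer more room.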

A supplied hint is likely to be non-ideal for some circumstances but provides more consistently predictable runtimes. The expected worst case and best case scenarios should be pre-tested when using a hint. Predictable runtimes are critical for web services, where a rigidly optimized query that nominally runs in [.3s, .6s] is preferred over one that can range over [.25s, 10.0s], for example. Large runtime variances can happen even with freshly updated statistics and best practices followed.

When testing in a development environment, one should turn off "cheating" as well to avoid hot/cold runtime variances. From another post...

CHECKPOINT -- flushes dirty pages to disk
DBCC DROPCLEANBUFFERS -- clears data cache
DBCC FREEPROCCACHE -- clears execution plan cache

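For a single query, the last command has a similar effect to the option(recompile) hint. The difference is scope: DBCC FREEPROCCACHE clears the plan cache for the whole instance, while option(recompile) only forces a fresh plan for the statement that carries it.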
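The commands above affect the whole instance, so they belong on a development server only. For timing one statement with a freshly compiled plan, a narrower sketch (using the query from the question) might look like this:

```sql
-- Report CPU/elapsed time and page reads for the statements that follow.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- OPTION (RECOMPILE) compiles a new plan for this statement only,
-- instead of reusing a cached one.
SELECT pd.*
FROM profiledata AS pd
INNER JOIN profiledatavalue AS val
    ON val.profiledataid = pd.id
OPTION (RECOMPILE);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

This does not clear the data cache, so it only removes the plan-cache part of the hot/cold variance.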

The MAXDOP and loading of the machine can also make a huge difference in runtime. Materialization of CTE into temp tables is also a good locking down mechanism and something to consider.

Hash joins parallelize and scale better than any other join and are great at maximizing throughput in data warehouses.

The only hint I've ever seen in shipping code was OPTION (FORCE ORDER). A stupid bug in the SQL query optimizer would generate a plan that tried to join an unfiltered varchar and a unique identifier. Adding FORCE ORDER caused it to run the filter first.

I know, overloading columns is bad. Sometimes, you've got to live with it.
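A sketch of what such a query might look like. The tables and columns here (`attributes`, `owners`, `owner_key`, `kind`) are entirely hypothetical, since the original query isn't shown. FORCE ORDER is documented as preserving the join order written in the query; whether that also makes the filter run before the conversion depends on the plan, so treat this as an illustration of the syntax rather than a guaranteed fix.

```sql
-- HYPOTHETICAL schema: attributes.owner_key is an overloaded varchar column that
-- only holds GUID text when kind = 'owner'; owners.id is a uniqueidentifier.
SELECT o.id, f.kind
FROM (
    SELECT a.owner_key, a.kind
    FROM attributes AS a
    WHERE a.kind = 'owner'          -- the filter that should be applied first
) AS f
INNER JOIN owners AS o
    ON o.id = CONVERT(uniqueidentifier, f.owner_key)
OPTION (FORCE ORDER);               -- keep the join order as written above
```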

The query optimizer doesn't guarantee that it finds the optimal plan: an exact algorithm would be too slow to use on a production server, so greedy algorithms are used instead.

Hence, the rationale behind these hints is to let the user specify the join strategy in cases where the optimizer can't work out which one is really the best to adopt.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow