Question

A question that arose in a chat discussion:

I know hash join bailout switches internally to a sort of nested loops thing.

What does SQL Server do for a hash aggregate bailout (if it can happen at all)?

Was it helpful?

Solution

Hash join and hash aggregate both use the same operator code internally, though a hash aggregate uses only a single (build) input. The basic operation of hash aggregate is described by Craig Freedman:

As with hash join, the hash aggregate requires memory. Before executing a query with a hash aggregate, SQL Server uses cardinality estimates to estimate how much memory we need to execute the query. With a hash join, we store each build row, so the total memory requirement is proportional to the number and size of the build rows. The number of rows that join and the output cardinality of the join have no effect on the memory requirement of the join. With a hash aggregate, we store one row for each group, so the total memory requirement is actually proportional to the number and size of the output groups or rows. If we have fewer unique values of the group by column(s) and fewer groups, we need less memory. If we have more unique values of the group by column(s) and more groups, we need more memory.

He goes on to talk about hash recursion:

So, what happens if we run out of memory? Again, like hash join, if we run out of memory, we must begin spilling rows to tempdb. We spill one or more buckets or partitions including any partially aggregated results along with any additional new rows that hash to the spilled buckets or partitions. Although we do not attempt to aggregate the spilled new rows, we do hash them and divide them up into several buckets or partitions. Once we’ve finished processing all input groups, we output the completed in-memory groups and repeat the algorithm by reading back and aggregating one spilled partition at a time. By dividing the spilled rows into multiple partitions, we reduce the size of each partition and, thus, reduce the risk that the algorithm will need to repeat many times.

Bailout

Hash bailout is lightly documented, but mentioned by Nacho Alonso Portillo in What’s the maximum level of recursion for the hash iterator before forcing bail-out?

The value is a constant, hard coded in the product, and its value is five (5). This means that before the hash scan operator resorts to a sort based algorithm for any given subpartition that doesn’t fit into the granted memory from the workspace, five previous attempts to subdivide the original partition into smaller partitions must have happened.

The "hash scan operator" mentioned there is a reference to the internal class CQScanHash in sqlmin.dll. This class heads the implementation of the hash operator (in all its forms, including partial aggregates and flow distinct) we see in execution plans.

Bailout algorithm

This brings us to the heart of your questions - what exactly does the bailout algorithm do? Is it "sort based" or based on "a sort of nested loops thing"?

It is arguably both, depending on your point of view. When hash recursion reaches level 5, the in-memory hash partition changes from being a hash table to an initially empty b-tree index on the hash values. Each row from a single previously-spilled hash partition is looked up in the b-tree index and inserted (new group) or updated (maintaining aggregates) as appropriate.

This series of unordered inserts to a b-tree can equally well be seen as an insertion sort or as an indexed nested loops lookup.

In any case, this fallback algorithm is guaranteed to complete eventually without allocating more memory. It may require multiple passes if the space available for the b-tree is not sufficient to hold all the grouping keys and aggregates from the overflow partition.

Once the memory available to hold the b-tree index is exhausted, any further rows (from the current spilled partition) are sent to a single new tempdb partition (which is guaranteed to be smaller) and the process repeats as necessary. The spill level remains at 5 because hash recursion has ended. Some processing details can be observed with undocumented trace flag 7357.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top