Question

I'm trying to perform an aggregation over a large set of manually "partitioned" tables. I can do it with T-SQL along the following lines:

SELECT A, B, C, COUNT(*)
FROM
(
    SELECT ...
UNION ALL
    SELECT ...
UNION ALL
    SELECT ...
-- and many more!
) X
GROUP BY A, B, C

My concern is that SQL Server seems to kick off ALL of the nested SELECTs simultaneously. Is there a pattern that would make the nested SELECTs run one after the other, to reduce resource contention on the server?

What I DON'T want (and suspect is happening) is for all of the sub-SELECTs to run in parallel with most of their output being buffered, though I'm not sure how to prove that this is what happens.
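One possible way to check this, assuming SQL Server 2014 or later, is to poll the sys.dm_exec_query_profiles DMV from a second session while the query runs. The sketch below is only an illustration: the session id is a hypothetical placeholder (take the real one from SELECT @@SPID in the session running the query), and that session needs profiling switched on, for example with SET STATISTICS PROFILE ON, before the DMV returns anything for it.

```sql
-- Session 1 (the one running the big query): enable profiling first.
SET STATISTICS PROFILE ON;
-- Then run the UNION ALL aggregation in this same session.

-- Session 2: poll per-operator progress while session 1 is still running.
DECLARE @session_id int = 53;  -- hypothetical; use the @@SPID of session 1

SELECT node_id,
       physical_operator_name,
       thread_id,
       row_count,            -- rows returned by this operator so far
       estimate_row_count,   -- optimizer's estimate, for comparison
       open_time,            -- non-zero once the operator has been opened
       close_time            -- non-zero once the operator has finished
FROM sys.dm_exec_query_profiles
WHERE session_id = @session_id
ORDER BY node_id, thread_id;
```

If several branches under the Concatenation operator show a growing row_count across successive polls, that would suggest they really are being driven at the same time; if only one branch advances at a time, the branches are being consumed in sequence.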

The best I can come up with so far is an explicit temp table (or table variable) to which the output of each SELECT is written independently, followed by an aggregation over that table. However, that would materialise many more rows than is really necessary. I WANT the output to stream into the aggregation, so that very little intermediate storage is required.

(The nested SELECTs are actually quite complex self-joins, crafted so that they result in a merge join and keep the intermediate results that must be held in memory or paged to a minimum.)

Does anyone know of a better pattern for achieving this?


Solution 2

Putting @i-one and @t-clausen.dk together + a MERGE ended up being the best answer for me:

SELECT A, B, C, COUNT(*) cnt
INTO #tmp
FROM ...
GROUP BY A,B,C

ALTER TABLE #tmp ADD CONSTRAINT pk_#tmp PRIMARY KEY CLUSTERED (A,B,C)

MERGE INTO #tmp X
USING
(
    SELECT A, B, C, COUNT(*) cnt
    FROM ...
    GROUP BY A,B,C
) I
ON X.A = I.A AND X.B=I.B AND X.C=I.C
WHEN MATCHED THEN UPDATE SET X.cnt= X.cnt + I.cnt
WHEN NOT MATCHED THEN INSERT (A, B, C, cnt)
    VALUES (I.A, I.B, I.C, I.cnt);

-- repeat the MERGE for each remaining partition

SELECT * FROM #tmp

NOTE: This was best FOR ME. The high row counts inside each of the individual SELECTs made this approach worthwhile. Your mileage may vary.
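To make the pattern concrete, here is a minimal self-contained sketch of the same seed-then-MERGE approach. The tables #part1 and #part2 and the name #totals are hypothetical stand-ins for the elided FROM clauses and for #tmp, and the primary key is left unnamed here only to avoid constraint-name clashes if several sessions run the script at once.

```sql
-- Hypothetical partition tables standing in for the elided FROM clauses.
CREATE TABLE #part1 (A int NOT NULL, B int NOT NULL, C int NOT NULL);
CREATE TABLE #part2 (A int NOT NULL, B int NOT NULL, C int NOT NULL);

INSERT INTO #part1 (A, B, C) VALUES (1, 1, 1), (1, 1, 1), (2, 2, 2);
INSERT INTO #part2 (A, B, C) VALUES (1, 1, 1), (3, 3, 3);

-- Seed the running totals from the first partition.
SELECT A, B, C, COUNT(*) AS cnt
INTO #totals
FROM #part1
GROUP BY A, B, C;

-- Key the totals so each MERGE can match on (A, B, C) efficiently.
ALTER TABLE #totals ADD PRIMARY KEY CLUSTERED (A, B, C);

-- Fold the second partition into the totals: add to existing groups,
-- insert groups that have not been seen yet.
MERGE INTO #totals AS X
USING
(
    SELECT A, B, C, COUNT(*) AS cnt
    FROM #part2
    GROUP BY A, B, C
) AS I
ON X.A = I.A AND X.B = I.B AND X.C = I.C
WHEN MATCHED THEN UPDATE SET X.cnt = X.cnt + I.cnt
WHEN NOT MATCHED THEN INSERT (A, B, C, cnt)
    VALUES (I.A, I.B, I.C, I.cnt);

-- Returns (1, 1, 1, 3), (2, 2, 2, 1), (3, 3, 3, 1).
SELECT A, B, C, cnt
FROM #totals
ORDER BY A, B, C;
```

Only one pre-aggregated partition is in flight at any time, and #totals never holds more than one row per distinct (A, B, C) group, which is what keeps the intermediate storage small.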

I still consider SQL Server to be pretty dumb in the way that it seems to over-commit resources by running each part of the UNION ALL in parallel and REQUIRING a work-around such as this. Oh well...

OTHER TIPS

I imagine this could run faster, since each branch is aggregated before its rows reach the outer GROUP BY. I'm not sure, though.

SELECT A, B, C, SUM(cnt) AS cnt
FROM
(
    SELECT A, B, C, COUNT(*) cnt
    FROM ...
    GROUP BY A,B,C
  UNION ALL
    SELECT ...
    FROM ...
    GROUP BY A,B,C
  UNION ALL
    SELECT ...
    FROM ...
    GROUP BY A,B,C
  -- and many more!
) X
GROUP BY A, B, C
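
Another option that may be worth testing with this form of the query is forcing a serial plan with the MAXDOP query hint. Microsoft's showplan operator reference describes Concatenation (the operator that typically implements UNION ALL) as copying rows from its first input and then repeating this for each additional input, so without parallelism the branches would be expected to be read one after another. The sketch below assumes hypothetical tables #partA and #partB in place of the elided FROM clauses; whether giving up parallelism is an acceptable trade is something only measurement on the real workload can show.

```sql
-- Hypothetical partition tables standing in for the elided FROM clauses.
CREATE TABLE #partA (A int NOT NULL, B int NOT NULL, C int NOT NULL);
CREATE TABLE #partB (A int NOT NULL, B int NOT NULL, C int NOT NULL);

INSERT INTO #partA (A, B, C) VALUES (1, 1, 1), (1, 1, 1), (2, 2, 2);
INSERT INTO #partB (A, B, C) VALUES (1, 1, 1), (3, 3, 3);

-- Pre-aggregate each branch, sum the partial counts, and ask for a serial plan.
SELECT A, B, C, SUM(cnt) AS cnt
FROM
(
    SELECT A, B, C, COUNT(*) AS cnt
    FROM #partA
    GROUP BY A, B, C
  UNION ALL
    SELECT A, B, C, COUNT(*) AS cnt
    FROM #partB
    GROUP BY A, B, C
) X
GROUP BY A, B, C
OPTION (MAXDOP 1);  -- limit this statement to a single scheduler
```

Comparing the actual execution plan with and without the hint should show whether the exchange (Parallelism) operators disappear from the branches.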
Licensed under: CC-BY-SA with attribution