Optimization of multiple aggregations in SELECT

https://stackoverflow.com/questions/8002456

21-02-2021

Question

I read in a Microsoft T-SQL Performance Tuning whitepaper that correlated sub-queries can be costly in terms of performance on a large table:

...Compare this to the first solution that would scan the whole table and execute a correlated subquery for every row. The difference in performance is negligible on a small table. But on a large table it may amount to hours of processing time...

Is there a general way to convert a query that computes several aggregations, each under different criteria, as correlated sub-queries into a single query that uses JOINs instead?

Consider an example:

Prepare the schema:

CREATE TABLE Student (
    ID INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    Name NVARCHAR(255) NOT NULL
);

CREATE TABLE Grade (
    ID INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    StudentID INT NOT NULL FOREIGN KEY REFERENCES Student(ID),
    Score INT NOT NULL,
    CONSTRAINT CK_Grade_Score CHECK (Score >= 0 AND Score <= 100)
);

INSERT INTO Student (Name) VALUES ('Steven');
INSERT INTO Student (Name) VALUES ('Timmy');
INSERT INTO Student (Name) VALUES ('Maria');
 
INSERT INTO Grade (StudentID, Score) VALUES (1, 90);
INSERT INTO Grade (StudentID, Score) VALUES (1, 81);
INSERT INTO Grade (StudentID, Score) VALUES (1, 82);
INSERT INTO Grade (StudentID, Score) VALUES (1, 82);

INSERT INTO Grade (StudentID, Score) VALUES (2, 99);
INSERT INTO Grade (StudentID, Score) VALUES (2, 63);
INSERT INTO Grade (StudentID, Score) VALUES (2, 97);
INSERT INTO Grade (StudentID, Score) VALUES (2, 90);

INSERT INTO Grade (StudentID, Score) VALUES (3, 66);
INSERT INTO Grade (StudentID, Score) VALUES (3, 61);
INSERT INTO Grade (StudentID, Score) VALUES (3, 60);

The query in question:

SELECT Name,
    (SELECT AVG(Score) FROM Grade WHERE StudentID = Student.ID AND Score < 65) AS 'F',
    (SELECT AVG(Score) FROM Grade WHERE StudentID = Student.ID AND Score >= 65 AND Score < 70) AS 'D',
    (SELECT AVG(Score) FROM Grade WHERE StudentID = Student.ID AND Score >= 70 AND Score < 80) AS 'C',
    (SELECT AVG(Score) FROM Grade WHERE StudentID = Student.ID AND Score >= 80 AND Score < 90) AS 'B',
    (SELECT AVG(Score) FROM Grade WHERE StudentID = Student.ID AND Score >= 90 AND Score <= 100) AS 'A'
FROM Student

Produces the following result:

Name    F     D     C     B     A
-----------------------------------------
Steven  NULL  NULL  NULL  81    90
Timmy   63    NULL  NULL  NULL  95
Maria   60    66    NULL  NULL  NULL

I am aware of the technique for COUNT() where you perform a single SELECT with a JOIN and use a CASE expression to add 1 to a counter only when the joined row also satisfies your condition. I am looking for a similar technique that can be applied to other types of aggregation (as opposed to just COUNT).
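
For reference, the COUNT technique described above might look like the sketch below when applied to the Student and Grade tables. It assumes a pass mark of 65 (borrowed from the 'F' boundary in the query); the aliases FailCount and PassCount are made up for illustration.

```sql
-- Conditional counting in a single pass over the joined rows.
-- FailCount and PassCount are hypothetical aliases, not part of the original query.
SELECT st.Name,
       SUM(CASE WHEN g.Score <  65 THEN 1 ELSE 0 END) AS FailCount,
       SUM(CASE WHEN g.Score >= 65 THEN 1 ELSE 0 END) AS PassCount
FROM Student st
  JOIN Grade g ON g.StudentID = st.ID
GROUP BY st.Name;
```

With the sample data above this should give 0 and 4 for Steven, 1 and 3 for Timmy, and 2 and 1 for Maria.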

Is there an effective way to convert this example query to use a `JOIN` instead of multiple sub-queries?

Solution

Maybe I'm missing something, but the solution using a CASE does work for aggregates as well:

SELECT st.name, 
       avg(CASE WHEN g.score < 65 THEN g.score ELSE NULL END) as F,
       avg(CASE WHEN g.score >= 65 AND g.score < 70 THEN g.score ELSE NULL END) as D,
       avg(CASE WHEN g.score >= 70 AND g.score < 80 THEN g.score ELSE NULL END) as C,
       avg(CASE WHEN g.score >= 80 AND g.score < 90 THEN g.score ELSE NULL END) as B,
       avg(CASE WHEN g.score >= 90 AND g.score <= 100 THEN g.score ELSE NULL END) as A
FROM Grade g
  JOIN Student st ON g.studentid = st.ID
GROUP BY st.name
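
This works because aggregates such as AVG skip NULL inputs, so each CASE expression acts as a per-column filter. The sketch below illustrates that behaviour on an inline list of values (the derived table t(v) exists only for this example). It also shows that in SQL Server AVG over an INT returns an INT, so a cast is one option if fractional averages are wanted.

```sql
-- AVG ignores the NULL row: it averages 10 and 21 only.
SELECT AVG(v)                        AS IntAvg,      -- 15: integer result, fraction dropped
       AVG(CAST(v AS DECIMAL(5, 2))) AS DecimalAvg   -- 15.5, displayed with trailing zeros
FROM (VALUES (10), (NULL), (21)) AS t(v);
```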

OTHER TIPS

I tried something like the following, using a CTE, but the result is a bit different from what you got, because it calculates the average over all of a student's grades:

;WITH
Scores(ID,Score) AS(
    SELECT S.ID,AVG(Score)
    FROM Student S
    JOIN Grade G
        ON S.ID = G.StudentID
    GROUP BY S.ID)

SELECT ST.Name
    ,CASE WHEN S.Score < 65 THEN S.Score ELSE NULL END AS 'F'
    ,CASE WHEN S.Score >= 65 AND S.Score < 70 THEN S.Score ELSE NULL END AS 'D'
    ,CASE WHEN S.Score >= 70 AND S.Score < 80 THEN S.Score ELSE NULL END AS 'C'
    ,CASE WHEN S.Score >= 80 AND S.Score < 90 THEN S.Score ELSE NULL END AS 'B'
    ,CASE WHEN S.Score >= 90 AND S.Score <= 100 THEN S.Score ELSE NULL END AS 'A'
FROM Scores S
JOIN Student ST
    ON S.ID = ST.ID

Try this:

SELECT s.Name
    ,SUM(CASE Score_g WHEN 'F' THEN Score_avg END) as 'F'
    ,SUM(CASE Score_g WHEN 'D' THEN Score_avg END) as 'D'
    ,SUM(CASE Score_g WHEN 'C' THEN Score_avg END) as 'C'
    ,SUM(CASE Score_g WHEN 'B' THEN Score_avg END) as 'B'
    ,SUM(CASE Score_g WHEN 'A' THEN Score_avg END) as 'A'
FROM Student s,
     (
      SELECT StudentId, score_g, avg(score) as score_avg
      FROM  (
            SELECT StudentID, Score,
            CASE
              WHEN Score < 65                   THEN 'F'
              WHEN Score >= 65 AND Score < 70   THEN 'D'
              WHEN Score >= 70 AND Score < 80   THEN 'C'
              WHEN Score >= 80 AND Score < 90     THEN 'B'
              WHEN Score >= 90 AND Score <= 100 THEN 'A'
              ELSE 'X'
            END AS Score_g
            FROM Grade
        ) g
       GROUP BY StudentId, score_g
    ) t
WHERE s.ID = t.StudentID
GROUP BY s.Name

If you really hate subqueries, you can use:

SELECT s.name
    ,AVG(CASE WHEN Score < 65                   THEN Score END) AS 'F'
    ,AVG(CASE WHEN Score >= 65 AND Score < 70   THEN Score END) AS 'D'
    ,AVG(CASE WHEN Score >= 70 AND Score < 80   THEN Score END) AS 'C'
    ,AVG(CASE WHEN Score >= 80 AND Score < 90   THEN Score END) AS 'B'
    ,AVG(CASE WHEN Score >= 90 AND Score <= 100 THEN Score END) AS 'A'
FROM Grade g, Student s
WHERE g.StudentID = s.ID
GROUP BY s.name

But in this case the Student table has to contain a unique entry for each student entity: because the query groups by s.name, students who share a name would be merged into one row.
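
One possible way to lift that restriction, sketched here on the assumption that Student.ID is the primary key as in the schema from the question, is to group by the key as well as the name. Starting from Student with a LEFT JOIN also keeps students who have no grades at all (all five columns come back NULL for them), which is what the original correlated sub-queries return.

```sql
-- Group by the primary key so that two students with the same name stay separate.
-- LEFT JOIN keeps students without any Grade rows; their averages are NULL.
SELECT s.Name
    ,AVG(CASE WHEN g.Score < 65                     THEN g.Score END) AS F
    ,AVG(CASE WHEN g.Score >= 65 AND g.Score < 70   THEN g.Score END) AS D
    ,AVG(CASE WHEN g.Score >= 70 AND g.Score < 80   THEN g.Score END) AS C
    ,AVG(CASE WHEN g.Score >= 80 AND g.Score < 90   THEN g.Score END) AS B
    ,AVG(CASE WHEN g.Score >= 90 AND g.Score <= 100 THEN g.Score END) AS A
FROM Student s
  LEFT JOIN Grade g ON g.StudentID = s.ID
GROUP BY s.ID, s.Name;
```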

If your DBMS supports PIVOT, you could also try something like this:

;WITH marked AS (
  SELECT
    StudentID,
    Score,
    Mark = CASE
      WHEN Score < 65 THEN 'F'
      WHEN Score < 70 THEN 'D'
      WHEN Score < 80 THEN 'C'
      WHEN Score < 90 THEN 'B'
      ELSE 'A'
    END
  FROM Grade
),
pivoted AS (
  SELECT
    StudentID,
    F, D, C, B, A
  FROM marked m
  PIVOT (
    AVG(Score) FOR Mark IN (F, D, C, B, A)
  ) p
)
SELECT
  s.Name,
  p.F,
  p.D,
  p.C,
  p.B,
  p.A
FROM Student s
  INNER JOIN pivoted p ON s.ID = p.StudentID

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
