Question

In the book Querying MS SQL Server 2012 (Training Kit) for exam 70-461, the author says this about the GROUP BY phase of logical query processing:

The final result of this query has one row representing each group (unless filtered out). Therefore, expressions in all phases that take place after the current grouping phase are somewhat limited. All expressions processed in subsequent phases must guarantee a single value per group. If you refer to an element from the GROUP BY list (for example, country), you already have such a guarantee, so such a reference is allowed. However, if you want to refer to an element that is not part of your GROUP BY list (for example, empid), it must be contained within an aggregate function like MAX or SUM. That’s because multiple values are possible in the element within a single group, and the only way to guarantee that just one will be returned is to aggregate the values.

The author then mentions the HAVING step, where he uses COUNT(*) > 1. My question is: if the GROUP BY produces only one row per group, how can the HAVING phase use that single row to filter out groups that have more than one row? It clearly does, since only half of the groups remain. Am I missing something here? Is there some sort of hidden COUNT column attached to each group?

The query is:

SELECT country, YEAR(hiredate) AS yearhired, COUNT(*) AS numemployees
FROM HR.Employees
WHERE hiredate >= '20030101'
GROUP BY country, YEAR(hiredate)
HAVING COUNT(*) > 1
ORDER BY country , yearhired DESC;

Please enlighten.

Solution

When the author refers to one row per group in the GROUP BY, he is talking about the result set. When he refers to the number of rows per group in the HAVING, he is talking about the input rows that make up each group.

Imagine this simple data set

Col1    Col2    Value
----------------------
  a       a       1
  a       b       1
  a       b       1
  a       b       2
  a       c       1
  a       c       5

As you can see, there are three distinct tuples for (Col1, Col2): (a, a), (a, b) and (a, c). Therefore, if you GROUP BY Col1, Col2 you will get three rows in your result (one per group).

SELECT  Col1, Col2
FROM    T
GROUP BY Col1, Col2;

Gives

Col1    Col2    
-------------
  a       a   
  a       b   
  a       c  

This is what the author is referring to when saying "one row per group".

However, going back to the input, you can see that there are three rows with the tuple (a, b) and two with (a, c). These input rows are what COUNT(*) counts, not the rows in the result set.
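As a minimal sketch of this, the query below counts the input rows behind each group and then filters on that count. The sample data is built inline with a VALUES table constructor (an assumption made so the example is self-contained; it requires SQL Server 2008 or later), and the NumRows alias is only illustrative.

```sql
-- Sample data from the table above, built inline so the query stands alone.
SELECT  Col1, Col2, COUNT(*) AS NumRows   -- COUNT(*) counts input rows per group
FROM    (VALUES ('a', 'a', 1),
                ('a', 'b', 1),
                ('a', 'b', 1),
                ('a', 'b', 2),
                ('a', 'c', 1),
                ('a', 'c', 5)) AS T (Col1, Col2, Value)
GROUP BY Col1, Col2
HAVING  COUNT(*) > 1;                     -- keep only groups built from 2+ input rows
-- Returns two rows: (a, b, 3) and (a, c, 2).
-- The (a, a) group has a single input row, so it is filtered out.
```

Each surviving group is still a single row in the result; the count is simply a value computed for that group.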

Any aggregate functions (whether in the HAVING or in the SELECT) are calculated at the same time as the GROUP BY, not in their respective phases (HAVING, SELECT). They are part of the same operation, which is how the number of rows in each group is already known by the time it is used in the SELECT or HAVING.
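One way to picture this is a logically equivalent rewrite in which the aggregate is computed as a column during grouping and the filter is applied afterwards with an ordinary WHERE. This is only a sketch of the logical model, not a claim about how the engine physically executes the query; the derived table alias G and the column NumRows are names made up for illustration.

```sql
-- Step 1 (inner query): group the rows and compute the aggregate per group.
-- Step 2 (outer query): filter on the already-computed value, as HAVING would.
SELECT  G.Col1, G.Col2
FROM   (SELECT  Col1, Col2, COUNT(*) AS NumRows
        FROM    (VALUES ('a', 'a', 1),
                        ('a', 'b', 1),
                        ('a', 'b', 1),
                        ('a', 'b', 2),
                        ('a', 'c', 1),
                        ('a', 'c', 5)) AS T (Col1, Col2, Value)
        GROUP BY Col1, Col2) AS G
WHERE   G.NumRows > 1;
-- Returns (a, b) and (a, c), the same groups that HAVING COUNT(*) > 1 keeps.
```

In that sense the "hidden COUNT column" from the question is a reasonable mental model: the value exists per group even when it is not listed in the SELECT.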

There is a very good answer on Stack Overflow explaining how aggregates work behind the scenes for further reading, so I won't repeat it here.

Licensed under: CC-BY-SA with attribution