Why does SQL force me to repeat all non-aggregated fields from my SELECT clause in my GROUP BY clause? [closed]

StackOverflow https://stackoverflow.com/questions/416625

  •  03-07-2019
  •  | 
  •  

Question

This has bugged me for a long time.

99% of the time, the GROUP BY clause is an exact copy of the SELECT clause, minus the aggregate functions (MAX, SUM, etc.).
This breaks the Don't Repeat Yourself principle.

When can the GROUP BY clause not contain an exact copy of the SELECT clause minus the aggregate functions?

edit

I realise that some implementations allow you to have different fields in the GROUP BY than in the SELECT (hence 99%, not 100%), but surely that's a very minor exception?
Can someone explain what is supposed to be returned if you use different fields?

Thanks.

Was it helpful?

Solution

I tend to agree with you - this is one of many cases where SQL should have slightly smarter defaults to save us all some typing. For example, imagine if this were legal:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By *

where "*" meant "all the non-aggregate fields". If everybody knew that's how it worked, then there would be no confusion. You could sub in a specific list of fields if you wanted to do something tricky, but the splat means "all of 'em" (which in this context means, all the possible ones).

Granted, "*" means something different here than in the SELECT clause, so maybe a different character would work better:

Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By !

There are a few other areas like that where SQL just isn't as eloquent as it could be. But at this point, it's probably too entrenched to make many big changes like that.

OTHER TIPS

Because they are two different things, you can group by items that aren't in the select clause

EDIT:

Also, is it safe to make that assumption?

I have a SQL statement

Select ClientName, InvAmt, Sum(PayAmt) as PayTot

Is it "correct" for the server to assume I want to group by ClientName AND InvoiceAmount? I personally prefer (and think it's safer) to have this code

Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Group By ClientName

throw an error, prompting me to change the code to

Select ClientName, Sum(InvAmt) as InvTot, Sum(PayAmt) as PayTot
Group By ClientName

I hope/expect we'll see something more comprehensive soon; a SQL history lesson on the subject would be useful and informative. Anyone? Anyone? Bueller?

In the meantime, I can observe the following:

SQL predates the DRY principle, at least as far as it it was documented in The Pragmatic Programmer.

Not all DBs require the full list: Sybase, for example, will happily execute queries like

SELECT a, b, COUNT(*)
FROM some_table
GROUP BY a

... which (at least every time I accidentally ran such a monster) often leads to such enormous inadvertent recordsets that panic-stricken requests quickly ensue, begging the DBAs to bounce the server. The result is a sort of partial Cartesian product, but I think it may mostly be a failure on Sybase's part to implement the SQL standard properly.

Perhaps we need a shorthand form - call it GroupSelect

GroupSelect Field1, Field2, sum(Field3) From SomeTable Where (X = "3")

This way, the parser need only throw an error if you leave out an aggregate function.

The good reason for it is that you would get incorrect results more often than not if you did not specify all columns. Suppose you have three columns, col1, col2 and col3.

Suppose your data looks like this:

Col1  Col2 Col3
a      b    1
a      c    1
b      b    2
a      b    3

select col1, col2, sum(col3) from mytable group by col1, col2
would give the following results:

Col1  Col2 Col3
a      b    4
a      c    1
b      b    2

How would it interpret
select col1, col2, sum(col3) from mytable group by col1

My guess would be

Col1  Col2 Col3
a      b    5
a      c    5
b      b    2

These are clearly bad results. Of course the more complex the query and the more joins the less likely it would be that the query would return correct results or that the programmer would even know if they were incorrect.

Personally I'm glad that group by requires the fields.

I agree with GROUP BY ALL, GROUP BY *, or something similar. As mentioned in the original post, in 99% (perhaps more) of the cases you want to group by all non-aggregate columns/expressions.

Here is however one example where you would need GROUP BY columns, for backward compatibility reasons.

SELECT 
  MIN(COUNT(*)) min_same_combination_cnt, 
  MAX(COUNT(*)) max_same_comb_cnt, 
  AVG(COUNT(*)) avg_same_comb_cnt, 
  SUM(COUNT(*)) total_records,
  COUNT(COUNT(*)) distinct_combinations_cnt
FROM <some table>
GROUP BY <list of columns>

This works in Oracle. I use it to estimate selectivity on columns. The group by is applied to the inner aggregate function. Then, the outer aggregate is applied.

It would be nice to put forward a suggestion for this improvement to the SQL standard. I just don't know how that works.

Actually, wouldn't that be 100% of the time? Is there a case in which you can have a (non-aggregate) column in the select that is not in the GROUP BY?

I don't have an answer though. It certainly does seem like a awkward moment for the language.

I share the op's view that repeating is a bit annoying, especially if the non-aggregate fields contain elaborate statements like ifs and functions and a whole lot of other things. It would be nice if there could be some shorthand in the group by clause - at least a column alias. Referring to the columns by number may be another option, albeit one that probably has their own problems.

There could be a situation that you needed to extract one id of all the rows grouped, and sum of their quantities - for example. In this case you would i.e. group them by name and leave ids not grouped. SQLite seems to work this way.

Since group by result in single tuple for a whole group of tuples so other non group by attributes must be used in aggregate function only. If u add non group by attribute in select then sql cant decide which which value to be select from that group.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top