What is the difference between GROUP BY, DISTINCT, and UNION for selecting distinct values for multiple columns?

https://stackoverflow.com/questions/1271643

13-09-2019

Question

This question explained a way of getting distinct combinations of multiple columns. But I want to know the difference between the DISTINCT, UNION, and GROUP BY methods for this purpose. I am getting different results when using them. My queries are like this:

Query 1.

select 
column1,
column2,
column3
from table
group by 1,2,3

Query 2.

select distinct 
column1,
column2,
column3
from table

Query 3.

SELECT DISTINCT(ans) FROM (
    SELECT column1 AS ans FROM sametable
    UNION
    SELECT column2 AS ans FROM sametable
    UNION
    SELECT column3 AS ans FROM sametable
) AS Temp

I am getting a different number of rows for the above queries (Edit: the first two give an equal number of rows but the last one gives a different number). Can anybody explain what the above queries are doing? Especially the third one?

EDIT: Note that I am doing the UNION on the same table. In that case what will happen?

Solution

Starting with what I think is the simplest: DISTINCT really is just that. It returns each distinct combination of the selected column values once. Think of this dataset:

COL1      COL2      COL3
A         B         C
D         E         F
G         H         I
A         B         C   <- duplicate of row 1

This will return 3 rows because the 4th row in the dataset exactly matches the first row. Result:

COL1      COL2      COL3
A         B         C
D         E         F
G         H         I
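As a quick check of the DISTINCT behaviour described above, here is a minimal sketch using Python's built-in sqlite3 module. The table name `sample` is an assumption (the answer does not name the table), and other engines may return the rows in a different order unless an ORDER BY is given.

```python
import sqlite3

# In-memory database holding the four-row dataset from above.
# The table name "sample" is hypothetical; the answer does not name it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [("A", "B", "C"), ("D", "E", "F"), ("G", "H", "I"), ("A", "B", "C")],
)

# DISTINCT compares whole result rows, so the repeated (A, B, C) row collapses.
# ORDER BY is added only to make the output order deterministic.
distinct_rows = conn.execute(
    "SELECT DISTINCT col1, col2, col3 FROM sample ORDER BY col1, col2, col3"
).fetchall()
conn.close()

print(distinct_rows)
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'H', 'I')]
```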

GROUP BY is frequently used for summaries and other calculations:

select COL1, SUM(COL2) from table group by COL1;

For this dataset:

COL1      COL2
A         5
A         6
B         2
C         3
C         4
C         5

would return

COL1     SUM(COL2)
A        11
B        2
C        12
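The same summary can be reproduced with a small sqlite3 sketch. The table name `sample` is again an assumption, and the ORDER BY is there only to pin down the output order.

```python
import sqlite3

# In-memory database with the six-row dataset from above.
# The table name "sample" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [("A", 5), ("A", 6), ("B", 2), ("C", 3), ("C", 4), ("C", 5)],
)

# One output row per distinct col1 value; SUM runs over the rows in each group.
sums = conn.execute(
    "SELECT col1, SUM(col2) FROM sample GROUP BY col1 ORDER BY col1"
).fetchall()
conn.close()

print(sums)
# [('A', 11), ('B', 2), ('C', 12)]
```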

A UNION just takes the results from different queries and presents them as one result set, removing duplicate rows along the way (UNION ALL keeps them):

Table1
COL1
A

Table2
COLX
B

Table3
WHATEVER_COLUMN_NAME
Giddyup

select COL1 from Table1
UNION
select COLX from Table2
UNION 
select WHATEVER_COLUMN_NAME from Table3;

Result Set:

A
B
Giddyup

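When performing a union, each query must return the same number of columns and the column datatypes must match up. You generally can't UNION a number column with a char column unless you explicitly perform a data conversion, although some engines convert implicitly.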
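A minimal sqlite3 sketch of the three-table UNION above follows. It also unions Table1 with itself to show how UNION and UNION ALL treat duplicates; that second pair of queries is an illustration added here, not part of the original answer.

```python
import sqlite3

# In-memory database with the three one-column tables from above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (COL1 TEXT)")
conn.execute("CREATE TABLE Table2 (COLX TEXT)")
conn.execute("CREATE TABLE Table3 (WHATEVER_COLUMN_NAME TEXT)")
conn.execute("INSERT INTO Table1 VALUES ('A')")
conn.execute("INSERT INTO Table2 VALUES ('B')")
conn.execute("INSERT INTO Table3 VALUES ('Giddyup')")

# The column names differ, but each query returns one text column,
# so the results can be stacked into a single result set.
union_rows = conn.execute(
    "SELECT COL1 FROM Table1 "
    "UNION SELECT COLX FROM Table2 "
    "UNION SELECT WHATEVER_COLUMN_NAME FROM Table3 "
    "ORDER BY 1"
).fetchall()

# UNION removes duplicate rows; UNION ALL keeps every row.
same_table_union = conn.execute(
    "SELECT COL1 FROM Table1 UNION SELECT COL1 FROM Table1"
).fetchall()
same_table_union_all = conn.execute(
    "SELECT COL1 FROM Table1 UNION ALL SELECT COL1 FROM Table1"
).fetchall()
conn.close()

print(union_rows)            # [('A',), ('B',), ('Giddyup',)]
print(same_table_union)      # [('A',)]
print(same_table_union_all)  # [('A',), ('A',)]
```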

OTHER TIPS

Let's assume this is your database data:

column1 | column2 | column3
1       | 2       | 1
1       | 2       | 2
1       | 2       | 1
3       | 1       | 2
1       | 2       | 2
1       | 2       | 2
1       | 2       | 2

First query

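In the first query, GROUP BY 1,2,3 groups by the first, second, and third selected columns in databases that accept ordinal positions in GROUP BY (MySQL and PostgreSQL, for example). With no aggregate function in the select list, each group collapses to a single row, so duplicates are removed and the query returns (in no guaranteed order):

1       | 2       | 1
1       | 2       | 2
3       | 1       | 2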

2nd query

The second query returns each unique tuple of column values once, so you will end up with:

1       | 2       | 1
1       | 2       | 2
3       | 1       | 2

3rd query

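The last query collects the values from all three columns into a single column and then removes duplicates from that set, so you get every value that appears in any of the columns. The UNION already removes duplicates, so the outer DISTINCT is redundant. In the end this will return: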

1
2
3

Does this make it clear?
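To see what the three queries from the question return on this sample data, here is a sketch that runs them in SQLite through Python's sqlite3 module. The table name `t` is an assumption (`table` itself is a reserved word), and the GROUP BY is written with explicit column names rather than the ordinals 1,2,3 so that it does not rely on engine-specific handling of positions. In SQLite, at least, the GROUP BY query and the DISTINCT query return the same three rows, while the UNION query returns the distinct individual values.

```python
import sqlite3

# In-memory database with the seven sample rows; the table name "t" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (column1 INTEGER, column2 INTEGER, column3 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 2, 1), (1, 2, 2), (1, 2, 1), (3, 1, 2), (1, 2, 2), (1, 2, 2), (1, 2, 2)],
)

order = " ORDER BY column1, column2, column3"  # only for deterministic output

# Query 1: grouping by every selected column leaves one row per combination.
group_by_rows = conn.execute(
    "SELECT column1, column2, column3 FROM t "
    "GROUP BY column1, column2, column3" + order
).fetchall()

# Query 2: DISTINCT over the same three columns.
distinct_rows = conn.execute(
    "SELECT DISTINCT column1, column2, column3 FROM t" + order
).fetchall()

# Query 3: stack the three columns into one column, then deduplicate.
union_rows = conn.execute(
    "SELECT DISTINCT ans FROM ("
    "  SELECT column1 AS ans FROM t"
    "  UNION SELECT column2 AS ans FROM t"
    "  UNION SELECT column3 AS ans FROM t"
    ") AS temp ORDER BY ans"
).fetchall()
conn.close()

print(group_by_rows)  # [(1, 2, 1), (1, 2, 2), (3, 1, 2)]
print(distinct_rows)  # [(1, 2, 1), (1, 2, 2), (3, 1, 2)]
print(union_rows)     # [(1,), (2,), (3,)]
```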

Let's go with a sample set of data

orderid    customer orderdate
1          B        July 29
2          A        Aug 1
3          A        Aug 4
4          C        Aug 5
5          B        Aug 6
6          A        Aug 11

DISTINCT basically returns a single instance of each record, with no duplicates across the entire set of columns in the result set. Ex: "select distinct customer from orders" would return "A", "B", "C". Many engines happen to return these in alphabetical order, but the order is only guaranteed if you add an ORDER BY.

GROUP BY performs aggregations over a given set of fields in a query. Ex:

select customer, count(*) as NumberOfOrders from Orders group by 1

would result in:
A    3
B    2
C    1

You can also apply DISTINCT inside an aggregate function in a query, so that it counts distinct values within each group:

select customer, count(*) as NumberOfOrders, count( distinct {month of orderdate} ) as CustomerMonths from orders group by customer

would result in:
A    3    1  (all orders were in August)
B    2    2  (had orders in July and August)
C    1    1  (only one order in August)
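The grouped counts above can be reproduced with the sqlite3 sketch below. The sample data gives no year, so the year 2009 in the ISO dates is an assumption made only so that SQLite's `strftime` can extract the month; the `{month of orderdate}` placeholder will look different in other engines.

```python
import sqlite3

# In-memory orders table. The year 2009 is an assumption; the sample data
# only gives month and day.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderid INTEGER, customer TEXT, orderdate TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "B", "2009-07-29"),
        (2, "A", "2009-08-01"),
        (3, "A", "2009-08-04"),
        (4, "C", "2009-08-05"),
        (5, "B", "2009-08-06"),
        (6, "A", "2009-08-11"),
    ],
)

# COUNT(*) counts every order in the group; COUNT(DISTINCT ...) counts the
# different months in which the customer placed orders.
per_customer = conn.execute(
    "SELECT customer, "
    "       COUNT(*) AS NumberOfOrders, "
    "       COUNT(DISTINCT strftime('%m', orderdate)) AS CustomerMonths "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(per_customer)
# [('A', 3, 1), ('B', 2, 2), ('C', 1, 1)]
```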

The queries in a UNION must produce the same result format: the same number of columns, in the same sequence, with compatible datatypes. The column names do not have to match; the result takes its names from the first query. Let's say you have an orders table that has the exact same structure as an archived version of the data. You only keep current data for the most recent year, and all historical data is pushed to the archive. If you wanted to get ALL order activity for a given customer in one query, you would want to do a union:

select customerid, orderdate, amount from CurrentOrders where customerid = ?? UNION select customerid, orderdate, amount from ArchivedOrders where customerid = ?? order by 2 desc

The ORDER BY clause goes after the last select and sorts the combined result; most engines reject an ORDER BY placed on the first select of a union. It's like SQL saying: go to table one and get all rows that qualify, go to table two and get all that qualify there, merge the two sets, then sort. The final result is all matching records from both tables, with exact duplicate rows removed (use UNION ALL to keep them).
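As an illustration of a union over a current and an archive table, here is a sqlite3 sketch with a single ORDER BY placed after the last select, where it sorts the combined result. The rows, dates, amounts, and the customer id 7 are all made up for the example; only the table and column names come from the query above.

```python
import sqlite3

# In-memory tables with identical structure; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
for table in ("CurrentOrders", "ArchivedOrders"):
    conn.execute(
        f"CREATE TABLE {table} (customerid INTEGER, orderdate TEXT, amount REAL)"
    )
conn.executemany(
    "INSERT INTO CurrentOrders VALUES (?, ?, ?)",
    [(7, "2009-08-11", 30.0), (7, "2009-08-04", 20.0), (8, "2009-08-05", 15.0)],
)
conn.executemany(
    "INSERT INTO ArchivedOrders VALUES (?, ?, ?)",
    [(7, "2008-12-01", 10.0), (7, "2008-06-15", 5.0)],
)

# The ORDER BY comes after the last SELECT and applies to the whole union.
customer_id = 7  # hypothetical customer
all_orders = conn.execute(
    "SELECT customerid, orderdate, amount FROM CurrentOrders WHERE customerid = ? "
    "UNION "
    "SELECT customerid, orderdate, amount FROM ArchivedOrders WHERE customerid = ? "
    "ORDER BY orderdate DESC",
    (customer_id, customer_id),
).fetchall()
conn.close()

for row in all_orders:
    print(row)
# (7, '2009-08-11', 30.0)
# (7, '2009-08-04', 20.0)
# (7, '2008-12-01', 10.0)
# (7, '2008-06-15', 5.0)
```

Swapping UNION for UNION ALL would keep exact duplicate rows, which matters if the same order could legitimately appear twice.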

HTH

If you include the "Actual Execution Plan" (Ctrl+M in SQL Server Management Studio), it will give you a diagram of how the SQL engine optimises and executes each of your statements. Understanding this will help you write better queries.

Licensed under: CC-BY-SA with attribution
