SELECT DISTINCT for data groups

https://stackoverflow.com/questions/4275918

28-09-2019
|

Question

I have following table:

ID  Data
1   A
2   A
2   B
3   A
3   B
4   C
5   D 
6   A
6   B

etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:

DataID     Data
101        A
102        A
102        B
103        C
104        D

So DataID 102 would resemble data (A,B), DataID 103 would resemble data (C), etc. In order to be able to rewrite my original table in this form:

ID   DataID
1    101
2    102
3    102
4    103
5    104
6    102

How can I do that?

PS. Code to generate the first table:

CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'

Solution

In my opinion You have to create a custom aggregate that concatenates data (in case of strings CLR approach is recommended for perf reasons). Then I would group by ID and select distinct from the grouping, adding a row_number()function or add a dense_rank() your choice. Anyway it should look like this

with groupings as (
select concat(data) groups
from Table1
group by ID
)
select groups, rownumber() over () from groupings

OTHER TIPS

The following query using CASE will give you the result shown below.

From there on, getting the distinct datagroups and proceeding further should not really be a problem.

SELECT     
    id, 
     MAX(CASE data WHEN 'A' THEN data ELSE '' END) + 
     MAX(CASE data WHEN 'B' THEN data ELSE '' END) + 
     MAX(CASE data WHEN 'C' THEN data ELSE '' END) + 
     MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM  t1
GROUP BY id

ID  DataGroups
1   A
2   AB
3   AB
4   C
5   D
6   AB

However, this kind of logic will only work in case you the "Data" values are both fixed and known before hand.

In your case, you do say that is the case. However, considering that you also say that they are 1000 of them, this will be frankly, a ridiculous looking query for sure :-)

LuckyLuke's suggestion above would, frankly, be the more generic way and probably saner way to go about implementing the solution though in your case.

From your sample data (having added the missing 2,'A' tuple, the following gives the renumbered (and uniqueified) data:

with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
    select ID,data
    from #t1 where id in (select id from NonDups)
), Renumbered as (
    select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered

Giving:

1          A
2          A
2          B
3          C
4          D

I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.

Just to share my own dirty solution that I'm using for the moment:

SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY ( 
    SELECT CAST(Data AS VARCHAR) + ','
    FROM #t1 t2
    WHERE t2.id = t1.id
    ORDER BY Data ASC
    FOR XML PATH('') )  
D ( Data )

And then going analog to LuckyLuke's solution.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow