find groups with duplicate versions

https://dba.stackexchange.com/questions/281995

12-03-2021
|

Question

I want to find all the groups in a table that have duplicate group version. A group can have multiple group versions. Each group version can have multiple members. A group 'version' is defined by grpid and changeDate. A group consider a duplicate if ALL the members (userid, pct and hobby ) in one group version matches another group version within the same group.

https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=84eb81a1a71dcee9ad3d0bd91f56120a

The groups table:

id	grpid	changeDate	userid	pct	hobby	group version^*
1	1	2020-01-01	1	1	1	1
2	1	2020-01-02	1	1	2	2
3	1	2020-01-03	1	1	1	3
4	2	2020-01-01	1	0.5	1	4
5	2	2020-01-01	2	0.5	2	4
6	2	2020-01-02	1	0.5	1	5
7	2	2020-01-02	2	0.5	3	5
8	3	2020-01-01	1	0.5	1	6
9	3	2020-01-01	2	0.5	2	6
10	3	2020-01-02	1	0.4	1	7
11	3	2020-01-02	2	0.6	2	7
12	4	2020-01-01	1	0.6	1	8
13	4	2020-01-01	2	0.4	2	8
14	4	2020-01-02	1	0.6	1	9
15	4	2020-01-02	2	0.4	2	9
16	5	2020-01-01	1	0.2	2	10
17	5	2020-01-01	2	0.5	1	10
18	5	2020-01-01	3	0.3	2	10
19	6	2020-01-01	1	0.3	2	11
20	6	2020-01-01	2	0.5	1	11
21	6	2020-01-01	3	0.2	2	11
22	6	2020-02-01	1	0.2	2	12
23	6	2020-02-01	2	0.5	1	12
24	6	2020-02-01	3	0.3	2	12
25	6	2020-03-01	1	0.3	2	13
26	6	2020-03-01	2	0.3	1	13
27	6	2020-03-01	3	0.4	2	13
28	7	2020-01-01	1	0.3	2	14
29	7	2020-01-01	2	0.5	1	14
30	7	2020-01-01	3	0.2	2	14
31	7	2020-02-01	1	0.3	2	15
32	7	2020-02-01	2	0.5	1	15
33	7	2020-02-01	3	0.2	2	15
34	7	2020-03-01	1	0.3	2	16
35	7	2020-03-01	2	0.3	1	16
36	7	2020-03-01	3	0.4	2	16
37	8	2020-02-01	1	0.3	1	17
38	8	2020-03-01	1	0.3	1	18
39	8	2020-03-01	3	0.4	2	18

^{^*Unique group version number just for visualization.}

Result should be:

grpid
1
4
7

Explanation:

grpid 1 - there are 3 group versions (of 1 member) - 1 and 3 duplicate because userid, pct and hobby equal
grpid 2 - there are 2 group versions (of 2 members) - not duplicate because hobby not equal between 5 and 7
grpid 3 - there are 2 group versions (of 2 members) - not duplicate because pct is different among all members
grpid 4 - there are 2 group versions (of 2 members) - all members are duplicate because userid, pct and userid equal
grpid 5 - there is only one group of 3 members - not duplicate
grpid 6 - there are 3 group versions (of 3 members) - not duplicate - the pct changed for each member in the group between versions
grpid 7 - there are 3 group versions (of 3 members) - duplicate because userid, pct and userid equal between 28-30 and 31-33
grpid 8 - there are 2 group versions one w/ one member and one with 2 members - not duplicate because there is another member in that group

I am using MySQL 5.7.

Solution

Hopefully this should provide an answser:

This implements "common" relational division:

SELECT DISTINCT grpid
FROM groups AS g
GROUP BY grpid, changeDate
HAVING NOT EXISTS
    ( SELECT 1
      FROM groups AS gi
      WHERE gi.grpid = g.grpid
        AND gi.changeDate = g.changeDate
        AND NOT EXISTS
            ( SELECT 1 
               FROM groups AS gk
               WHERE gk.grpid = gi.grpid
                 AND gk.changeDate <> gi.changeDate
                 AND gk.userid = gi.userid
                 AND gk.pct = gi.pct
                 AND gk.hobby = gi.hobby
            )
    ) ;

Result:

grid
   1
   4
   7

This implements exact relational division, which results in even more complicated code. Your case, where the groupings/versions are determined by two columns (grpid, changeDate) makes it look like even more complicated.

Tested in dbffdle.uk

Query 1:

SELECT DISTINCT g1.grpid
FROM 
  ( SELECT grpid, changeDate
    FROM groups AS g
    GROUP BY grpid, changeDate
  ) AS g1
  JOIN
  ( SELECT grpid, changeDate
    FROM groups AS g
    GROUP BY grpid, changeDate
  ) AS g2
  ON  g1.grpid = g2.grpid
  AND g1.changeDate < g2.changeDate
WHERE NOT EXISTS
    ( SELECT 1
      FROM groups AS gi
      WHERE gi.grpid = g1.grpid
        AND gi.changeDate = g1.changeDate
        AND NOT EXISTS
            ( SELECT 1 
               FROM groups AS gk
               WHERE gk.grpid = g2.grpid
                 AND gk.changeDate = g2.changeDate
                 AND gk.userid = gi.userid
                 AND gk.pct = gi.pct
                 AND gk.hobby = gi.hobby
            )
    )
   AND NOT EXISTS
    ( SELECT 1
      FROM groups AS gi
      WHERE gi.grpid = g2.grpid
        AND gi.changeDate = g2.changeDate
        AND NOT EXISTS
            ( SELECT 1 
               FROM groups AS gk
               WHERE gk.grpid = g1.grpid
                 AND gk.changeDate = g1.changeDate
                 AND gk.userid = gi.userid
                 AND gk.pct = gi.pct
                 AND gk.hobby = gi.hobby
            )
    )
 ;

Query 2 that uses MySQL's GROUP_CONCAT function:

SELECT DISTINCT grpid
FROM
  ( SELECT grpid, changeDate, 
           GROUP_CONCAT( CONCAT_WS('-', userid, hobby, pct)
                         ORDER BY userid, hobby, pct
                         SEPARATOR '  '
                       ) AS groupdata 
    FROM groups AS gr
    GROUP BY grpid, changeDate
  ) AS g
GROUP BY grpid, groupdata
HAVING COUNT(*) > 1 ;

OTHER TIPS

You can use a inner join for that

Fyi mysql 8 duesn't like groups as table name, and must be put in backticks

And lines id = 2 and and id = 4, are also "identical, and fall under your rule

create table groups ( id int,  grpid int , changeDate date,  userid int,  pct double ,  hobby int );

insert into groups(id , grpid , changeDate  ,userid  ,pct  ,hobby ) values
(1  , 1      ,'2020-01-01', 1   ,    1   , 1  ),
(2  , 1      ,'2020-01-02', 1   ,    1   , 2  ),
(3  , 1      ,'2020-01-03', 1   ,    1   , 1  ),
(4  , 2      ,'2020-01-01', 1   ,    0.5 , 1  ),
(5  , 2      ,'2020-01-01', 1   ,    0.5 , 2  ),
(6  , 2      ,'2020-01-02', 1   ,    0.5 , 1  ),
(7  , 2      ,'2020-01-02', 1   ,    0.5 , 3  ),
(8  , 3      ,'2020-01-01', 1   ,    0.5 , 1  ),
(9  , 3      ,'2020-01-01', 1   ,    0.5 , 2  ),
(10 , 3      ,'2020-01-02', 1   ,    0.4 , 1  ),
(11 , 3      ,'2020-01-02', 1   ,    0.6 , 2  ),
(12 , 4      ,'2020-01-01', 1   ,    0.6 , 1  ),
(13 , 4      ,'2020-01-01', 1   ,    0.4 , 2  ),
(14 , 4      ,'2020-01-02', 1   ,    0.6 , 1  ),
(15 , 4      ,'2020-01-02', 1   ,    0.4 , 2  );

SELECT DISTINCT g1.grpid 
FROM `groups` g1 INNER JOIN  `groups` g2
ON g1.grpid = g2.grpid 
AND g1.hobby = g2.hobby 
AND g1.pct = g2.pct 
AND g1.userid = g2.userid
WHERE g1.id <> g2.id

| grpid |
| ----: |
|     1 |
|     2 |
|     4 |

db<>fiddle here

SELECT DISTINCT grpid
FROM TableWithGroups
GROUP BY grpid, userid, pct, hobby
HAVING COUNT(*) > 1

This uses the GROUP BY and HAVING clauses to respectively group up the values on the fields you specified to condense the duplicates into one row and filter where the count of rows condensed in each grouping was greater than 1 (in other words where there were duplicates).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange