find groups with duplicate versions
Question
I want to find all the groups in a table that have duplicate group version.
A group can have multiple group versions.
Each group version can have multiple members.
A group 'version' is defined by grpid
and changeDate
.
A group consider a duplicate if ALL the members (userid
, pct
and hobby
) in one group version matches another group version within the same group.
https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=84eb81a1a71dcee9ad3d0bd91f56120a
The groups
table:
id | grpid | changeDate | userid | pct | hobby | group version* |
---|---|---|---|---|---|---|
1 | 1 | 2020-01-01 | 1 | 1 | 1 | 1 |
2 | 1 | 2020-01-02 | 1 | 1 | 2 | 2 |
3 | 1 | 2020-01-03 | 1 | 1 | 1 | 3 |
4 | 2 | 2020-01-01 | 1 | 0.5 | 1 | 4 |
5 | 2 | 2020-01-01 | 2 | 0.5 | 2 | 4 |
6 | 2 | 2020-01-02 | 1 | 0.5 | 1 | 5 |
7 | 2 | 2020-01-02 | 2 | 0.5 | 3 | 5 |
8 | 3 | 2020-01-01 | 1 | 0.5 | 1 | 6 |
9 | 3 | 2020-01-01 | 2 | 0.5 | 2 | 6 |
10 | 3 | 2020-01-02 | 1 | 0.4 | 1 | 7 |
11 | 3 | 2020-01-02 | 2 | 0.6 | 2 | 7 |
12 | 4 | 2020-01-01 | 1 | 0.6 | 1 | 8 |
13 | 4 | 2020-01-01 | 2 | 0.4 | 2 | 8 |
14 | 4 | 2020-01-02 | 1 | 0.6 | 1 | 9 |
15 | 4 | 2020-01-02 | 2 | 0.4 | 2 | 9 |
16 | 5 | 2020-01-01 | 1 | 0.2 | 2 | 10 |
17 | 5 | 2020-01-01 | 2 | 0.5 | 1 | 10 |
18 | 5 | 2020-01-01 | 3 | 0.3 | 2 | 10 |
19 | 6 | 2020-01-01 | 1 | 0.3 | 2 | 11 |
20 | 6 | 2020-01-01 | 2 | 0.5 | 1 | 11 |
21 | 6 | 2020-01-01 | 3 | 0.2 | 2 | 11 |
22 | 6 | 2020-02-01 | 1 | 0.2 | 2 | 12 |
23 | 6 | 2020-02-01 | 2 | 0.5 | 1 | 12 |
24 | 6 | 2020-02-01 | 3 | 0.3 | 2 | 12 |
25 | 6 | 2020-03-01 | 1 | 0.3 | 2 | 13 |
26 | 6 | 2020-03-01 | 2 | 0.3 | 1 | 13 |
27 | 6 | 2020-03-01 | 3 | 0.4 | 2 | 13 |
28 | 7 | 2020-01-01 | 1 | 0.3 | 2 | 14 |
29 | 7 | 2020-01-01 | 2 | 0.5 | 1 | 14 |
30 | 7 | 2020-01-01 | 3 | 0.2 | 2 | 14 |
31 | 7 | 2020-02-01 | 1 | 0.3 | 2 | 15 |
32 | 7 | 2020-02-01 | 2 | 0.5 | 1 | 15 |
33 | 7 | 2020-02-01 | 3 | 0.2 | 2 | 15 |
34 | 7 | 2020-03-01 | 1 | 0.3 | 2 | 16 |
35 | 7 | 2020-03-01 | 2 | 0.3 | 1 | 16 |
36 | 7 | 2020-03-01 | 3 | 0.4 | 2 | 16 |
37 | 8 | 2020-02-01 | 1 | 0.3 | 1 | 17 |
38 | 8 | 2020-03-01 | 1 | 0.3 | 1 | 18 |
39 | 8 | 2020-03-01 | 3 | 0.4 | 2 | 18 |
*Unique group version number just for visualization.
Result should be:
grpid
1
4
7
Explanation:
- grpid 1 - there are 3 group versions (of 1 member) - 1 and 3 duplicate because userid, pct and hobby equal
- grpid 2 - there are 2 group versions (of 2 members) - not duplicate because hobby not equal between 5 and 7
- grpid 3 - there are 2 group versions (of 2 members) - not duplicate because pct is different among all members
- grpid 4 - there are 2 group versions (of 2 members) - all members are duplicate because userid, pct and userid equal
- grpid 5 - there is only one group of 3 members - not duplicate
- grpid 6 - there are 3 group versions (of 3 members) - not duplicate - the pct changed for each member in the group between versions
- grpid 7 - there are 3 group versions (of 3 members) - duplicate because userid, pct and userid equal between 28-30 and 31-33
- grpid 8 - there are 2 group versions one w/ one member and one with 2 members - not duplicate because there is another member in that group
I am using MySQL 5.7.
Solution
Hopefully this should provide an answser:
This implements "common" relational division:
SELECT DISTINCT grpid
FROM groups AS g
GROUP BY grpid, changeDate
HAVING NOT EXISTS
( SELECT 1
FROM groups AS gi
WHERE gi.grpid = g.grpid
AND gi.changeDate = g.changeDate
AND NOT EXISTS
( SELECT 1
FROM groups AS gk
WHERE gk.grpid = gi.grpid
AND gk.changeDate <> gi.changeDate
AND gk.userid = gi.userid
AND gk.pct = gi.pct
AND gk.hobby = gi.hobby
)
) ;
Result:
grid
1
4
7
This implements exact relational division, which results in even more complicated code. Your case, where the groupings/versions are determined by two columns (grpid, changeDate)
makes it look like even more complicated.
Tested in dbffdle.uk
Query 1:
SELECT DISTINCT g1.grpid
FROM
( SELECT grpid, changeDate
FROM groups AS g
GROUP BY grpid, changeDate
) AS g1
JOIN
( SELECT grpid, changeDate
FROM groups AS g
GROUP BY grpid, changeDate
) AS g2
ON g1.grpid = g2.grpid
AND g1.changeDate < g2.changeDate
WHERE NOT EXISTS
( SELECT 1
FROM groups AS gi
WHERE gi.grpid = g1.grpid
AND gi.changeDate = g1.changeDate
AND NOT EXISTS
( SELECT 1
FROM groups AS gk
WHERE gk.grpid = g2.grpid
AND gk.changeDate = g2.changeDate
AND gk.userid = gi.userid
AND gk.pct = gi.pct
AND gk.hobby = gi.hobby
)
)
AND NOT EXISTS
( SELECT 1
FROM groups AS gi
WHERE gi.grpid = g2.grpid
AND gi.changeDate = g2.changeDate
AND NOT EXISTS
( SELECT 1
FROM groups AS gk
WHERE gk.grpid = g1.grpid
AND gk.changeDate = g1.changeDate
AND gk.userid = gi.userid
AND gk.pct = gi.pct
AND gk.hobby = gi.hobby
)
)
;
Query 2 that uses MySQL's GROUP_CONCAT
function:
SELECT DISTINCT grpid
FROM
( SELECT grpid, changeDate,
GROUP_CONCAT( CONCAT_WS('-', userid, hobby, pct)
ORDER BY userid, hobby, pct
SEPARATOR ' '
) AS groupdata
FROM groups AS gr
GROUP BY grpid, changeDate
) AS g
GROUP BY grpid, groupdata
HAVING COUNT(*) > 1 ;
OTHER TIPS
You can use a inner join for that
Fyi mysql 8 duesn't like groups as table name, and must be put in backticks
And lines id = 2 and and id = 4, are also "identical, and fall under your rule
create table groups ( id int, grpid int , changeDate date, userid int, pct double , hobby int );
insert into groups(id , grpid , changeDate ,userid ,pct ,hobby ) values (1 , 1 ,'2020-01-01', 1 , 1 , 1 ), (2 , 1 ,'2020-01-02', 1 , 1 , 2 ), (3 , 1 ,'2020-01-03', 1 , 1 , 1 ), (4 , 2 ,'2020-01-01', 1 , 0.5 , 1 ), (5 , 2 ,'2020-01-01', 1 , 0.5 , 2 ), (6 , 2 ,'2020-01-02', 1 , 0.5 , 1 ), (7 , 2 ,'2020-01-02', 1 , 0.5 , 3 ), (8 , 3 ,'2020-01-01', 1 , 0.5 , 1 ), (9 , 3 ,'2020-01-01', 1 , 0.5 , 2 ), (10 , 3 ,'2020-01-02', 1 , 0.4 , 1 ), (11 , 3 ,'2020-01-02', 1 , 0.6 , 2 ), (12 , 4 ,'2020-01-01', 1 , 0.6 , 1 ), (13 , 4 ,'2020-01-01', 1 , 0.4 , 2 ), (14 , 4 ,'2020-01-02', 1 , 0.6 , 1 ), (15 , 4 ,'2020-01-02', 1 , 0.4 , 2 );
SELECT DISTINCT g1.grpid FROM `groups` g1 INNER JOIN `groups` g2 ON g1.grpid = g2.grpid AND g1.hobby = g2.hobby AND g1.pct = g2.pct AND g1.userid = g2.userid WHERE g1.id <> g2.id
| grpid | | ----: | | 1 | | 2 | | 4 |
db<>fiddle here
SELECT DISTINCT grpid
FROM TableWithGroups
GROUP BY grpid, userid, pct, hobby
HAVING COUNT(*) > 1
This uses the GROUP BY and HAVING clauses to respectively group up the values on the fields you specified to condense the duplicates into one row and filter where the count of rows condensed in each grouping was greater than 1 (in other words where there were duplicates).