Thanks to @vogomatix, whose answer helped me understand my problem and where I went wrong. The last query actually returns one row for each pair of duplicates, with no repetitions, but counting those rows does not give the sum(total) from the first query. Given this case:
DID | XDOCKEYPHX
---------------
1 | 1
2 | 1
3 | 1
4 | 2
5 | 2
6 | 3
7 | 3
8 | 3
9 | 3
The first inner query would return:
XDOCKEYPHX | TOTAL
------------------
1 | 3
2 | 2
3 | 4
And the full query would give count = 3, meaning there are 3 documents that have duplicates, and the total number of duplicated documents, sum(total) = 9.
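As a rough sketch, the numbers above can be reproduced with an in-memory SQLite database. The original queries are not repeated in this answer, so the table name docs and the exact SQL below are assumptions chosen to match the results shown, not the actual statements from the question.

```python
import sqlite3

# Hypothetical table name "docs"; the column names come from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (did INTEGER PRIMARY KEY, xdockeyphx INTEGER)")
rows = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)]
conn.executemany("INSERT INTO docs VALUES (?, ?)", rows)

# Assumed inner query: one row per key that occurs more than once,
# together with the number of rows sharing that key.
inner = """
    SELECT xdockeyphx, COUNT(*) AS total
    FROM docs
    GROUP BY xdockeyphx
    HAVING COUNT(*) > 1
"""
per_key = conn.execute(inner + " ORDER BY xdockeyphx").fetchall()

# Assumed outer query: how many keys are duplicated, and how many rows in total.
key_count, total_rows = conn.execute(
    f"SELECT COUNT(*), SUM(total) FROM ({inner})"
).fetchone()

print(per_key)                # [(1, 3), (2, 2), (3, 4)]
print(key_count, total_rows)  # 3 9
```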
Now, the second and third queries, if we use just a select *, will give something like:
DID_1 | XDOCKEYPHX | DID_2
--------------------------
2 | 1 | 1
3 | 1 | 1
3 | 1 | 2
5 | 2 | 4
7 | 3 | 6
8 | 3 | 6
8 | 3 | 7
9 | 3 | 6
9 | 3 | 7
9 | 3 | 8
So now the second query, select count(distinct(xdockeyphx)), gives the correct value 3, but the third query, select count(*), gives 10, which is incorrect for my purposes, since I wanted the sum of duplicates for each DID (9). What the third query gives you is all the pairs of duplicates, so you can then compare them or process them further. A group of n duplicates produces n(n-1)/2 pairs, so the three groups here yield 3 + 1 + 6 = 10 rows rather than 3 + 2 + 4 = 9. My misunderstanding was thinking that counting all the rows in the third query would give the sum of duplicates for each DID (the sum(total) of the first query), which I now realize was wrong.
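To check the two counts side by side, here is a minimal sketch using an in-memory SQLite database. The self-join is an assumption reconstructed from the pair table above (each row joined to every lower-DID row with the same key), and the table name docs is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (did INTEGER PRIMARY KEY, xdockeyphx INTEGER)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)],
)

# Assumed self-join: pair each row with every lower-DID row sharing its key.
pairs_from = (
    "FROM docs d1 JOIN docs d2 "
    "ON d1.xdockeyphx = d2.xdockeyphx AND d1.did > d2.did"
)

# Third query: counts pairs of duplicates.
pair_count = conn.execute(f"SELECT COUNT(*) {pairs_from}").fetchone()[0]

# Second query: counts distinct keys that have at least one pair.
key_count = conn.execute(
    f"SELECT COUNT(DISTINCT d1.xdockeyphx) {pairs_from}"
).fetchone()[0]

# Group sizes n for each duplicated key, taken from a plain GROUP BY.
sizes = [
    n for (n,) in conn.execute(
        "SELECT COUNT(*) FROM docs GROUP BY xdockeyphx HAVING COUNT(*) > 1"
    )
]

rows_in_groups = sum(sizes)                           # sum of n
pairs_expected = sum(n * (n - 1) // 2 for n in sizes)  # sum of n(n-1)/2

print(key_count, pair_count, rows_in_groups, pairs_expected)
```

The pair count matches the n(n-1)/2 total rather than the sum of group sizes, so the grouped query is the one to use when the goal is the number of duplicated documents.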