Self join vs group by when counting duplicates

Question 1

Thanks to @vogomatix, whose answer helped me understand my problem and where I went wrong. The last query actually returns one row for each pair of duplicates, with no repetitions, so its row count cannot be compared with the sum(total) from the first query. Given this case:

DID | XDOCKEYPHX
---------------
1   |    1
2   |    1
3   |    1
4   |    2
5   |    2
6   |    3
7   |    3
8   |    3
9   |    3

The first inner query would return

XDOCKEYPHX | TOTAL
------------------
1   |    3
2   |    2
3   |    4

The full query would then return count = 3, meaning there are 3 document keys that have duplicates, and sum(total) = 9, the total number of duplicated documents.
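The grouped counts can be reproduced on the sample data with a small sketch. It uses Python's sqlite3 module rather than Oracle, and the query shape (group by xdockeyphx, having count > 1) is an assumption based on how the first query is described here, not a copy of it.

```python
import sqlite3

# Sample data from the table above: (DID, XDOCKEYPHX).
ROWS = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docmeta (did INTEGER PRIMARY KEY, xdockeyphx INTEGER)")
conn.executemany("INSERT INTO docmeta VALUES (?, ?)", ROWS)

# Assumed shape of the inner query: one row per key that occurs more than once.
inner_sql = """
    SELECT xdockeyphx AS dockey, COUNT(*) AS total
    FROM docmeta
    WHERE xdockeyphx IS NOT NULL
    GROUP BY xdockeyphx
    HAVING COUNT(*) > 1
    ORDER BY xdockeyphx
"""
inner_rows = conn.execute(inner_sql).fetchall()

# The outer query counts the keys and sums the per-key totals.
outer_row = conn.execute(
    f"SELECT COUNT(dockey), SUM(total) FROM ({inner_sql})"
).fetchone()

print(inner_rows)  # [(1, 3), (2, 2), (3, 4)]
print(outer_row)   # (3, 9)
```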

Now, the second and third queries, if we use just a select *, will give something like:

DID_1 | XDOCKEYPHX | DID_2
--------------------------
2     |     1      |    1
3     |     1      |    1
3     |     1      |    2
5     |     2      |    4
7     |     3      |    6
8     |     3      |    6
8     |     3      |    7
9     |     3      |    6
9     |     3      |    7
9     |     3      |    8

So the second query, select count(distinct(xdockeyphx)), gives the correct value 3, but the third query, select count(*), gives 10. That is not what I wanted, since I was after the total number of duplicated documents (9). What the third query gives you is every pair of duplicates, so you can then compare them or whatever. My mistake was assuming that counting all the rows of the third query would give the sum(total) of the first query, which I now realize was wrong.
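Why 10 and not 9: a key shared by n documents contributes n * (n - 1) / 2 pairs to the self join, so the groups of size 3, 2 and 4 give 3 + 1 + 6 = 10 rows, while their sizes add up to 9. A rough check on the same sample data, again using sqlite3 as a stand-in for Oracle (the table name docmeta without a schema is a simplification):

```python
import sqlite3

# Sample data from the table above: (DID, XDOCKEYPHX).
ROWS = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docmeta (did INTEGER PRIMARY KEY, xdockeyphx INTEGER)")
conn.executemany("INSERT INTO docmeta VALUES (?, ?)", ROWS)

# Self join that keeps each pair of duplicates exactly once.
join_sql = """
    FROM docmeta doc1
    JOIN docmeta doc2 ON doc1.xdockeyphx = doc2.xdockeyphx
    WHERE doc1.did > doc2.did
"""
pair_count = conn.execute("SELECT COUNT(*) " + join_sql).fetchone()[0]
key_count = conn.execute(
    "SELECT COUNT(DISTINCT doc1.xdockeyphx) " + join_sql
).fetchone()[0]

# Group sizes for the duplicated keys, to compare against the pair count.
group_sizes = [
    row[0]
    for row in conn.execute(
        "SELECT COUNT(*) FROM docmeta GROUP BY xdockeyphx "
        "HAVING COUNT(*) > 1 ORDER BY xdockeyphx"
    )
]
pairs_from_sizes = sum(n * (n - 1) // 2 for n in group_sizes)

print(pair_count, key_count)               # 10 3
print(group_sizes, sum(group_sizes))       # [3, 2, 4] 9
print(pairs_from_sizes)                    # 10
```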

Question 2

You don't need the complexity of your last where clause:

where doc1.did > doc2.did
and doc1.xdockeyphx = doc2.xdockeyphx
and doc1.xdockeyphx is not null
and doc2.xdockeyphx is not null

If you think about it, doc2.xdockeyphx cannot be null if doc1.xdockeyphx is not null, because the two are required to be equal. Perhaps it is better expressed by joining the tables:

select count(*)
from ecm_ocs.docmeta doc1
join ecm_ocs.docmeta doc2
on doc1.xdockeyphx = doc2.xdockeyphx
where doc1.xdockeyphx is not null and doc1.did > doc2.did

Your first two queries report distinct/grouped results, whereas your last one simply reports every joined row, which is why the counts differ.
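As a side note on the null checks: in SQL, NULL = NULL does not evaluate to true, so an equality join never matches rows whose key is null. The sketch below, run on SQLite with two extra null-key rows added to the sample data (those two rows are made up for illustration), suggests that even the remaining is not null check does not change the count. It does no harm to keep it for readability.

```python
import sqlite3

# Sample data plus two hypothetical rows (10 and 11) with a NULL key.
ROWS = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3), (8, 3), (9, 3),
        (10, None), (11, None)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docmeta (did INTEGER PRIMARY KEY, xdockeyphx INTEGER)")
conn.executemany("INSERT INTO docmeta VALUES (?, ?)", ROWS)

base_sql = """
    SELECT COUNT(*)
    FROM docmeta doc1
    JOIN docmeta doc2 ON doc1.xdockeyphx = doc2.xdockeyphx
    WHERE doc1.did > doc2.did
"""
# Same join, once without and once with the explicit null check.
without_check = conn.execute(base_sql).fetchone()[0]
with_check = conn.execute(
    base_sql + " AND doc1.xdockeyphx IS NOT NULL"
).fetchone()[0]

print(without_check, with_check)  # 10 10
```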

Question 3

In the third query, column names are duplicated because of the (*); you could try replacing select count(*) with select count(doc1.*).

Question 4

Let's keep it simple.

SELECT FROM_ID,
       TO_ID
FROM   TABLE1;

This fetches

Note: TO_ID is the PK on this table.

Your first query (of course, I changed the predicates):

SELECT COUNT ( DOCKEY ), SUM ( TOTAL )
FROM   (SELECT   DOC1.TO_ID DOCKEY, COUNT ( DOC1.TO_ID ) TOTAL
        FROM     TABLE1 DOC1
        GROUP BY DOC1.TO_ID
        HAVING   COUNT ( DOC1.TO_ID ) > 0);

Produces

5    5

Here I selected rows grouped by TO_ID, which produces five rows in the subquery; the aggregation in the main query then counts them as 5.

Now in the second query, even if you replace the select with COUNT(*) as in the third, you should get the same count, because I am joining on the PK.

SELECT COUNT ( DISTINCT ( DOC1.TO_ID ) )
FROM   TABLE1 DOC1, TABLE1 DOC2
WHERE  DOC1.TO_ID = DOC2.TO_ID;

5


SELECT COUNT(*)
FROM   TABLE1 DOC1, TABLE1 DOC2
WHERE  DOC1.TO_ID = DOC2.TO_ID;

5

But in your case, you are not joining on the PK; you only use it in a predicate.

TABLE1.COL1 = TABLE1.COL1 in a self join acts as a JOIN ON condition.
TABLE1.COL1 > TABLE1.COL1 in a self join acts like a Cartesian product filtered by the inequality.

So in your second query you used DISTINCT, which saved you from these duplicates, but not in the third, which is a mere count of the returned rows. To check this, you can do a select *.
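The difference between joining on the PK and joining on a non-unique column can be sketched as follows. This uses sqlite3 instead of Oracle, and the FROM_ID values are hypothetical, since the actual contents of TABLE1 are not shown; only the fact that TO_ID is unique across five rows is taken from the text.

```python
import sqlite3

# Hypothetical rows (from_id, to_id): TO_ID is the PK, FROM_ID repeats.
ROWS = [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (from_id INTEGER, to_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", ROWS)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Self join on the PK: every row matches only itself.
pk_join = "FROM table1 doc1, table1 doc2 WHERE doc1.to_id = doc2.to_id"
pk_rows = scalar("SELECT COUNT(*) " + pk_join)
pk_distinct = scalar("SELECT COUNT(DISTINCT doc1.to_id) " + pk_join)

# Self join on a non-unique column, with the PK used only in an inequality.
nonpk_join = (
    "FROM table1 doc1, table1 doc2 "
    "WHERE doc1.from_id = doc2.from_id AND doc1.to_id > doc2.to_id"
)
nonpk_rows = scalar("SELECT COUNT(*) " + nonpk_join)
nonpk_distinct = scalar("SELECT COUNT(DISTINCT doc1.from_id) " + nonpk_join)

print(pk_rows, pk_distinct)        # 5 5
print(nonpk_rows, nonpk_distinct)  # 4 2
```

With the PK join, COUNT(*) and COUNT(DISTINCT ...) agree; with the non-unique join they diverge, which is the effect described above.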