Null value returned on Count Distinct (pl sql)

https://stackoverflow.com/questions/1013993

06-07-2019
|

Question

Preemptive apologies for the nonsensical table/column names on these queries. If you've ever worked with the DB backend of Remedy, you'll understand.

I'm having a problem where a Count Distinct is returning a null value, when I suspect the actual value should be somewhere in the 20's (23, I believe). Below is a series of queries and their return values.

SELECT count(distinct t442.c1)
      FROM t442, t658, t631
     WHERE t442.c1 = t658.c536870930
       AND t442.c200000003 = 'Network'
       AND t442.c536871139 < 2
       AND t631.c536870913 = t442.c1
       AND t658.c536870925 = 1
       AND (t442.c7 = 6 OR t442.c7 = 5)
       AND t442.c536870954 > 1141300800
       AND (t442.c240000010 = 0)

Result = 497.

Add table t649 and make sure it has records linked back to table t442:

 SELECT COUNT (DISTINCT t442.c1)
              FROM t442, t658, t631, t649
             WHERE t442.c1 = t658.c536870930
               AND t442.c200000003 = 'Network'
               AND t442.c536871139 < 2
               AND t631.c536870913 = t442.c1
               AND t658.c536870925 = 1
               AND (t442.c7 = 6 OR t442.c7 = 5)
               AND t442.c536870954 > 1141300800
               AND (t442.c240000010 = 0)
               AND t442.c1 = t649.c536870914

Result = 263.

Filter out records in table t649 where column c536870939 <= 1:

SELECT COUNT (DISTINCT t442.c1)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1

Result = 24.

Filter on the HAVING statement:

SELECT COUNT (DISTINCT t442.c1)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1
        HAVING COUNT (DISTINCT t631.c536870922) =
                                              COUNT (DISTINCT t649.c536870931)

Result = null.

If I run the following query, I can't see anything in the result list that would explain why I'm not getting any kind of return value. This is true even if I remove the DISTINCT from the SELECT. (I get 25 and 4265 rows of data back, respectively).

SELECT DISTINCT t442.c1, t631.c536870922, t649.c536870931
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1

I have several other places where I have the query set up exactly like the one that is returning the null value and it work perfectly fine--returning usable numbers that are the correct values. I have to assume that whatever is unique in this situation is related to data and not the actual query, but I'm not sure what to look for in the data to explain it. I haven't been able to find any null values in the raw data before aggregation. I don't know what else would cause this.

Any help would be appreciated.

Solution

I understand now. Your problem in the original query is that it is highly unusual (if not, in fact, wrong) to use a HAVING clause without a GROUP BY clause. The answer lies in the order of operation the various parts of the query are performed.

In the original query, you do this:

SELECT COUNT(DISTINCT t442.c1)
  FROM ...
 WHERE ...
HAVING COUNT(DISTINCT t631.c536870922) = COUNT(DISTINCT t649.c536870931);

The database will perform your joins and constraints, at which point it would do any group by and aggregation operations. In this case, you are not grouping, so the COUNT operations are across the whole data set. Based on the values you posted above, COUNT(DISTINCT t631.c536870922) = 25 and COUNT(DISTINCT t649.c536870931) = 24. The HAVING clause now gets applied, resulting in nothing matching - your asking for cases where the count of the total set (even though there are multiple c1s) are equal, and they are not. The DISTINCT gets applied to an empty result set, and you get nothing.

What you really want to do is just a version of what you posted in the example that spit out the rows counts:

SELECT count(*)
  FROM (SELECT t442.c1     
          FROM t442
             , t658
             , t631
             , t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (   t442.c7 = 6
                OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1
         GROUP BY t442.c1
        HAVING COUNT(DISTINCT t631.c536870922) = COUNT(DISTINCT t649.c536870931)
       );

This will give you a list of the c1 columns that have equal numbers of the 631 & 649 table entries. Note: You should be very careful about the use of DISTINCT in your queries. For example, in the case where you posted the results above, it is completely unnecessary; oftentimes it acts as a kind of wallpaper to cover over errors in queries that don't return results the way you want due to a missed constraint in the WHERE clause ("Hmm, my query is returning dupes for all these values. Well, a DISTINCT will fix that problem").

OTHER TIPS

What is the result of:

SELECT COUNT (DISTINCT t631.c536870922),
       COUNT (DISTINCT t649.c536870931)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1

If the two columns there never have equal values, then it makes sense that adding the HAVING clause would eliminate all rows from the result set.

COUNT(DISTINCT column) doesn't count NULL values:

SELECT  COUNT(DISTINCT val1)
FROM    (
        SELECT  NULL AS val1
        FROM    dual
        )

---
0

Could it be the case?

I would try putting the HAVING clause conditions in the WHERE clause instead. Is there any reason you chose HAVING? Just FYI, HAVING is a filter that is done after the result set is returned which may cause unexpected results. Also it is not used in the optimization of the query. If you don't have to use HAVING I would suggest not using it.

I would suggest adding the counts to the SELECT clause then joining them in the WHERE clause.

If I do this:

SELECT distinct t442.c1, count(distinct t631.c536870922), 
    count (distinct t649.c536870931)
          FROM t442, t658, t631, t649
         WHERE t442.c1 = t658.c536870930
           AND t442.c200000003 = 'Network'
           AND t442.c536871139 < 2
           AND t631.c536870913 = t442.c1
           AND t658.c536870925 = 1
           AND (t442.c7 = 6 OR t442.c7 = 5)
           AND t442.c536870954 > 1141300800
           AND (t442.c240000010 = 0)
           AND t442.c1 = t649.c536870914
           AND t649.c536870939 > 1
           group by t442.c1
           having count(distinct t631.c536870922)= 
                         count (distinct t649.c536870931)

I see the 23 rows that should be counted. Removing the HAVING statement returns 24 rows, the extra one which does not meet that HAVING criteria.

EDIT: Results of the query, as requested per Steve Broberg:

row | t442.c1         | cnt t631 | cnt 649
-------------------------------------------
1   | CHG000000230378 |    2     |    1
2   | CHG000000230846 |    1     |    1
3   | CHG000000232562 |    1     |    1
4   | CHG000000232955 |    1     |    1
5   | CHG000000232956 |    1     |    1
6   | CHG000000232958 |    1     |    1
7   | CHG000000233027 |    1     |    1
8   | CHG000000233933 |    1     |    1
9   | CHG000000233934 |    1     |    1
10  | CHG000000233997 |    1     |    1
11  | CHG000000233998 |    1     |    1
12  | CHG000000233999 |    1     |    1
13  | CHG000000234001 |    1     |    1
14  | CHG000000234005 |    1     |    1
15  | CHG000000234009 |    1     |    1
16  | CHG000000234012 |    1     |    1
17  | CHG000000234693 |    1     |    1
18  | CHG000000234696 |    1     |    1
19  | CHG000000234730 |    1     |    1
20  | CHG000000234839 |    1     |    1
21  | CHG000000235115 |    1     |    1
22  | CHG000000235224 |    1     |    1
23  | CHG000000235488 |    1     |    1
24  | CHG000000235847 |    1     |    1

The first row is filtered out properly if I include the HAVING clause.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow