Calculate variance of frequencies when dataset does not contain entries of frequency zero

https://stackoverflow.com/questions/16527961

29-05-2022

Question

I have a dataset with three fields: id, feature and frequency. For a given group of ids, I want to find the feature with the largest spread of frequencies. The goal is that splitting the group into two sub-groups at the median frequency of that feature yields two groups that are as different from each other as possible and yet roughly equal in size.

My first thought was to calculate the variance of the frequencies for each feature and pick the feature with the highest variance.

Given a database table which looks something like this:

id | feature | frequency
---+---------+-------------
 0 | 0       | 1
 0 | 1       | 1
 0 | 2       | 0
 1 | 0       | 2
 1 | 1       | 2
 1 | 2       | 0
 2 | 0       | 3
 2 | 1       | 3
 2 | 2       | 8
 3 | 0       | 4
 3 | 1       | 8
 3 | 2       | 10
 4 | 0       | 5
 4 | 1       | 10
 4 | 2       | 12

Feature 0 has frequencies of 1, 2, 3, 4, 5
Feature 1 has frequencies of 1, 2, 3, 8, 10
Feature 2 has frequencies of 0, 0, 8, 10, 12

We can see that feature 2 has the biggest spread and that 8 would make a nice split point (0, 0 and 8 in one group, 10 and 12 in the other).
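The median split described above can be sketched in a few lines of Python. This is only an illustration: it uses the feature 2 frequencies as they appear in the example table, and the names `feature2`, `low` and `high` are made up for the sketch.

```python
from statistics import median

# Feature 2 frequencies keyed by id, as listed in the example table.
feature2 = {0: 0, 1: 0, 2: 8, 3: 10, 4: 12}

# Split the ids at the median frequency of the chosen feature.
split_point = median(feature2.values())
low = [i for i, f in feature2.items() if f <= split_point]
high = [i for i, f in feature2.items() if f > split_point]

print(split_point, low, high)
```

With an odd number of ids the median itself has to go to one side, so the two groups can only be roughly equal in size.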

I can calculate this with the following SQL query:

SELECT feature, variance(frequency) as f FROM Dataset WHERE id IN (<list of ids>) GROUP BY feature ORDER BY f DESC LIMIT 1;
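To make the intended result concrete, here is a rough Python equivalent of that query run against the dense example table. It assumes `variance` means the sample variance (n - 1 denominator); whether the SQL function is the sample or the population variance depends on the database, but the ranking is the same either way when every feature has the same number of rows. The helper name `variance_by_feature` is invented for this sketch.

```python
from collections import defaultdict
from statistics import variance  # sample variance (n - 1 denominator)

# Dense example table from the question: (id, feature, frequency)
rows = [
    (0, 0, 1), (0, 1, 1), (0, 2, 0),
    (1, 0, 2), (1, 1, 2), (1, 2, 0),
    (2, 0, 3), (2, 1, 3), (2, 2, 8),
    (3, 0, 4), (3, 1, 8), (3, 2, 10),
    (4, 0, 5), (4, 1, 10), (4, 2, 12),
]

def variance_by_feature(rows, ids):
    """Group frequencies by feature for the given ids; return {feature: variance}."""
    groups = defaultdict(list)
    for row_id, feature, frequency in rows:
        if row_id in ids:
            groups[feature].append(frequency)
    return {feature: variance(freqs) for feature, freqs in groups.items()}

dense_variances = variance_by_feature(rows, ids={0, 1, 2, 3, 4})
best_feature = max(dense_variances, key=dense_variances.get)  # ORDER BY f DESC LIMIT 1
print(dense_variances, best_feature)
```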

This works fine, but has one flaw. My dataset is sparse (most entries have a frequency of zero) and it is expensive for me (both in terms of space and in terms of time it takes to insert the entries) to store the zero frequency items in the database. Therefore my actual tables look something like this:

id | feature | frequency
---+---------+-------------
 0 | 0       | 1
 0 | 1       | 1
 1 | 0       | 2
 1 | 1       | 2
 2 | 0       | 3
 2 | 1       | 3
 2 | 2       | 8
 3 | 0       | 4
 3 | 1       | 8
 3 | 2       | 10
 4 | 0       | 5
 4 | 1       | 10
 4 | 2       | 12

The above SQL query does not get the correct results now, as it needs to consider the zero frequency entries to calculate the correct variance value. My SQL skills aren't good enough to figure out a (performant) query that can get around this limitation...
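A small Python check shows how far off the result can be. It is a sketch that again assumes the sample variance, with the values taken from the tables above: feature 2 loses its two zero rows in the sparse table, while feature 1 is stored in full.

```python
from statistics import variance  # sample variance (n - 1 denominator)

# Feature 2: the zero rows for ids 0 and 1 are not stored in the sparse table.
stored_feature2 = [8, 10, 12]
full_feature2 = [0, 0] + stored_feature2

# Feature 1 has no zero entries, so it is unaffected by the sparse storage.
feature1 = [1, 2, 3, 8, 10]

sparse_variance = variance(stored_feature2)  # what the query sees
true_variance = variance(full_feature2)      # what it should see
feature1_variance = variance(feature1)

print(sparse_variance, true_variance, feature1_variance)
```

On the sparse table the variance of feature 2 falls below that of feature 1, so the query would pick feature 1 instead.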

My next thought was to calculate the maximum entropy instead, but as far as I can tell it ignores the actual frequency values (and how often each value occurs in the dataset) and depends only on the number of distinct values. Unless I'm misunderstanding the entropy formula.
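The concern about magnitudes can be illustrated with a short sketch. It assumes "entropy" here means the Shannon entropy of the empirical distribution of frequency values, which may not be the exact formula the question has in mind; the function name `shannon_entropy` and the second list of values are made up for the example. Under that assumption the measure depends on how often each value occurs, but not on how far apart the values are.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

wide = shannon_entropy([0, 0, 8, 10, 12])   # feature 2 from the example
narrow = shannon_entropy([0, 0, 1, 2, 3])   # same pattern of repeats, much smaller spread

print(wide, narrow)
```

Both lists give the same entropy because they have the same pattern of repeated values, even though their spreads differ a lot.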

So my questions are:

Is there a way to do this in SQL?
If not, is there a way of "adjusting" the calculated variance to account for the number of zero entries? (Assume I know how many zero entries were omitted)
If yes, is there a way of doing this in a single SQL query as above? (Again, assume I know beforehand how many zero entries were omitted)
If neither is possible, is there a way of using entropy and adjusting for the actual values?
Is there some other measure (e.g. kurtosis?) that I should consider? Are there any that can easily be adjusted for missing zero entries?
Or any other suggestions or alternative solutions?

Solution

To fill in the gaps in your table, you can use a "helper" temp table containing the valid list of features: CROSS JOIN it with your data to generate the missing zero-frequency rows, then UNION those with the rows you already store. The exact syntax depends on the database system you are using. For example, suppose your data is in a table named "have" and you have a table named "helper" with three rows (one for each of your three features). Then this might work:

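-- rows that are actually stored
select id, feature, frequency
from have
union
-- plus a zero-frequency row for every (id, feature) pair that is not stored
select b.id
     , a.feature
     , 0 as frequency
from helper a
cross join have b
where not exists (
   -- skip pairs that already have a row in the data
   select 1 from have b1
   where b1.id = b.id
     and b1.feature = a.feature
   )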
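One way to try this query out is with Python's built-in sqlite3 module. The sketch below is only a check under assumptions: it uses SQLite rather than whatever database you run, loads the sparse example data into `have`, fills `helper` with the three features, and then computes the sample variance per feature on the filled-in rows.

```python
import sqlite3
from collections import defaultdict
from statistics import variance  # sample variance (n - 1 denominator)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE have (id INTEGER, feature INTEGER, frequency INTEGER);
    CREATE TABLE helper (feature INTEGER);
    -- Sparse table from the question: zero-frequency rows are not stored.
    INSERT INTO have VALUES
        (0, 0, 1), (0, 1, 1),
        (1, 0, 2), (1, 1, 2),
        (2, 0, 3), (2, 1, 3), (2, 2, 8),
        (3, 0, 4), (3, 1, 8), (3, 2, 10),
        (4, 0, 5), (4, 1, 10), (4, 2, 12);
    INSERT INTO helper VALUES (0), (1), (2);
""")

# Stored rows plus a zero row for every missing (id, feature) pair.
filled = conn.execute("""
    SELECT id, feature, frequency
    FROM have
    UNION
    SELECT b.id, a.feature, 0 AS frequency
    FROM helper a
    CROSS JOIN have b
    WHERE NOT EXISTS (
        SELECT 1 FROM have b1
        WHERE b1.id = b.id AND b1.feature = a.feature
    )
""").fetchall()

# Group the filled-in rows by feature and compute the variance of each group.
groups = defaultdict(list)
for _id, feature, frequency in filled:
    groups[feature].append(frequency)

filled_variances = {feature: variance(freqs) for feature, freqs in groups.items()}
best_feature = max(filled_variances, key=filled_variances.get)
print(len(filled), filled_variances, best_feature)
```

The cross join pairs every feature with every stored row, so the same zero row is generated several times per id; the UNION removes those duplicates.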

Here is an SQLFiddle.
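If the number of ids in the group is known, a possible alternative is to skip the zero rows altogether. Omitted zeros add nothing to SUM(frequency) or SUM(frequency * frequency), so the sample variance over all n ids can be written as (SUM(frequency * frequency) - SUM(frequency)^2 / n) / (n - 1). The sketch below tries this in SQLite through Python's sqlite3 module. It assumes the sparse example data in a table named `have`, a group of at least two ids, and the sample variance; treat it as an illustration rather than a tuned query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE have (id INTEGER, feature INTEGER, frequency INTEGER);
    -- Sparse table from the question: zero-frequency rows are not stored.
    INSERT INTO have VALUES
        (0, 0, 1), (0, 1, 1),
        (1, 0, 2), (1, 1, 2),
        (2, 0, 3), (2, 1, 3), (2, 2, 8),
        (3, 0, 4), (3, 1, 8), (3, 2, 10),
        (4, 0, 5), (4, 1, 10), (4, 2, 12);
""")

ids = [0, 1, 2, 3, 4]
n = len(ids)  # group size, including ids whose row for a feature is not stored
placeholders = ", ".join("?" for _ in ids)

# Sample variance from the sum and the sum of squares, dividing by the full
# group size n instead of the number of stored rows.
query = f"""
    SELECT feature,
           (SUM(frequency * frequency)
            - SUM(frequency) * SUM(frequency) * 1.0 / ?) / (? - 1) AS f
    FROM have
    WHERE id IN ({placeholders})
    GROUP BY feature
    ORDER BY f DESC
    LIMIT 1
"""
best_feature, best_variance = conn.execute(query, [n, n, *ids]).fetchone()
print(best_feature, best_variance)
```

A feature with no stored rows for the group does not appear in the result at all, which is harmless here because its variance would be zero. The sum-of-squares form can lose precision when frequencies are very large, so it is worth comparing against the filled-in approach on real data.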

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow