Question

I have a set of numbers for a given set of attributes:

red    = 4
blue   = 0
orange = 2
purple = 1

I need to calculate the distribution percentage. Meaning, how diverse is the selection? Is it 20% diverse? Is it 100% diverse (meaning an even distribution of say 4,4,4,4)?

I'm trying to create a sexy percentage that approaches 100% as the individual values get closer to equal, and drops lower the more lopsided they get.

Has anyone done this?

Here is the PHP conversion of the below example. For some reason it's not producing 1.0 with a 4,4,4,4 example.

$arrayChoices = array(4, 4, 4, 4);

// Initialize the accumulator; relying on an undefined variable raises a notice.
$sum = 0;
foreach ($arrayChoices as $p)
    $sum += $p;

print "sum: " . $sum . "<br>";

$pArray = array();

foreach ($arrayChoices as $rec)
{
    print "p vector value: " . $rec . " " . ($rec / $sum) . "\n<br>";
    $pArray[] = $rec / $sum;
}

$total = 0;

foreach ($pArray as $p)
    if ($p > 0)
        $total = $total - $p * log($p, 2);

print "total = $total <br>";

// Note: the final *100 means this prints 100 for (4,4,4,4), not the raw ratio 1.0.
print round($total / log(count($pArray), 2) * 100);

Thanks in advance!

Solution

A simple, if rather naive, scheme is to sum the absolute differences between your observations and a perfectly uniform distribution:

red    = abs(4 - 7/4) = 9/4
blue   = abs(0 - 7/4) = 7/4
orange = abs(2 - 7/4) = 1/4
purple = abs(1 - 7/4) = 3/4

for a total of 5.
A perfectly even spread will have a score of zero which you must map to 100%.
Assuming you have n items in c categories, the most lopsided spread possible puts all n items in a single category. Each of the c-1 empty categories then deviates by n/c, and the full category deviates by n - n/c, for a score of

(c-1)*n/c + 1*(n-n/c) = 2*(n-n/c)

which you should map to 0%. For a score d, you might use the linear transformation

100% * (1 - d / (2*(n-n/c)))

For your example this would result in

100% * (1 - 5 / (2*(7-7/4))) = 100% * (1 - 10/21) ~ 52%
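The scheme above can be sketched in Ruby (the function name `naive_diversity` is mine, not part of the original answer):

```ruby
# Naive diversity: sum the absolute deviations from a perfectly uniform
# spread, then map linearly so that zero deviation -> 100% and the
# maximally uneven spread (everything in one category) -> 0%.
def naive_diversity(counts)
  n = counts.sum.to_f          # total number of items
  c = counts.length            # number of categories
  d = counts.sum { |x| (x - n / c).abs }  # total deviation from uniform
  100.0 * (1 - d / (2 * (n - n / c)))
end

naive_diversity([4, 0, 2, 1])  # ~52.4, matching the worked example above
naive_diversity([4, 4, 4, 4])  # => 100.0
```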

Better yet (although more complicated) is the Kolmogorov–Smirnov statistic, with which you can make mathematically rigorous statements about the probability that a set of observations has some given underlying probability distribution.

OTHER TIPS

One possibility would be to base your measure on entropy. The uniform distribution has maximum entropy, so you could create a measure as follows:

1) Convert your vector of counts to P, a vector of proportions (probabilities).

2) Calculate the entropy function H(P) for your vector of probabilities P.

3) Calculate the entropy function H(U) for a vector of equal probabilities which has the same length as P. (This turns out to be H(U) = -log(1.0 / length(P)), so you don't actually need to create U as a vector.)

4) Your diversity measure would be 100 * H(P) / H(U).

Any set of equal counts yields a diversity of 100. When I applied this to your (4, 0, 2, 1) case, the diversity was 68.94. Any vector with all but one element having counts of 0 has diversity 0.

ADDENDUM

Now with source code! I implemented this in Ruby.

def relative_entropy(v)
  # Sum all the values in the vector v, convert to decimal
  # so we won't have integer division below...
  sum = v.inject(:+).to_f

  # Divide each value in v by sum, store in new array p
  pvals = v.map{|value| value / sum}

  # Build a running total by calculating the entropy contribution for
  # each p.  Entropy is zero if p is zero, in which case total is unchanged.
  # Finally, scale by the entropy equivalent of all proportions being equal.
  pvals.inject(0){|total,p| p > 0 ? (total - p*Math.log2(p)) : total} / Math.log2(pvals.length)
end

# Scale these by 100 to turn into a percentage-like measure
relative_entropy([4,4,4,4])     # => 1.0
relative_entropy([4,0,2,1])     # => 0.6893917467430877
relative_entropy([16,0,0,0])    # => 0.0
Licensed under: CC-BY-SA with attribution