Question

I have a set of numbers for a given set of attributes:

red    = 4
blue   = 0
orange = 2
purple = 1

I need to calculate the distribution percentage. Meaning, how diverse is the selection? Is it 20% diverse? Is it 100% diverse (meaning an even distribution of say 4,4,4,4)?

I'm trying to create a single percentage that approaches 100% the closer the individual values are to one another, and drops lower the more lopsided they become.

Has anyone done this?

Here is the PHP conversion of the below example. For some reason it's not producing 1.0 with a 4,4,4,4 example.

$arrayChoices = array(4,4,4,4);

$sum = 0;
foreach($arrayChoices as $p)
    $sum += $p;

print "sum: ".$sum."<br>";

$pArray = array();

foreach($arrayChoices as $rec)
{
    print "p vector value: ".$rec." ".$rec / $sum."\n<br>";
    array_push($pArray,$rec / $sum);
}   
$total = 0;

foreach($pArray as $p)
    if($p > 0)
        $total = $total - $p*log($p,2);

print "total = $total <br>";

print round($total / log(count($pArray),2) *100);

Thanks in advance!

Solution

A simple, if rather naive, scheme is to sum the absolute differences between your observations and a perfectly uniform distribution

red    = abs(4 - 7/4) = 9/4
blue   = abs(0 - 7/4) = 7/4
orange = abs(2 - 7/4) = 1/4
purple = abs(1 - 7/4) = 3/4

for a total of 5.
A perfectly even spread will have a score of zero which you must map to 100%.
Assuming you have n items in c categories, a perfectly uneven spread (all n items in a single category) will have a score of

(c-1)*n/c + 1*(n-n/c) = 2*(n-n/c)

which you should map to 0%. For a score d, you might use the linear transformation

100% * (1 - d / (2*(n-n/c)))

For your example this would result in

100% * (1 - 5 / (2*(7-7/4))) = 100% * (1 - 10/21) ~ 52%
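The scheme above can be sketched in Ruby (the function name is mine, not part of the answer):

```ruby
# Sum of absolute deviations from a perfectly even spread, mapped
# linearly so that d = 0 gives 100% and the worst case 2*(n - n/c)
# gives 0%, as described above.
def uniformity_percentage(counts)
  n = counts.sum.to_f
  c = counts.length
  uniform = n / c

  # Total absolute difference from the uniform distribution
  d = counts.sum { |x| (x - uniform).abs }

  # Linear transformation: 100% * (1 - d / (2*(n - n/c)))
  100.0 * (1 - d / (2 * (n - n / c)))
end

uniformity_percentage([4, 4, 4, 4])   # => 100.0
uniformity_percentage([4, 0, 2, 1])   # ≈ 52.38
uniformity_percentage([16, 0, 0, 0])  # => 0.0
```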

Better yet (although more complicated) is the Kolmogorov–Smirnov statistic with which you can make mathematically rigorous statements about the probability that a set of observations have some given underlying probability distribution.
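As a rough sketch of that idea (function name mine), the Kolmogorov–Smirnov statistic is the maximum distance between the empirical CDF and a reference CDF, here the uniform one. Note that the KS test assumes an ordered domain, so for unordered categories the value depends on the ordering you choose:

```ruby
# D = max |F_emp(i) - F_uniform(i)| over the categories, treating the
# category order as fixed. Smaller D means closer to uniform.
def ks_statistic(counts)
  n = counts.sum.to_f
  c = counts.length
  cum = 0.0
  counts.each_with_index.map { |x, i|
    cum += x / n                   # empirical CDF after category i
    (cum - (i + 1).to_f / c).abs   # distance from the uniform CDF
  }.max
end

ks_statistic([4, 0, 2, 1])  # ≈ 0.321 (= 9/28)
ks_statistic([4, 4, 4, 4])  # => 0.0
```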

Other tips

One possibility would be to base your measure on entropy. The uniform distribution has maximum entropy, so you could create a measure as follows:

1) Convert your vector of counts to P, a vector of proportions (probabilities).

2) Calculate the entropy function H(P) for your vector of probabilities P.

3) Calculate the entropy function H(U) for a vector of equal probabilities which has the same length as P. (This turns out to be H(U) = -log(1.0 / length(P)), so you don't actually need to create U as a vector.)

4) Your diversity measure would be 100 * H(P) / H(U).

Any set of equal counts yields a diversity of 100. When I applied this to your (4, 0, 2, 1) case, the diversity was 68.94. Any vector with all but one element having counts of 0 has diversity 0.

ADDENDUM

Now with source code! I implemented this in Ruby.

def relative_entropy(v)
  # Sum all the values in the vector v, convert to decimal
  # so we won't have integer division below...
  sum = v.inject(:+).to_f

  # Divide each value in v by sum, store in new array p
  pvals = v.map{|value| value / sum}

  # Build a running total by calculating the entropy contribution for
  # each p.  Entropy is zero if p is zero, in which case total is unchanged.
  # Finally, scale by the entropy equivalent of all proportions being equal.
  pvals.inject(0){|total,p| p > 0 ? (total - p*Math.log2(p)) : total} / Math.log2(pvals.length)
end

# Scale these by 100 to turn into a percentage-like measure
relative_entropy([4,4,4,4])     # => 1.0
relative_entropy([4,0,2,1])     # => 0.6893917467430877
relative_entropy([16,0,0,0])    # => 0.0
License: CC-BY-SA with attribution