Domanda

I have a C array with n values. What I need to calculate is how many of the values at the extremes of the array, once sorted, need to be stripped away so that the standard deviation of the remaining values is less than a certain number.

I can't make any assumptions about these values in terms of how they are dispersed or their values. They can be either positive or negative.

I've written the code below, but it's a sledge hammer, and takes forever to run, because this function is called > 1,000 times per second, and the length of the array can be anywhere from 100 to 12,000. Is there a better solution?

double standardDeviation(double values[], int length)
{
    double sdAverage = average(values, length);
    double stDev = 0;
    for (int i = 0; i < length; ++i)
        stDev += pow((values[i] - sdAverage), 2);

    return( sqrt(stDev/length) );
} 

int countItemsThatFit(double values[], int length, int sDevTarget)
{
    qsort(values, length, sizeof (double), compareDoubles);
    int headIndex = 0, tailIndex  = length - 1;

    double sDev        = standardDeviation(values, length);
    int valuesRemaining = length; 

    while (sDev > sDevTarget && valuesRemaining > 0)
    {
        // try removing head and tail (separately) to find which sDev is smaller
        double headStrippedDev = standardDeviation(&values[headIndex + 1], valuesRemaining);
        double tailStrippedDev = standardDeviation(&values[headIndex    ], valuesRemaining - 1);

        if (headStrippedDev <= tailStrippedDev)
            ++headIndex;
        --valuesRemaining;
    }   
    return (valuesRemaining);
}
È stato utile?

Soluzione

Instead of removing values from the array and recomputing the standard deviation, conceive an empty array, add values to it and compute the standard deviation using an updating formula.

The trick to this is to add the values in an ordering such that the standard deviation is non-decreasing as values are added. This can be accomplished by adding values that are closest to the mean of the original array. In other words, the values with the smallest absolute deviation from the mean.

So, the algorithm that should work:

  1. Compute mean of array
  2. Sort array comparing items by absolute deviation from the mean
  3. Set N = 2
  4. Compute the standard deviation for the first two elements
  5. While standard deviation < target standard deviation AND the array still has elements
    1. Set N = N + 1
    2. Take next item from array
    3. Recompute the standard deviation with the additional value using an updating formula
  6. If array still has elements return N
  7. Else return N - 1

Here is an example case to demonstrate the algorithm:

Source Array: [10, 14, 16, 18, 20, 22, 24]
Target SD: 2.5

Step 1) Mean = 17.0
Step 2) [16, 18, 14, 20, 22, 10, 24]
Step 4) SD([16,18]) = 1.4142
Step 5.2) SD([16,18,14]) = 2
Step 5.2) SD([16,18,14,20]) = 2.5820

Return 3

Altri suggerimenti

If you have a collection of n values with a mean m and standard deviation s, then if you add another element x you can compute the new standard deviation by noticing (using the ' to indicate "new" value)

m' = (n * m + x) / (n + 1) = m + (x-m) / (n+1)
s'^2 = (n-1)/n * s + (m-m')^2 / (n+1)

Going the other way, you can find the new mean and variance when you remove a value as well.

Using that, you can compute the new standard deviation from the old one with a simple expression - you don't have to loop over all the values again and again.

All you have to do is:

  1. sort the data
  2. compute mean and standard deviation
  3. find the "leftmost" and "rightmost" value
  4. whichever of these is the furthest from the mean is the one to remove
  5. after removing, compute the new mean and standard deviation
  6. see if you are within limits
  7. if not, allocate a new "leftmost" and "rightmost" (one of these is the same as before - you only replace the one that you took away)
  8. repeat from step 5 until you are done

As for step 5, the formula for the mean when you remove an element is basically the same. If you had n samples and remove one (x), the new mean is

m' = (n * m - x) / (n - 1) = m + (m - x)/(n - 1)

and denoting the current sum of squares as M (= sum((x-m)^2) ), we find

M' = M - (x-m)*(x-m')
s' = sqrt( M' / (n-2))

I think it got this right... you might want to check to make sure of the signs and the +1, -2 etc though.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top