Question

Say I have the following example distribution (vector) of numbers in c++:

vector 1    vector 2   vector 3
11          4          65
128         6          66
12          4          64
13          4          62
12          5          65
14          5          63
16          7          190
60          3          210
120         4          220
126         5          242
77          6          231
14          4          210
12          7          222
13          6          260
11          8          300
14          6          233
99                     80
15                     66
13

I need to find a threshold for each vector. I'll eliminate the larger ("bad") numbers in each if they are above that vector's threshold. I want to re-use this method to find a threshold on other similar vectors in the future. The numbers aren't necessarily mostly smaller "good" numbers.

The threshold would ideally be just a hair larger than most of the smaller "good" numbers. For example, the first vetor's ideal threshold value would be around 17 or 18, the second's would be about 8, and the third's would be around 68-70.

I realize this is probably simple math but since I'm horrible at math in general, I would really appreciate a code example on how to find this magical threshold, in either C++ or Objective-C specifically, which is why I'm posting this in SO and not on the Math site.

Some things I've tried

float threshold = mean_of_vector;
float threshold = mean_of_vector / 1.5f;
float threshold = ((max_of_vector - min_of_vector) / 2.0f) + mean_of_vector;

Each of these seem to have their own issues, eg: some include too many of the "good" average numbers (so the threshold was too low), some not enough good numbers (threshold too high), or not enough of the "bad" numbers. Sometimes they'll work with specific vectors of numbers, for example, if the standard deviation is high, but not others where the standard deviation is low.

I'm thinking the method would involve standard deviation and/or some sort of gaussian distribution, but I don't know how to piece them together to get the desired result.

Edit: I am able to re-sort the vectors.

Was it helpful?

Solution

You could just eliminate the values above 90% or 95%. Technicaly you calculate the p = 0.9 (or 0.95) percentile of the array distribution.

Just sort the array ascending:

int[] data;

Arrays.sort(data); // or use ArrayList<Integer> which has Collections.sort(dataArrayList),

Then calculate position of percentile p:

float p = data.length * p; // e.g p = 0.9 for 90% percentile.
// cut of fractional part.
int posInt = (int) p;

// this is the threshold value 
int threshold = data[posInt]

Now filter array by keeping all value < or <= threshold. This keeps the 90% of smallest values.

int i = 0;
while (i < data.length && data[i] <= threshold) {
  // output data[i];
}

For mathematically "perfect" results you could search for "calculate percentile of discrete array / values). As i remeber there are two valid algorithms, describeing whether one has to round down or round up the posInt. I my example above I just truncated.

OTHER TIPS

An idea would be to compute both the mean mu and the standard deviation sigma (e.g. using the algorithm described at "Accurately computing running variance" ) and use both of them for the definition of your threshold.

If your data are assumed to be Gaussian, you know that 97.5% of your data should be below mu + 2*sigma, so that can be a good threshold.

Remark: you might want to recompute your threshold once you have rejected the extreme values since these values can have a significant impact on the mean and the standard deviation.

EDIT:

I just computed the thresholds using the method I proposed and it does not look satisfactory for you: for the first case, threshold is around 130 (so maybe taking 1.5 sigma could help getting rid of the largest entries), for the second case, threshold is around 8 and for the third case, threshold is around 262.

Actually, I'm not that surprised with these results: for your last example, you want to get rid of more than the half of the data! Assuming that the data are Gaussian with just a few extrem values is far from what you have at hand...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top