Problem

Imagine you have a set of five elements (A-E) with some numeric values of a measured property (several observations for each element, for example "heart rate"):

A = {100, 110, 120, 130}
B = {110, 100, 110, 120, 90}
C = { 90, 110, 120, 100}
D = {120, 100, 120, 110, 110, 120}
E = {110, 120, 120, 110, 120}

First, I have to detect whether there are significant differences in the average levels. So I run a one-way ANOVA using the statistics package provided by Apache Commons Math. No problems so far: I obtain a boolean that tells me whether differences were found or not.

Second, if differences are found, I need to know which element (or elements) is different from the rest. I plan to use unpaired t-tests, comparing each pair of elements (A with B, A with C, ..., D with E), to know whether one element is different from the other. So, at this point I have the list of elements that present significant differences with others, for example:

C is different from B
C is different from D

But I need a generic algorithm to efficiently determine, from that information, which element is different from the others (C in the example, but there could be more than one).

Leaving statistical issues aside, the question could be stated in general terms as: "Given information about the equality/inequality of each pair of elements in a collection, how do you determine the element(s) that differ from the others?"

This seems to be a problem where graph theory could be applied. I am using Java for the implementation, in case that is useful.
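For what it's worth, here is a minimal, self-contained sketch of that graph idea in Java. The matrix name `different` and the class are hypothetical, purely to illustrate the generic algorithm: treat elements as vertices, treat each significant pairwise difference as an edge, and report the vertex (or vertices) of maximum degree.

import java.util.ArrayList;
import java.util.List;

public class OutlierByDegree {

    // different[i][j] == true means elements i and j tested as
    // significantly different (the matrix is assumed symmetric)
    static List<Integer> findOutliers(boolean[][] different) {
        int n = different.length;
        int[] degree = new int[n];

        // degree of a vertex = number of elements it differs from
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (different[i][j]) {
                    degree[i]++;
                    degree[j]++;
                }
            }
        }

        // find the maximum degree
        int max = 0;
        for (int d : degree) {
            max = Math.max(max, d);
        }

        // collect every element reaching the maximum (none if max == 0)
        List<Integer> outliers = new ArrayList<Integer>();
        if (max > 0) {
            for (int i = 0; i < n; i++) {
                if (degree[i] == max) {
                    outliers.add(i);
                }
            }
        }
        return outliers;
    }
}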

Edit: the elements are people and the measured values are the times needed to complete a task. I need to detect who is taking too much or too little time to complete the task, in a kind of fraud-detection system.


Solution

Just in case anyone is interested, here is the final code, using Apache Commons Math for the statistical operations and Trove to work with collections of primitive types.

It looks for the element(s) with the highest degree (the idea is based on the comments made by @Pace and @Aniko, thanks).

I think the final algorithm is O(n^2); suggestions are welcome. It should work for any problem involving one qualitative vs. one quantitative variable, assuming normality of the observations.

import gnu.trove.iterator.TIntIntIterator;
import gnu.trove.map.TIntIntMap;
import gnu.trove.map.hash.TIntIntHashMap;
import gnu.trove.procedure.TIntIntProcedure;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math.MathException;
import org.apache.commons.math.stat.inference.OneWayAnova;
import org.apache.commons.math.stat.inference.OneWayAnovaImpl;
import org.apache.commons.math.stat.inference.TestUtils;


public class TestMath {
    private static final double SIGNIFICANCE_LEVEL = 0.001; // 99.9%

    public static void main(String[] args) throws MathException {
        double[][] observations = {
           {150.0, 200.0, 180.0, 230.0, 220.0, 250.0, 230.0, 300.0, 190.0 },
           {200.0, 240.0, 220.0, 250.0, 210.0, 190.0, 240.0, 250.0, 190.0 },
           {100.0, 130.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 },
           {200.0, 230.0, 150.0, 230.0, 240.0, 200.0, 210.0, 220.0, 210.0 },
           {200.0, 230.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 }
        };

        final List<double[]> classes = new ArrayList<double[]>();
        for (int i=0; i<observations.length; i++) {
            classes.add(observations[i]);
        }

        OneWayAnova anova = new OneWayAnovaImpl();
//      double fStatistic = anova.anovaFValue(classes); // F-value
//      double pValue = anova.anovaPValue(classes);     // P-value

        boolean rejectNullHypothesis = anova.anovaTest(classes, SIGNIFICANCE_LEVEL);
        System.out.println("reject null hipothesis " + (100 - SIGNIFICANCE_LEVEL * 100) + "% = " + rejectNullHypothesis);

        // differences are found, so make t-tests
        if (rejectNullHypothesis) {
            TIntIntMap fraud = new TIntIntHashMap();

            // i vs j unpaired t-tests - O(n^2)
            for (int i=0; i<observations.length; i++) {
                for (int j=i+1; j<observations.length; j++) {
                    boolean different = TestUtils.tTest(observations[i], observations[j], SIGNIFICANCE_LEVEL);
                    if (different) {
                        // count each element's degree: the number of other
                        // elements it differs from significantly
                        fraud.adjustOrPutValue(i, 1, 1);
                        fraud.adjustOrPutValue(j, 1, 1);
                    }
                }
            }

            // find the maximum degree (TIntIntHashMap is not sorted,
            // so scan all values)
            int maxDegree = 0;
            for (int degree : fraud.values()) {
                maxDegree = Math.max(maxDegree, degree);
            }
            final int max = maxDegree;

            // Keep only the elements with the highest degree
            fraud.retainEntries(new TIntIntProcedure() {
                @Override
                public boolean execute(int element, int degree) {
                    return degree == max;
                }
            });

            // Heuristic: if more than half of the elements differ from
            // the others, no single element really stands out as different
            if (fraud.size() > observations.length / 2) {
                fraud.clear();
            }

            // output
            TIntIntIterator it = fraud.iterator();
            while (it.hasNext()) {
                it.advance();
                System.out.println("Element " + it.key() + " has significant differences");             
            }
        }
    }
}

Other tips

Your edit gives good details; thanks.

Based on that, I would presume a fairly well-behaved distribution of times (normal, or possibly gamma, depending on how close to zero your times get) for typical responses. Rejecting a sample from this distribution could be as simple as computing the standard deviation and seeing which samples lie more than n stdevs from the mean, or as complex as taking subsets that exclude outliers until your data settles down into a nice heap (e.g., the mean stops moving around 'much').
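As a rough sketch of the simple version of that rule (the class name and the threshold parameter are illustrative; the threshold would need tuning against real data):

public class StdevOutliers {

    // flag samples lying more than nStdevs standard deviations
    // from the sample mean
    static boolean[] flagOutliers(double[] samples, double nStdevs) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        double mean = sum / samples.length;

        double sq = 0.0;
        for (double s : samples) {
            sq += (s - mean) * (s - mean);
        }
        double stdev = Math.sqrt(sq / (samples.length - 1)); // sample stdev

        boolean[] flagged = new boolean[samples.length];
        for (int i = 0; i < samples.length; i++) {
            flagged[i] = Math.abs(samples[i] - mean) > nStdevs * stdev;
        }
        return flagged;
    }
}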

Now, you have an added wrinkle if you assume that a person who monkeys with one trial will monkey with another. So you're really trying to discriminate between a person who just happens to be fast (or slow) and one who is 'cheating'. You could do something like computing the standard score (z-score) of each result (if a value is two stdevs above the mean, the score is '2'), and use that as your statistic.

Then, given this new statistic, there are some hypotheses you'll need to test. E.g., my suspicion is that the stdev of this statistic will be higher for cheaters than for someone who is just uniformly faster than other people; but you'd need data to verify that.
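A sketch of that statistic, under the assumption that every person completed the same trials, so each trial can be standardized across people (all names here are illustrative):

public class ZScoreSpread {

    // times[person][trial]: returns, for each person, the stdev of their
    // per-trial z-scores; the hypothesis above is that this spread is
    // larger for cheaters than for someone uniformly fast or slow
    static double[] zScoreSpread(double[][] times) {
        int people = times.length;
        int trials = times[0].length;
        double[][] z = new double[people][trials];

        // standardize each trial: z = (time - trial mean) / trial stdev
        for (int t = 0; t < trials; t++) {
            double mean = 0.0;
            for (int p = 0; p < people; p++) {
                mean += times[p][t];
            }
            mean /= people;

            double sq = 0.0;
            for (int p = 0; p < people; p++) {
                sq += (times[p][t] - mean) * (times[p][t] - mean);
            }
            double stdev = Math.sqrt(sq / (people - 1));

            for (int p = 0; p < people; p++) {
                z[p][t] = (times[p][t] - mean) / stdev;
            }
        }

        // spread (stdev) of each person's z-scores across their trials
        double[] spread = new double[people];
        for (int p = 0; p < people; p++) {
            double mean = 0.0;
            for (int t = 0; t < trials; t++) {
                mean += z[p][t];
            }
            mean /= trials;

            double sq = 0.0;
            for (int t = 0; t < trials; t++) {
                sq += (z[p][t] - mean) * (z[p][t] - mean);
            }
            spread[p] = Math.sqrt(sq / (trials - 1));
        }
        return spread;
    }
}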

Good luck with it!

If the items in each list are sorted in numerical order, you can walk the two lists simultaneously, and any differences can easily be recognized as insertions or deletions. For example:

List A    List B
  1         1       // Match, increment both pointers
  3         3       // Match, increment both pointers
  5         4       // '4' missing in list A. Increment B pointer only.

List A    List B
  1         1       // Match, increment both pointers
  3         3       // Match, increment both pointers
  4         5       // '4' missing in list B (or added to A). Incr. A pointer only.
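A minimal sketch of that two-pointer walk, assuming both arrays are already sorted (the class and method names are illustrative):

public class SortedDiff {

    // report values present in only one of two sorted arrays
    static void diff(int[] a, int[] b) {
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                i++;                // match: advance both pointers
                j++;
            } else if (a[i] < b[j]) {
                System.out.println(a[i] + " missing in list B");
                i++;                // advance A pointer only
            } else {
                System.out.println(b[j] + " missing in list A");
                j++;                // advance B pointer only
            }
        }
        // whatever remains in either array has no counterpart
        for (; i < a.length; i++) {
            System.out.println(a[i] + " missing in list B");
        }
        for (; j < b.length; j++) {
            System.out.println(b[j] + " missing in list A");
        }
    }
}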