Frage

Let's say I have m integers from n disjoint integer intervals, which are in some sense "far" apart.
n is not known beforehand, but it is known to be small (see assumptions below).

For example, for n = 3, I might have been given randomly distributed integers from the intervals 105-2400, 58030-571290, 1000000-1000100.

Finding the minimum (105) and maximum (1000100) is clearly trivial.
But is there any way to efficiently (in O(m) time and hopefully o(m) space) find the intervals' boundary points, so that I can quickly partition the data for separate processing?

If there is no efficient way to do this exactly, is there an efficient way to approximate the partitions, to within a small constant factor (like 2)?
(For example, 4000 would be an acceptable approximation of the upper bound of the smaller interval, and 30000 would be an acceptable approximation of the lower bound of the middle interval.)

Assumptions:

  1. Everything is nonnegative

  2. n is very small (say, < 10)

  3. The max value is comparatively large (say, on the order of 226)

  4. The intervals are dense (i.e. there exists an integer in the array for most values inside that interval)

  5. Two clusters are far apart if their closest boundaries are at least a constant factor c apart.
    (Edit: It makes sense for c to be relative to the cluster size rather than relative to the bound. So a cluster of 1 element at 1000000 should not be approximated as originating from the interval 500000-2000000.)

  6. The integers are not sorted, and this is crucial. In fact sorting them in O(m) time is impossible without radix sort, but radix sort could have O(max value) complexity, and there is no guarantee the max value is anywhere close to m.

Again, speed is the most important factor here; inaccuracy is tolerated as long as it's within a reasonable factor.

War es hilfreich?

Lösung 4

Actually I think I was wrong when I posted the question -- the answer does seem to be radix sort.

The number of buckets is arbitrary, it doesn't have to correlate with the sizes of the intervals.
It might even be 2, if I go bit-by-bit.

Thus radix sort could help me sort the data in O(m log(max value)) ≈ O(m) time (since log(max value) is essentially a constant factor of 26 according to the assumptions), at which point the problem becomes trivial.

Andere Tipps

I say move to a logarithmic scale of factor c to search for the intervals, since you know them being at least c factor apart. Then, make an array of counters, each counter counts numbers in intervals (0.5X .. 0.5X+0.5) logarithmic scale, where X is the index of selected counter.

Say, c is 2, and you know the maximum upper bound of 226, so you create 52 counters, then calculate floor(2*log<sub>2</sub>i) where i is the current integer, and increment that counter. After you parse all of m integers, walk through that array, and each sequence of zeroes in there will mean that the corresponding logarithmic interval is empty.

So the output of this will be the sequence of occupied intervals, logarithmically aligned to halves of power of c, aka 128, 181, 256, 363, 512 etc. This satisfies your requirements on precision for intervals' boundaries.

Update: You can also store lowest and highest number out of those that hit the interval. Once you do, the intervals' boundaries are calculated as follows:

  1. Find the first nonzero counter from current position in the counters array. Take its lowest number as lower bound.
  2. Progress through the counters array until you find a zero in the counter or hit the array's end. Take the last nonzero counter's highest number. This will be your upper bound for the current interval.
  3. Proceed until full traversal of the array.
  4. Return the set of intervals found. The boundaries will be strict.

An example: (abstract language code)

counters=[];
lowest=[];
highest=[];
for (i=0;i<m;i++) { 
    x=getNextInteger();
    n=Math.floor(2.0*logByBase(c,x));
    counters[n]++;
    if (counters[n]==1) {
        lowest[n]=x;
        highest[n]=x;
    } else {
        if (lowest[n]>x) lowest[n]=x;
        if (highest[n]<x) highest[n]=x;
    }
}
zeroflag=true; /// are we in mode of finding a zero or a nonzero
intervals=[];
currentLow=0;
currentHigh=0;
for (i=0;i<counters.length;i++) {
    if (zeroflag) {
        // we search for nonzero
        if (counters[i]>0) {
            currentLow=lowest[i]; // there's a value
            zeroflag=false;
        } // else skip
    } else {
        if (counters[i]==0) {
            currentHigh=highest[i-1]; // previous was nonzero, get its highest
            intervals.push([currentLow,currentHigh]); // store interval
            zeroflag=true;
    }
}
if (!zeroflag) { // unfinished interval 
    currentHigh=highest[counters.length-1];
    intervals.push([currentLow,currentHigh]); // store last interval
}

You may want to look at aproximate median finding.

These methods can often be generalized to finding arbitrary quantiles with a reasonable precision; and quantiles are good for distributing your work load.

Here's my take, using two passes over the data set.

  1. sample 10000 objects from your data set.

  2. Solve your problem for the sample objects.

  3. Re-scan your data set, assigning each object to the nearest interval from your sample, and tracking the minimum and maximum of each interval.

If your gaps are prominent enough, they should still be visible in the sample. The second pass is only to refine the interval boundaries.

Split the total range into buckets. The bucket boundaries X_i should be spaced in an appropriate way, like e.g. linearly X_i=16*i. Other options would be quadratic spacing like X_i=4*i*i or logarithmic X_i=2^(i/16), here the total number of buckets would be smaller but finding the right bucket for a given number would be more effort. Each bucket is empty or non-empty, so one bit would be sufficient.
You iterate over the set of numbers, and for each number you mark its bucket as non-empty. Then the gaps between your intervals are represented by series of empty buckets. So now you find all sufficiently long series of empty buckets, and you have the interval gaps. The accuracy of the interval boundary will be determined by the bucket size, so assuming a bucket size of 16 you interval border is off by at most 15. If the max number is 226 and buckets are size 16 and you use one bit for each bucket you need 219 byte, or 512kB of memory.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top