How to detect partitions (clusters) of sparse data in linear time, and (hopefully) sublinear space?

Question 1

Actually I think I was wrong when I posted the question -- the answer does seem to be radix sort.

The number of buckets is arbitrary, it doesn't have to correlate with the sizes of the intervals.
It might even be 2, if I go bit-by-bit.

Thus radix sort could help me sort the data in O(m log(max value)) ≈ O(m) time (since log(max value) is essentially a constant factor of 26 according to the assumptions), at which point the problem becomes trivial.

Question 2

I say move to a logarithmic scale of factor c to search for the intervals, since you know them being at least c factor apart. Then, make an array of counters, each counter counts numbers in intervals (0.5X .. 0.5X+0.5) logarithmic scale, where X is the index of selected counter.

Say, c is 2, and you know the maximum upper bound of 2²⁶, so you create 52 counters, then calculate floor(2*log<sub>2</sub>i) where i is the current integer, and increment that counter. After you parse all of m integers, walk through that array, and each sequence of zeroes in there will mean that the corresponding logarithmic interval is empty.

So the output of this will be the sequence of occupied intervals, logarithmically aligned to halves of power of c, aka 128, 181, 256, 363, 512 etc. This satisfies your requirements on precision for intervals' boundaries.

Update: You can also store lowest and highest number out of those that hit the interval. Once you do, the intervals' boundaries are calculated as follows:

Find the first nonzero counter from current position in the counters array. Take its lowest number as lower bound.
Progress through the counters array until you find a zero in the counter or hit the array's end. Take the last nonzero counter's highest number. This will be your upper bound for the current interval.
Proceed until full traversal of the array.
Return the set of intervals found. The boundaries will be strict.

An example: (abstract language code)

counters=[];
lowest=[];
highest=[];
for (i=0;i<m;i++) { 
    x=getNextInteger();
    n=Math.floor(2.0*logByBase(c,x));
    counters[n]++;
    if (counters[n]==1) {
        lowest[n]=x;
        highest[n]=x;
    } else {
        if (lowest[n]>x) lowest[n]=x;
        if (highest[n]<x) highest[n]=x;
    }
}
zeroflag=true; /// are we in mode of finding a zero or a nonzero
intervals=[];
currentLow=0;
currentHigh=0;
for (i=0;i<counters.length;i++) {
    if (zeroflag) {
        // we search for nonzero
        if (counters[i]>0) {
            currentLow=lowest[i]; // there's a value
            zeroflag=false;
        } // else skip
    } else {
        if (counters[i]==0) {
            currentHigh=highest[i-1]; // previous was nonzero, get its highest
            intervals.push([currentLow,currentHigh]); // store interval
            zeroflag=true;
    }
}
if (!zeroflag) { // unfinished interval 
    currentHigh=highest[counters.length-1];
    intervals.push([currentLow,currentHigh]); // store last interval
}

Question 3

You may want to look at aproximate median finding.

These methods can often be generalized to finding arbitrary quantiles with a reasonable precision; and quantiles are good for distributing your work load.

Here's my take, using two passes over the data set.

sample 10000 objects from your data set.
Solve your problem for the sample objects.
Re-scan your data set, assigning each object to the nearest interval from your sample, and tracking the minimum and maximum of each interval.

If your gaps are prominent enough, they should still be visible in the sample. The second pass is only to refine the interval boundaries.

Question 4

Split the total range into buckets. The bucket boundaries X_i should be spaced in an appropriate way, like e.g. linearly X_i=16*i. Other options would be quadratic spacing like X_i=4*i*i or logarithmic X_i=2^(i/16), here the total number of buckets would be smaller but finding the right bucket for a given number would be more effort. Each bucket is empty or non-empty, so one bit would be sufficient.
You iterate over the set of numbers, and for each number you mark its bucket as non-empty. Then the gaps between your intervals are represented by series of empty buckets. So now you find all sufficiently long series of empty buckets, and you have the interval gaps. The accuracy of the interval boundary will be determined by the bucket size, so assuming a bucket size of 16 you interval border is off by at most 15. If the max number is 2²⁶ and buckets are size 16 and you use one bit for each bucket you need 2¹⁹ byte, or 512kB of memory.