I have two arrays generated by two different systems that are independent of each other. I want to compare their similarities by comparing only a few numbers generated from the arrays.

Right now, I'm only comparing the min, max, and sum of the arrays, but I was wondering if there is a better algorithm out there. Any type of hashing algorithm would need to be insensitive to small floating-point differences between the arrays.

EDIT: What I'm trying to do is verify that two algorithms generate the same data without having to compare the data directly. So the algorithm should be sensitive to shifts in the data but relatively insensitive to small differences between individual elements.


Solution

I wouldn't try to reduce this to one number; just pass around a tuple of values, and write a close_enough function that compares the tuples.

For example, you could use (mean, stdev) as your value, and then define close_enough as "each array's mean is within 0.25 stdev of the other array's mean".

from statistics import mean, stdev

def mean_stdev(a):
    # Summarize an array by its mean and standard deviation.
    return mean(a), stdev(a)

def close_enough(mean_stdev_a, mean_stdev_b):
    # "Close enough": each array's mean is within 0.25 stdev
    # of the other array's mean.
    mean_a, stdev_a = mean_stdev_a
    mean_b, stdev_b = mean_stdev_b
    diff = abs(mean_a - mean_b)
    return diff < 0.25 * stdev_a and diff < 0.25 * stdev_b
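
For example, with two arrays that differ only by small floating-point noise:

a = [1.0, 2.0, 3.0]
b = [1.01, 2.02, 2.98]
print(close_enough(mean_stdev(a), mean_stdev(b)))  # True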

Obviously the right value is something you want to tune based on your use case. And maybe you actually want to base it on, e.g., variance (square of stdev), or variance and skew, or stdev and sqrt(skew), or some completely different normalization besides arithmetic mean. That all depends on what your numbers represent, and what "close enough" means.
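
For instance, a (variance, skew) summary might look like the following sketch; variance_skew is just an illustrative name, the skewness is a simple moment-based estimate computed by hand so only the standard library is needed, and it assumes non-constant data:

from statistics import mean, pstdev

def variance_skew(a):
    # Population variance plus a moment-based skewness estimate.
    # Assumes the data is not constant (pstdev(a) != 0).
    m = mean(a)
    sd = pstdev(a)
    return sd ** 2, mean(((x - m) / sd) ** 3 for x in a)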

Without knowing anything about your application area, it's hard to give anything more specific. For example, if you're comparing audio fingerprints (or DNA fingerprints, or fingerprint fingerprints), you'll want something very different from if you're comparing JPEG-compressed images of landscapes.


In your comment, you say you want to be sensitive to the order of the values. To deal with this, you can generate some measure of how "out-of-order" a sequence is. For example:

diffs = [a - b for a, b in zip(seq, sorted(seq))]

This gives you the difference between each element and the element that would occupy its position in sorted order. You can build a stdev-like measure out of this (square each value, average, take the square root), or take the mean absolute diff, etc.
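
A minimal sketch of the RMS version (disorder is just an illustrative name):

from math import sqrt

def disorder(seq):
    # RMS of the differences between each element and the value that
    # would occupy its slot in sorted order: 0 for an already-sorted
    # sequence, larger the more shuffled it is.
    diffs = [a - b for a, b in zip(seq, sorted(seq))]
    return sqrt(sum(d * d for d in diffs) / len(diffs))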

Or you could compare how far away the actual index is from the "right" index. Or how far the value is from the value expected at its index based on the mean and stdev. Or… there are countless possibilities. Again, which is appropriate depends heavily on your application area.
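
To make the first of those concrete, a hypothetical sketch (rank_distance is an illustrative name):

def rank_distance(seq):
    # sorted_indices[r] is the original index of the rank-r element,
    # so abs(i - r) is how far each element sits from its sorted slot.
    sorted_indices = sorted(range(len(seq)), key=seq.__getitem__)
    return sum(abs(i - r) for r, i in enumerate(sorted_indices)) / len(seq)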

Other tips

Depends entirely on your definition of "compare their similarities".

What features do you want to compare? What features can you identify? Are there identifiable patterns? E.g., in this set there are 6 critical points, there are 2 discontinuities, etc.

You've already mentioned comparing the min/max/sum, and means and standard deviations have come up in the comments too. These are all features of the set.

Ultimately, you should be able to take all these features and build an n-dimensional descriptor, for example [min, max, mean, std, ...].
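
A minimal sketch of such a descriptor (descriptor is an illustrative name, and the exact feature list is an assumption to adapt):

from statistics import mean, stdev

def descriptor(a):
    # One feature vector per data set; extend with whatever
    # features are meaningful for your data.
    return [min(a), max(a), mean(a), stdev(a), sum(a)]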

You can then compare these n-dimensional descriptors to decide whether one is "less than", "equal to", or "more than" the other. If you want to classify other sets by whether they are more like "set A" or more like "set B", you could look into classifiers; a sketch follows the references below.

See:

Classifying High-Dimensional Patterns Using a Fuzzy Logic

Support Vector Machines
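
As a concrete illustration of the classifier idea, a hypothetical sketch: it assumes scikit-learn is installed, reuses the descriptor function above, and trains on synthetic stand-ins for "set A" and "set B".

import random
from statistics import mean, stdev
from sklearn.svm import SVC

def descriptor(a):
    return [min(a), max(a), mean(a), stdev(a), sum(a)]

# Synthetic stand-ins: arrays drawn from two different distributions.
sets_a = [[random.gauss(0, 1) for _ in range(100)] for _ in range(20)]
sets_b = [[random.gauss(5, 2) for _ in range(100)] for _ in range(20)]

X = [descriptor(s) for s in sets_a + sets_b]
y = ["A"] * 20 + ["B"] * 20
clf = SVC().fit(X, y)

new_set = [random.gauss(0, 1) for _ in range(100)]
print(clf.predict([descriptor(new_set)])[0])  # expected: A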
