I have two arrays generated by two different systems that are independent of each other. I want to compare their similarities by comparing only a few numbers generated from the arrays.

Right now, I'm only comparing the min, max, and sum of the arrays, but I was wondering if there is a better algorithm out there. Any type of hashing algorithm would need to be insensitive to small floating-point differences between the arrays.

EDIT: What I'm trying to do is verify that two algorithms generate the same data without having to compare the data directly. So the algorithm should be sensitive to shifts in the data but relatively insensitive to small differences between individual elements.


Solution

I wouldn't try to reduce this to one number; just pass around a tuple of values, and write a close_enough function that compares the tuples.

For example, you could use (mean, stdev) as your value, and then define close_enough as "each array's mean is within 0.25 stdev of the other array's mean".

from statistics import mean, stdev

def mean_stdev(a):
    # Summarize an array by its mean and standard deviation.
    return mean(a), stdev(a)

def close_enough(mean_stdev_a, mean_stdev_b):
    # "Close enough": each array's mean is within 0.25 stdev
    # of the other array's mean.
    mean_a, stdev_a = mean_stdev_a
    mean_b, stdev_b = mean_stdev_b
    diff = abs(mean_a - mean_b)
    return diff < 0.25 * stdev_a and diff < 0.25 * stdev_b
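
For example, with two arrays that differ only by small floating-point noise:

a = [1.0, 2.0, 3.0]
b = [1.01, 2.02, 2.98]
print(close_enough(mean_stdev(a), mean_stdev(b)))  # True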

Obviously the right value is something you want to tune based on your use case. And maybe you actually want to base it on, e.g., variance (square of stdev), or variance and skew, or stdev and sqrt(skew), or some completely different normalization besides arithmetic mean. That all depends on what your numbers represent, and what "close enough" means.
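
For instance, a (variance, skew) summary might look like the following sketch; variance_skew is just an illustrative name, the skewness is a simple moment-based estimate computed by hand so only the standard library is needed, and it assumes non-constant data:

from statistics import mean, pstdev

def variance_skew(a):
    # Population variance plus a moment-based skewness estimate.
    # Assumes the data is not constant (pstdev(a) != 0).
    m = mean(a)
    sd = pstdev(a)
    return sd ** 2, mean(((x - m) / sd) ** 3 for x in a)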

Without knowing anything about your application area, it's hard to give anything more specific. For example, if you're comparing audio fingerprints (or DNA fingerprints, or fingerprint fingerprints), you'll want something very different from if you're comparing JPEG-compressed images of landscapes.


In your comment, you say you want to be sensitive to the order of the values. To deal with this, you can generate some measure of how "out-of-order" a sequence is. For example:

diffs = [a - b for a, b in zip(seq, sorted(seq))]

This gives you the difference between each element and the element that would occupy its position in sorted order. You can build a stdev-like measure out of this (square each value, average, take the square root), or take the mean absolute diff, etc.
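
A minimal sketch of the RMS version (disorder is just an illustrative name):

from math import sqrt

def disorder(seq):
    # RMS of the differences between each element and the value that
    # would occupy its slot in sorted order: 0 for an already-sorted
    # sequence, larger the more shuffled it is.
    diffs = [a - b for a, b in zip(seq, sorted(seq))]
    return sqrt(sum(d * d for d in diffs) / len(diffs))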

Or you could compare how far away the actual index is from the "right" index. Or how far the value is from the value expected at its index based on the mean and stdev. Or… there are countless possibilities. Again, which is appropriate depends heavily on your application area.
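
To make the first of those concrete, a hypothetical sketch (rank_distance is an illustrative name):

def rank_distance(seq):
    # sorted_indices[r] is the original index of the rank-r element,
    # so abs(i - r) is how far each element sits from its sorted slot.
    sorted_indices = sorted(range(len(seq)), key=seq.__getitem__)
    return sum(abs(i - r) for r, i in enumerate(sorted_indices)) / len(seq)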

Other tips

Depends entirely on your definition of "compare their similarities".

What features do you want to compare? What features can you identify? Are there identifiable patterns? E.g., in this set there are 6 critical points, there are 2 discontinuities, etc.

You've already mentioned comparing the min/max/sum, and means and standard deviations have come up in the comments too. These are all features of the set.

Ultimately, you should be able to take all these features and build an n-dimensional descriptor, for example [min, max, mean, std, ...].
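
A minimal sketch of such a descriptor (descriptor is an illustrative name, and the exact feature list is an assumption to adapt):

from statistics import mean, stdev

def descriptor(a):
    # One feature vector per data set; extend with whatever
    # features are meaningful for your data.
    return [min(a), max(a), mean(a), stdev(a), sum(a)]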

You can then compare these n-dimensional descriptors to decide whether one is "less than", "equal to", or "more than" the other. If you want to classify other sets by whether they are more like "set A" or more like "set B", you could look into classifiers; a sketch follows the references below.

See:

Classifying High-Dimensional Patterns Using a Fuzzy Logic

Support Vector Machines
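
As a concrete illustration of the classifier idea, a hypothetical sketch: it assumes scikit-learn is installed, reuses the descriptor function above, and trains on synthetic stand-ins for "set A" and "set B".

import random
from statistics import mean, stdev
from sklearn.svm import SVC

def descriptor(a):
    return [min(a), max(a), mean(a), stdev(a), sum(a)]

# Synthetic stand-ins: arrays drawn from two different distributions.
sets_a = [[random.gauss(0, 1) for _ in range(100)] for _ in range(20)]
sets_b = [[random.gauss(5, 2) for _ in range(100)] for _ in range(20)]

X = [descriptor(s) for s in sets_a + sets_b]
y = ["A"] * 20 + ["B"] * 20
clf = SVC().fit(X, y)

new_set = [random.gauss(0, 1) for _ in range(100)]
print(clf.predict([descriptor(new_set)])[0])  # expected: A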
