fast similarity detection

https://stackoverflow.com/questions/1889065

19-09-2019
|

Question

I have a large collection of objects and I need to figure out the similarities between them.

To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).

I need the ability to quickly find, given an object, the set of objects similar to it.

To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.

Have you seen this problem before, or something similar to it? What is a good solution?

To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow - O(n²) where n is the number of objects. Is there a general solution with lower complexity?

Solution

Without knowing more details of the metric, it's hard to say. I don't have any ideas for eliminating the O(n^2) aspect, but there may be a way to reduce some of the constants involved. For example, if you had a Euclidean metric d(p,q) = sqrt( (p_1-q_1)^2 + ..+ (p_n-q_n)^2), you could square your distance d and compare it to the partial sums of (p_i-q_i)^2 and stop when you exceed d^2.

Whether this will actually save you time depends on how expensive the compare is to just calculating the summands and how many summand calculations you could expect to avoid by doing this (obviously, the smaller d is, the better).

OTHER TIPS

I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d.

It might be fastest to just abandon the similarity computation when the subtotal becomes larger than d. For example, if your similarities are based on cosine or hausdorff distances this can easily be done.

PS: if this cannot be done, your problem might be related to the k-nearest neighbors problem (or more precise a nearest neighbor problem with a threshold neighborhood). You should look for algorithms that find close-by members without computing all distances (maybe something using triangle inequality). Wikipedia should help you to explore suitable algorithms.

If your similarity measure is transitive, you don't have to compute the similarity for all pairs of objects since for objects a, b, c:

similarity(a,c) = similarity(a,b) op similarity(b,c)

where op is a binary operator e.g. multiplication or addition.

I think the solution depends on a lot more detail about the nature of your problem.

Do you need to find the similar objects for the same object many times, or only once? If it's many times, then creating a data structure where you compute the difference once for each pair and then connect objects to similar objects so that you can retrieve the list quickly without recalculation might be a very useful performance enhancement.
What is the nature of the calculation? At one extreme, if the nature of the difference is that it is, for example, the difference in height between two people, then maintaining the list sorted by height would let you find the similar objects very quickly. I'm assuming the real problem is more complicated than that, but following on that logic, if the difference is the sum of several linear quantities, you could create a multi-dimenstional array, and then conceptually imagine the set of similar objects as those within an n-dimensional sphere (i.e. circle, sphere, hypersphere, etc) centered around the reference object, and again find them directly. Actually it occurs to me that if the radius calculations are too complicated or take too much run-time, a good approximation would be to create an n-dimensional cube (i.e. square, cube, tesseract, etc) around the reference object, retrieve all objects which lie within that cube as "candidates", and then just do the actual computation on the candidates.

For example, suppose the "difference" is the sum of the absolute values of the differences of three attributes, say a1, a2, and a3. You could create a 3-dimensional array and set the value of each node of the array to the object with those values, if any. Then if you want to find all objects with difference less than d from object o, you could write:

for (x1=o.a1-d;x1<o.a1+d;++x1)
{
  for (x2=o.a2-d;x1<o.a2+d;++x2)
  {
    for (x3=o.a3-d;x1<o.a3+d;++x3)
    {
      if (array[x1][x2][x3]!=null
        && (abs(x1-o.a1)+abs(x2-o.a2)+abs(x3-o.a3)<=d)
        {
          ... found a match ...
        }
    }
  }
}

I suspect that the difference rules are more complicated than that, but fine, just add sophistication to the alrorithm to match the complexity of the rules. The point is to use the array to limit the set of objects that you have to examine.

Again on the nature of the calculation: If one of the elements making up the difference, or some small subset, tends to be more significant than others, then create a data structure that allows you to quickly compare for this within range. If it is in range, do the full compare. If not, then you don't even look at it.

Is it not possible to use a kd-tree?

It may be necessary (if possible) to normalize the dimensions. Afterwards, you just need to populate the tree, and use a "nearest N neighbors" search, and try to find any object within some range.

Example of objects: Images, Documents. Of course working with the raw representation of these objects is mostly not useful. usually one would pre-process the raw form and turn it into some normalized form (for documents, say a vector for which each entry represents the number/percent of times a certain word appeared, for images it could be a representation of visual features found in the image).

if d is fixed and a n^2 pre-computation is feasible, you could just use a graph representation using a linked list for each object for example. You can have more efficient solutions on the expense of accuracy using approximate nearest neighbors algorithms.

Can we assume that similarity is transitive, ie. diff(a,c) == diff(a,b) + diff(b,c)? If so, you can try the following:

Sort the collection of objects. If the object similarity metric doesn't have a decent absolute value, you can arbitrarily select one object as "zero" and sort all other objects by their similarity to that object.
To find the objects with similarity s to o, find o in the sorted list, and search to the left and to the right until the diff grows larger than s.

The advantage of this is that the sorting can be done once, and subsequent set building is proportional to the number of members that will be in the set.

Sounds like BK-Tree. Here is a small example. You basically create tree and check which branch should be used for similar object search and which not, so you prevent O(n2)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow