Question

Assume two sets (unordered, no duplicate elements):

A = set(["z", "x", "c"])
B = set(["x", "z", "d", "e"])

These sets have two elements in common, "z" and "x", and some set-specific elements: "c", "d" and "e".

How can you give such a pair of sets a score, much like a string distance, while

  • disregarding the ordering of elements and
  • imposing the no-duplicate constraint on each individual set?

As you can see in the example, the size of each set can be different.

The non-critical requirements for this algorithm are:

  • Insertion > Deletion (a set lacking an element should cost more than a set that has one element too many) if possible, or simply INS = DEL
  • Swap: 0 (no cost, since ordering has no effect on distance)

For now I have been calculating a set distance score:

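from math import sqrt

common = A & B                          # set intersection
score_A = len(common) / float(len(A))   # share of A that is also in B
score_B = len(common) / float(len(B))   # share of B that is also in A

quadratic_score = sqrt(score_A * score_B)   # geometric mean of the two shares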

How would you recommend approaching this problem or improving my solution?

Are there any algorithms that allow specification of costs?


Right now I am about to define a simple algebra for set modification:

def calculate_distance( a, b, insertion_cost=1, deletion_cost=1 ):
    """
    Virtually, a programmer-friendly set-minus.

    @return     the distance from A to B, mind that this is not
                a commutative operation.
    """
    score = 0
    for e in a:
        if e not in b: # implies deletion from A
            score += deletion_cost

    for e in b:
        if e not in a: # implies insertion into A
            score += insertion_cost

    return score

How can I normalize this value and against what?


Solution

This answer comes long after the question was asked, but hopefully it will be useful to future visitors.

Use the Jaccard distance: the cardinality (size) of the symmetric difference of the two sets divided by the cardinality of their union. In other words, (|A ∪ B| − |A ∩ B|) / |A ∪ B|.

This assumes that the elements can be compared in a discrete fashion, i.e. they are equal or not. A desirable property is that the Jaccard distance is a metric.
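
As a minimal sketch of the Jaccard distance described above: the function name `jaccard_distance` and the convention that two empty sets have distance 0 are assumptions of this example, not part of the answer.

```python
def jaccard_distance(a, b):
    """Return |a ^ b| / |a | b|, the Jaccard distance between two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (distance 0).
        return 0.0
    # a ^ b is the symmetric difference: elements in exactly one of the sets.
    return len(a ^ b) / float(len(union))


A = set(["z", "x", "c"])
B = set(["x", "z", "d", "e"])
print(jaccard_distance(A, B))  # 3 differing elements out of 5 in the union
```

For the sets from the question, the symmetric difference is {"c", "d", "e"} and the union has five elements, so the distance is 3/5. Identical sets give 0 and disjoint non-empty sets give 1.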

OTHER TIPS

How about the size of the set intersection over the size of the larger set? So:

float(len(A.intersection(B)))/max(len(A),len(B))

It'll give you a number in the range 0.0 to 1.0, which is often desirable: 1.0 represents full equality and 0.0 represents nothing in common.
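
A small sketch of this ratio wrapped in a function, applied to the sets from the question. The name `overlap_over_larger` and the handling of two empty sets are assumptions made for this illustration.

```python
def overlap_over_larger(a, b):
    """Size of the intersection divided by the size of the larger set."""
    larger = max(len(a), len(b))
    if larger == 0:
        # Assumption: two empty sets count as fully equal.
        return 1.0
    return len(a & b) / float(larger)


A = set(["z", "x", "c"])
B = set(["x", "z", "d", "e"])
print(overlap_over_larger(A, B))  # 2 common elements / 4 in the larger set
```

Note that this is a similarity rather than a distance; if a distance is wanted, `1 - overlap_over_larger(a, b)` is one option.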

Similar question to this one

Assuming the OP is asking for a "distance", I think it should be 0 when the two sets are identical, in line with the general requirements of a distance function.

It would also be good for it to be symmetric and to satisfy the triangle inequality.

Symmetry is intuitive, and the triangle inequality means d(A,C) ≤ d(A,B) + d(B,C).

I suggest something like:

C = A.intersection(B)
Distance = sqrt(len(A-C)*2 + len(B-C)*2)

However, I don't know how to prove the triangle inequality yet.
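
Short of a proof, the triangle inequality can at least be checked by brute force on a small universe. The sketch below takes the formula above literally (with `*2`, so it equals the square root of twice the size of the symmetric difference). The helper names `distance` and `triangle_inequality_holds` are hypothetical, and a passing check is evidence for the tested universe only, not a proof.

```python
from itertools import combinations
from math import sqrt


def distance(a, b):
    """The formula from the answer, taken literally (with *2, not **2)."""
    c = a & b
    return sqrt(len(a - c) * 2 + len(b - c) * 2)


def triangle_inequality_holds(universe):
    """Check d(x, z) <= d(x, y) + d(y, z) for all subsets of `universe`."""
    items = list(universe)
    subsets = [frozenset(combo)
               for size in range(len(items) + 1)
               for combo in combinations(items, size)]
    eps = 1e-12  # tolerance for floating-point rounding
    return all(distance(x, z) <= distance(x, y) + distance(y, z) + eps
               for x in subsets for y in subsets for z in subsets)


print(triangle_inequality_holds("abcd"))  # checks all 16**3 triples of subsets
```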


To normalize the result of the OP's updated function, divide it by the combined size of the two sets: score = score / (len(a) + len(b))

With the default costs of 1, this gives 1 when a doesn't intersect b, and 0 when a == b
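
When the insertion and deletion costs differ, dividing by len(a) + len(b) no longer tops out at 1. One possible generalization, offered as a sketch under that assumption and not as part of the original answer, is to divide by the cost of the fully disjoint case. The name `normalized_distance` is hypothetical.

```python
def normalized_distance(a, b, insertion_cost=1, deletion_cost=1):
    """Weighted set distance scaled to [0, 1]; not symmetric if costs differ."""
    deletions = len(a - b)   # elements that must be removed from a
    insertions = len(b - a)  # elements that must be added to a
    score = deletions * deletion_cost + insertions * insertion_cost
    # Assumption: normalize by the cost of the disjoint case, where every
    # element of a is deleted and every element of b is inserted.
    worst_case = len(a) * deletion_cost + len(b) * insertion_cost
    if worst_case == 0:
        return 0.0  # both sets empty (or all costs zero): nothing to change
    return score / float(worst_case)


A = set(["z", "x", "c"])
B = set(["x", "z", "d", "e"])
print(normalized_distance(A, B))                    # unit costs: 3 / (3 + 4)
print(normalized_distance(A, B, insertion_cost=2))  # (1 + 4) / (3 + 8)
```

With unit costs this reduces to score / (len(a) + len(b)) as suggested above.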

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow