Calculating the “distance” between two unordered sets
18-06-2021
Question
Assume two sets (unordered, no duplicate elements):
A = set(["z", "x", "c"])
B = set(["x", "z", "d", "e"])
These sets have two common elements, "z" and "x", and some set-specific elements: "c" in A, and "d" and "e" in B.
How can you give a pair of sets a score, analogous to a string distance, while
- disregarding the ordering of elements and
- imposing the no-duplicate constraint for each isolated set?
As you can see in the example, the size of each set can be different.
The non-critical requirements for this algorithm are:
- Insertion > Deletion if possible (a set lacking an element implies a higher cost than a set that has one too many); otherwise just INS = DEL
- Swap: 0 (no cost, since ordering has no effect on distance)
For now I have been calculating a set distance score:
from math import sqrt

score_A = len(a & b) / len(a)  # a & b is the set intersection
score_B = len(a & b) / len(b)
quadratic_score = sqrt(score_A * score_B)  # geometric mean of the two ratios
How would you recommend approaching this problem or improving my solution?
Are there any algorithms that allow specification of costs?
Right now I am about to define a simple algebra for set modification:
def calculate_distance(a, b, insertion_cost=1, deletion_cost=1):
    """
    Virtually, a programmer-friendly set-minus.
    @return the distance from A to B, mind that this is not
    a commutative operation.
    """
    score = 0
    for e in a:
        if e not in b:  # implies deletion from A
            score += deletion_cost
    for e in b:
        if e not in a:  # implies insertion into A
            score += insertion_cost
    return score
How can I normalize this value and against what?
Solution
This answer comes long after the question was asked, but hopefully it will be picked up by future visitors.
Use the Jaccard distance: the cardinality (size) of the symmetric difference of the two sets divided by the cardinality of their union. In other words, (|A ∪ B| - |A ∩ B|) / |A ∪ B|.
This assumes that the elements can be compared in a discrete fashion, i.e. two elements are either equal or not. A desirable property is that the Jaccard distance is a metric: it is 0 for identical sets, symmetric, and satisfies the triangle inequality.
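A minimal Python 3 sketch of that definition, applied to the sets from the question. The function name jaccard_distance and the choice to return 0.0 for two empty sets are assumptions made for illustration, not part of the answer itself.

```python
def jaccard_distance(a, b):
    """Return |a symmetric-difference b| / |a union b| for two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (distance 0).
        return 0.0
    return len(a ^ b) / len(union)


A = {"z", "x", "c"}
B = {"x", "z", "d", "e"}
print(jaccard_distance(A, B))  # 3 differing elements out of 5 in the union
```

The result is symmetric in its arguments, so it cannot express a higher cost for insertion than for deletion.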
OTHER TIPS
How about the size of the set intersection over the size of the larger set? So:
float(len(A.intersection(B)))/max(len(A),len(B))
It'll give you a number scaled in the range 0.0 to 1.0 which is often desirable. 1.0 representing full equality, 0.0 representing nothing in common.
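For the sets in the question this works out to 2 / 4 = 0.5. A small sketch, assuming Python 3 (where / is true division, so the float() call is not needed); the name overlap_similarity and the handling of two empty sets are illustrative choices.

```python
def overlap_similarity(a, b):
    """Size of the intersection divided by the size of the larger set."""
    larger = max(len(a), len(b))
    if larger == 0:
        # Assumption: two empty sets count as fully equal.
        return 1.0
    return len(a & b) / larger


A = {"z", "x", "c"}
B = {"x", "z", "d", "e"}
print(overlap_similarity(A, B))      # 2 common elements / 4
print(1 - overlap_similarity(A, B))  # the same score turned into a distance-like value
```

Since 1.0 means equality here, this is a similarity rather than a distance; subtracting it from 1 gives a value that is 0 for identical sets.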
Assuming the OP is asking for a "distance", I think it should be 0 when the two sets are identical, in line with the general requirements of a distance function.
It would also be good for it to be symmetric and to satisfy the triangle inequality.
Symmetry is intuitive, and the triangle inequality means d(A,C) ≤ d(A,B) + d(B,C).
I suggest something like:
C = A.intersection(B)
Distance = sqrt(len(A-C)*2 + len(B-C)*2)
However I don't know how to prove the triangle inequality yet
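Short of a proof, the inequality can at least be checked by brute force on a small universe. The sketch below reads the formula literally as written (each count multiplied by 2), which makes it equal to sqrt(2 * |A symmetric-difference B|); the helper names and the four-element universe are assumptions for illustration. A passing check is evidence, not a proof. One possible line of argument is that the size of the symmetric difference already satisfies the triangle inequality and the square root is subadditive.

```python
from itertools import combinations, product
from math import sqrt


def distance(a, b):
    """The suggested formula, taken literally."""
    c = a & b
    return sqrt(len(a - c) * 2 + len(b - c) * 2)


def all_subsets(universe):
    """Every subset of the universe, as frozensets."""
    items = list(universe)
    return [frozenset(combo)
            for r in range(len(items) + 1)
            for combo in combinations(items, r)]


def triangle_inequality_holds(universe, tol=1e-12):
    """Check d(A,C) <= d(A,B) + d(B,C) for every triple of subsets."""
    subsets = all_subsets(universe)
    return all(distance(a, c) <= distance(a, b) + distance(b, c) + tol
               for a, b, c in product(subsets, repeat=3))


print(triangle_inequality_holds("abcd"))  # 16 subsets, 4096 triples checked
```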
To normalize the result of the OP's updated function, just do score = score / (len(a) + len(b)), which (with the default costs of 1) will give you 1 when a doesn't intersect b, and 0 when a == b.
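A sketch of that normalization in Python 3. Dividing by len(a) + len(b) reaches exactly 1 for disjoint sets only when both costs are 1; for other (non-negative) costs, one option, which is an assumption here rather than part of the suggestion above, is to divide by the largest possible score, deletion_cost * len(a) + insertion_cost * len(b). The name normalized_distance is hypothetical.

```python
def calculate_distance(a, b, insertion_cost=1, deletion_cost=1):
    """Cost of turning set a into set b (same logic as the question's function)."""
    deletions = len(a - b)   # elements only in a must be deleted
    insertions = len(b - a)  # elements only in b must be inserted
    return deletions * deletion_cost + insertions * insertion_cost


def normalized_distance(a, b, insertion_cost=1, deletion_cost=1):
    """Scale the cost into [0, 1] by the worst case (a and b disjoint)."""
    worst = len(a) * deletion_cost + len(b) * insertion_cost
    if worst == 0:
        # Assumption: two empty sets (or all-zero costs) mean distance 0.
        return 0.0
    return calculate_distance(a, b, insertion_cost, deletion_cost) / worst


A = {"z", "x", "c"}
B = {"x", "z", "d", "e"}
print(normalized_distance(A, B))                    # (1 + 2) / (3 + 4)
print(normalized_distance(A, B, insertion_cost=2))  # (1 + 4) / (3 + 8)
```

With unequal costs the value is no longer symmetric: normalized_distance(A, B) and normalized_distance(B, A) can differ, which matches the "Insertion > Deletion" requirement in the question.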