Treap
A Treap is an example of a Balanced Binary Search Tree (you can use any of them for this problem). The expected height of a Treap containing n elements is O(logn) - expected, because Treap is a randomized data structure.
The following solution works for any Binary Search Tree, but it performs much better if the Balanced Binary Search Tree is used (e.g. Treap).
Measure
One measure of a similarity between two sets is the Jaccard Index. Let's call our sets A and B. The Jaccard Index is defined by:
So to compute the Jaccard Index of A and B, we must compute the sum and the intersection of A and B.
Operations
Let's assume that A and B are implemented as Balanced Binary Search Trees.
A Binary Search Tree could support many operations, but three of them are sufficient for this problem:
- find(x) - returns true if an only if x is in the Tree
- insert(x) - inserts x in the Tree, if x is not in the Tree before this operation
- size() - returns the number of elements in the Tree
In the Balanced Binary Search Tree, find(x) and insert(x) have O(logn) running time, where n is the number of elements in the Tree.
In addition, during insertion, we can keep track of the size of the Tree, so size() can be implemented in a constant time.
Of course, we could iterate over all elements of our Tree.
Pseudocode
Step 1.
sum(A, B):
C = A
foreach x in B:
C.insert(x)
return C
Step 2.
intersection(A, B):
C = new BalancedBinarySearchTree()
foreach x in B:
if(A.find(x) == true):
C.insert(x)
return C
Step 3.
Calculate the Jaccard index of A and B:
JaccardIndex(A, B)
S = sum(A, B)
I = intersect(A, B)
return I.size() / S.size()
Complexity
Let's assume that:
n = A.size()
m = B.size()
Then the complexity of computing the sum is O(n + m * log(n + m)), and the complexity of calculating the intersection is O(m * log n).