Algorithm for merging sets that share at least 2 elements

https://stackoverflow.com/questions/312912

10-07-2019
|

Question

Given a list of sets:

S_1 : [ 1, 2, 3, 4 ]
S_2 : [ 3, 4, 5, 6, 7 ]
S_3 : [ 8, 9, 10, 11 ]
S_4 : [ 1, 8, 12, 13 ]
S_5 : [ 6, 7, 14, 15, 16, 17 ]

What the most efficient way to merge all sets that share at least 2 elements? I suppose this is similar to a connected components problem. So the result would be:

[ 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17] (S_1 UNION S_2 UNION S_5)
[ 8, 9, 10, 11 ]
[ 1, 8, 12, 13 ] (S_4 shares 1 with S_1, and 8 with S_3, but not merged because they only share one element in each)

The naive implementation is O(N^2), where N is the number of sets, which is unworkable for us. This would need to be efficient for millions of sets.

Solution

Let there be a list of many Sets named (S)

Perform a pass through all elements of S, to determine the range (LOW .. HIGH).

Create an array of pointer to Set, of dimensions (LOW, HIGH), named (M).

do
    Init all elements of M to NULL.   

    Iterate though S, processing them one Set at a time, named (Si).

        Permutate all ordered pairs in Si. (P1, P2) where P1 <= P2.
        For each pair examine M(P1, P2)
            if M(P1, P2) is NULL
                Continue with the next pair.
            otherwise
                Merge Si, into the Set pointed to by, M(P1, P2).
                Remove Si from S, as it has been merged.
                Move on to processing Set S(i + 1)

        If Si was not merged, 
            Permutate again through Si
            For each pair, make M(P1, P2) point to Si.

while At least one set was merged during the pass.

My head is saying this is about Order (2N ln N). Take that with a grain of salt.

OTHER TIPS

If you can order the elements in the set, you can look into using Mergesort on the sets. The only modification needed is to check for duplicates during the merge phase. If one is found, just discard the duplicate. Since mergesort is O(n*log(n)), this will offer imrpoved speed when compared to the naive O(n^2) algorithm.

However, to really be effective, you should maintain a sorted set and keep it sorted, so that you can skip the sort phase and go straight to the merge phase.

I don't see how this can be done in less than O(n^2).

Every set needs to be compared to every other one to see if they contain 2 or more shared elements. That's n*(n-1)/2 comparisons, therefore O(n^2), even if the check for shared elements takes constant time.

In sorting, the naive implementation is O(n^2) but you can take advantage of the transitive nature of ordered comparison (so, for example, you know nothing in the lower partition of quicksort needs to be compared to anything in the upper partition, as it's already been compared to the pivot). This is what result in sorting being O(n * log n).

This doesn't apply here. So unless there's something special about the sets that allows us to skip comparisons based on the results of previous comparisons, it's going to be O(n^2) in general.

Paul.

One side note: It depends on how often this occurs. If most pairs of sets do share at least two elements, it might be most efficient to build the new set at the same time as you are stepping through the comparison, and throw it away if they don't match the condition. If most pairs do not share at least two elements, then deferring the building of the new set until confirmation of the condition might be more efficient.

If your elements are numerical in nature, or can be naturally ordered (ie. you can assign a value such as 1, 2, 42 etc...), I would suggest using a radix sort on the merged sets, and make a second pass to pick up on the unique elements.

This algorithm should be of O(n), and you can optimize the radix sort quite a bit using bitwise shift operators and bit masks. I have done something similar for a project I was working on, and it works like a charm.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow