Question

I have a list of sets that I would like to sort into a partial order based on the subset relation.

In fact, I do not require the complete ordering, only the minimal elements.

If I am not mistaken, each minimal element should define one separate component of the respective graph - and this component should be a meet-semilattice.

What would be the most convenient, space- and time-efficient way to solve this problem? Perhaps there is a way that does not require building the entire graph? Perhaps there is a known algorithm under better terminology than what I have naively described above?

I am aware that the time and space requirements are underspecified above, but I would be happy about any suggestions, whether they are proven to be optimal or not...

Background: I am currently building an entire graph database that holds all the edges between the sets and then looking for the nodes that have no generalizations, but this is quite complicated, slow, and requires a lot of (disk) space. The list mentioned above contains roughly 100 million sets.


Solution

One approach is to sort the sets by increasing size, then repeatedly perform the following: take the first set in the list, output it, and remove from the list all supersets of it. This will output all of the minimal sets. The running time is $O(nk)$ set comparisons plus $O(n \log n)$ steps for sorting, where $n$ is the number of sets you have and $k$ is the number of minimal elements. Or, to put it another way, if each set contains $m$ elements, the running time will be approximately $O(n(k+\log n)m)$ basic steps.

Why sort by size? This is an optimization. The smallest set is guaranteed to be minimal (there is none of smaller cardinality in the list, so none of its proper subsets can be in the list), so sorting by size is a useful trick for identifying a set that must surely be minimal.

Without sorting by size, the worst-case running time is likely to end up as $O(n^2)$ set comparisons (or $O(n^2 m)$ basic steps), which is worse when $k \ll n$.
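
For concreteness, a minimal sketch of this approach in Python might look as follows (assuming the sets are given as Python set or frozenset objects; the function name is only illustrative):

def minimal_sets(sets):
    remaining = sorted(sets, key=len)   # smallest sets first
    minimal = []
    while remaining:
        smallest = remaining[0]         # guaranteed to be minimal
        minimal.append(smallest)
        # drop it and every superset of it (<= is the subset test for Python sets)
        remaining = [s for s in remaining[1:] if not smallest <= s]
    return minimal

Each pass through the while loop makes one scan over the remaining list, so with $k$ minimal elements this is the $O(nk)$ set comparisons mentioned above.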


Here is an optimized version of that algorithm. Let $M$ be a data structure that stores a set of sets, as a trie: for instance, the set $\{1,3,6,7\}$ corresponds to the word $1367$ and is stored in the trie accordingly. Initially, $M$ is empty. Repeat the following: take the next set $S$ from the list; check whether any set in $M$ is a subset of $S$; if not, insert $S$ into $M$; finally delete $S$ from the list (or advance your pointer to the next element in the list). The "check..." operation can be performed fairly efficiently using a recursive traversal of the trie. At the end, once you've gone through the entire list, output $M$.

The worst-case running time of the optimized algorithm remains the same. In practice the running time might be improved significantly, perhaps to as fast as $O(nm)$ basic steps in some cases if you are lucky (but don't count on it). You can try both and see which works better in practice on the kind of workloads you are dealing with.
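
To make the trie idea a bit more concrete, a rough sketch could look as follows (assuming the set elements are sortable, e.g. integers; the class and function names are only illustrative, and a real implementation would need more care):

class TrieNode:
    def __init__(self):
        self.children = {}     # element -> TrieNode
        self.terminal = False  # True if a stored set ends at this node

def insert(root, elements):
    node = root
    for e in sorted(elements):
        node = node.children.setdefault(e, TrieNode())
    node.terminal = True

def contains_subset(node, elements):
    # Is some set stored below `node` a subset of `elements` (a sorted list)?
    if node.terminal:
        return True
    for i, e in enumerate(elements):
        child = node.children.get(e)
        if child is not None and contains_subset(child, elements[i + 1:]):
            return True
    return False

def minimal_sets_trie(sets):
    root = TrieNode()        # the trie plays the role of M
    minimal = []
    for s in sorted(sets, key=len):      # smallest sets first, as above
        if not contains_subset(root, sorted(s)):
            insert(root, s)
            minimal.append(s)
    return minimal

The subset check only follows trie branches labelled with elements that actually occur in $S$, which is where the potential speed-up over comparing $S$ against every minimal set found so far comes from.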

OTHER TIPS

I have found a solution in this paper, p.12.

The algorithm given there as a proof should translate to the following Python code:

T = set()                  # current set of minimal elements
for x in X:
    rem = set()            # minimal elements that are specializations of x
    spe = False            # is x a specialization of some minimal element?
    for a in T:
        rel = oracle(x, a)
        if rel == "x>a":       # x specializes a, so x is not minimal
            spe = True
            break
        elif rel == "x<a":     # a specializes x, so a is not minimal
            rem.add(a)
    if not spe:
        T -= rem
        T.add(x)

I expect the break to be crucial for the actual runtime, so it might be a good idea to sort X in advance to get early breaks -- but I am not sure about that.

Another point that I see here is that > is supposed to be irreflexive, so x>x does not hold. But for this code it would be better if it did: if a == x, the loop would break immediately instead of unnecessarily looking further.

UPDATE: I have now been able to test different implementations in Python. Please allow me to give the Python code directly; I think it is sufficiently similar to pseudocode -- and perhaps more concrete for many people.

Here is the implementation as taken from the paper:

def oracle(rep1, rep2):
    if generalizes(rep2, rep1):    # rep2 generalizes rep1, i.e. rep1 > rep2
        return ">"
    elif generalizes(rep1, rep2):  # rep1 generalizes rep2, i.e. rep1 < rep2
        return "<"
    return None

def find_min_els_(repIDs, ID2rep):
    min_els = set()
    for x in repIDs:
        spec_of_x = set()       # current minimal elements that are specializations of x
        x_is_spec = False       # is x a specialization of a current minimal element?
        for min_el in min_els:
            relation = oracle(ID2rep[x], ID2rep[min_el])
            if relation == ">":
                x_is_spec = True
                break
            elif relation == "<":
                spec_of_x.add(min_el)
        if not x_is_spec:
            min_els -= spec_of_x
            min_els.add(x)
    return min_els

Now this turned out to be too slow, and we can already tell from the complexity that it is very bad if the width of the partial order, that is, the number m of minimal elements, is expected to be large.

The trick is to make this algorithm independent of m by avoiding going through all current minimal elements. Instead, we can make use of the fact that the lookup in the result set is fast (I guess this is where the trie comes into play).

For each x, we generate all of its generalizations. Now the complexity depends on the number of sets x and on their size, but not so much on the number of minimal elements (only O(n log n)?). Even better, since we now remove non-minimal elements from the initial list of elements instead of adding minimal ones to an initially empty result, the time used for each x decreases over the runtime instead of increasing.

Here is the respective code:

def generalizations(spe_rep, gens):
    # yields all proper subsets of spe_rep as strings; gens tracks what was already generated
    for el in spe_rep:
        gen_rep = spe_rep - set([el])
        gen_str = string(gen_rep)   # string() is assumed to map a set to its canonical ID
        if gen_str not in gens:
            gens.add(gen_str)
            yield gen_str
            for x in generalizations(gen_rep, gens):
                yield x

def find_min_els(repIDs, ID2rep):
    min_els = set(repIDs)            # start with all elements, then remove non-minimal ones
    for x in repIDs:
        for genID in generalizations(ID2rep[x], set()):
            if genID in min_els:     # a proper generalization of x exists, so x is not minimal
                min_els.remove(x)
                break
    return min_els

This uses a generator function generalizations() to avoid computing more generalizations of x once one has already been found among the current minimal elements. This is already quite fast with my data, but it could perhaps be improved by generating more general generalizations first (it needs to be tested whether this makes it faster) and in particular by generating only generalizations that consist of elements that have already been observed in the current minimal elements. For example, if our x is {a,b,c} but no current minimal element has c in it, we do not have to generate any subset of x that contains c, i.e. only {a,b}, {a}, {b}, {}.
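
As a rough sketch, that last restriction could look like this (assuming an auxiliary map element_counts from each element to the number of current minimal elements containing it, which would have to be maintained as minimal elements are removed; string() is the same ID helper as above):

def restricted_generalizations(spe_rep, element_counts):
    # keep only the elements of x that occur in at least one current minimal element
    observed = set(el for el in spe_rep if element_counts.get(el, 0) > 0)
    if observed != spe_rep:
        # the reduced set itself is already a proper generalization of x
        yield string(observed)
    for gen_str in generalizations(observed, set()):
        yield gen_str

The membership test against min_els would stay exactly as in find_min_els above; only the candidate generalizations change.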

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange