The question is asking for an implementation of the Set Cover Problem, which is NP-hard, so no known algorithm finds the optimal solution quickly. However, the greedy solution to the problem - to repeatedly choose the subset that contains the most elements that have not been covered yet - does a good job in reasonable time.
You can find an implementation of that algorithm in Python at this previous question.
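For reference, the plain greedy version looks roughly like this - a minimal sketch, without the timing decorator and logging of the linked answer, assuming subsets is a list of sets and parent_set is the universe to cover:

```python
def greedy_set_cover(subsets, parent_set):
    """Plain greedy set cover: repeatedly pick the subset that covers
    the most elements not yet covered (score = len(s - result_set))."""
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        # Rescore every subset against the current partial cover.
        best = max(subsets, key=lambda s: len(s - result_set))
        if not best - result_set:
            # No subset adds anything new, so no cover exists.
            raise ValueError('subsets cannot cover parent_set')
        results.append(best)
        result_set |= best
    return results
```

Note that every subset is rescored on every iteration; that repeated work is what the optimisation below avoids.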
Edited to add: @Aaron Hall's answer can be improved by using the drop-in replacement below for his greedy_set_cover routine. In Aaron's code, we calculate a score len(s - result_set) for every remaining subset, each time we want to add a subset to the cover. However, this score can only decrease as result_set grows; so if on the current iteration we have picked out a best-so-far set whose score beats the scores the remaining subsets achieved in previous iterations, we know their scores cannot improve and we can skip rescoring them. That suggests using a priority queue to store the subsets for processing; in Python we can implement the idea with heapq:
# at top of file
import heapq
# ... etc

# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # Create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max - len(s) as a score, not len(s).
        # len(heap) just provides a unique number for each subset,
        # used to tie-break equal scores.
        heapq.heappush(heap, [max - len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                # Recalculate the top set's score and make it best-so-far.
                best = [max - len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # Because subset scores only get worse as the result_set
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max - len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(
            len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # Subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets, as when we use the heap, dominated subsets are never reprocessed anyway:
INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46
And here are the timings with the code above; about 3.7 times faster.
INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
NB: these two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; this is just down to 'lucky' choices of subsets when scores are tied.
The priority queue/heap is a well-known optimisation of the greedy algorithm, though I couldn't find a decent discussion of it to link to.
While the greedy algorithm is a quick-ish way to get an approximate answer, you can improve on that answer by spending more time afterwards, since the greedy result gives an upper bound on the size of the minimal set cover. Techniques for doing so include simulated annealing, or branch-and-bound algorithms as illustrated in this article by Peter Norvig.
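As one very simple illustration of using that upper bound (this is not Norvig's algorithm, just a naive exhaustive search pruned by the greedy result): try candidate covers of increasing size, stopping at the greedy bound; the first size that yields a cover is optimal. A sketch, exponential in the worst case, with exact_set_cover a hypothetical name:

```python
import itertools

def exact_set_cover(subsets, parent_set, upper_bound):
    """Try all combinations of 1, 2, ... up to upper_bound subsets
    (e.g. the size of a greedy cover); the first size that covers
    parent_set is a minimum cover. Returns None if no cover exists."""
    parent_set = set(parent_set)
    for k in range(1, upper_bound + 1):
        for combo in itertools.combinations(subsets, k):
            if set().union(*combo) >= parent_set:
                return list(combo)
    return None
```

This is only practical for small inputs, but it shows how the greedy answer bounds the search; branch-and-bound methods prune much more aggressively.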