The question is asking for an implementation of the Set Cover Problem, which is NP-hard, so no known algorithm finds the optimal solution quickly. However, the greedy solution to the problem - to repeatedly choose the subset that contains the most elements that have not been covered yet - does a good job in reasonable time.
You can find an implementation of that algorithm in Python at this previous question.
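For reference, the plain greedy version looks roughly like this - a minimal sketch, without the timing decorator and logging of the linked answer, assuming subsets is a list of sets and parent_set is the universe to cover:

```python
def greedy_set_cover(subsets, parent_set):
    """Plain greedy set cover: repeatedly pick the subset that covers
    the most elements not yet covered (score = len(s - result_set))."""
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        # Rescore every subset against the current partial cover.
        best = max(subsets, key=lambda s: len(s - result_set))
        if not best - result_set:
            # No subset adds anything new, so no cover exists.
            raise ValueError('subsets cannot cover parent_set')
        results.append(best)
        result_set |= best
    return results
```

Note that every subset is rescored on every iteration; that repeated work is what the optimisation below avoids.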
Edited to add: @Aaron Hall's answer can be improved by using the drop-in replacement below for his greedy_set_cover routine. In Aaron's code, we calculate a score len(s - result_set) for every remaining subset, each time we want to add a subset to the cover. However, this score can only decrease as result_set grows; so if on the current iteration we have picked out a best-so-far set whose score beats the scores the remaining subsets achieved in previous iterations, we know their scores cannot improve and we can skip rescoring them. That suggests using a priority queue to store the subsets for processing; in Python we can implement the idea with heapq:
# at top of file
import heapq
# ... etc

# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # Create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max - len(s) as a score, not len(s).
        # len(heap) just provides a unique number for each subset,
        # used to tie-break equal scores.
        heapq.heappush(heap, [max - len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                # Recalculate the top set's score and make it best-so-far.
                best = [max - len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # Because subset scores only get worse as the result_set
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max - len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(
            len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # Subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets, as when we use the heap, dominated subsets are never reprocessed anyway:
INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46
And here are the timings with the code above; about 3.7 times faster.
INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
NB: these two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; this is just down to 'lucky' choices of subsets when scores are tied.
The priority queue/heap is a well-known optimisation of the greedy algorithm, though I couldn't find a decent discussion of it to link to.
While the greedy algorithm is a quick-ish way to get an approximate answer, you can improve on that answer by spending more time afterwards, since the greedy result gives an upper bound on the size of the minimal set cover. Techniques for doing so include simulated annealing, or branch-and-bound algorithms as illustrated in this article by Peter Norvig.
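As one very simple illustration of using that upper bound (this is not Norvig's algorithm, just a naive exhaustive search pruned by the greedy result): try candidate covers of increasing size, stopping at the greedy bound; the first size that yields a cover is optimal. A sketch, exponential in the worst case, with exact_set_cover a hypothetical name:

```python
import itertools

def exact_set_cover(subsets, parent_set, upper_bound):
    """Try all combinations of 1, 2, ... up to upper_bound subsets
    (e.g. the size of a greedy cover); the first size that covers
    parent_set is a minimum cover. Returns None if no cover exists."""
    parent_set = set(parent_set)
    for k in range(1, upper_bound + 1):
        for combo in itertools.combinations(subsets, k):
            if set().union(*combo) >= parent_set:
                return list(combo)
    return None
```

This is only practical for small inputs, but it shows how the greedy answer bounds the search; branch-and-bound methods prune much more aggressively.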