Question

Given a set of N elements colored with C colors, how can I find every possible combination of length L that contains no more than a maximum of M colors?

I tried this algorithm that uses itertools.combinations to generate all the possible combinations, and then filter out those that do not hold the maximum colors condiction.

from itertools import combinations as cb

def allowed_combinations(elements, combination_size=4, max_colors=3):

    colors = set([c for k, c in elements.items()])
    combinations = cb(elements, combination_size)
    for combination in combinations:
        colors = set([elements[element] for element in combination])
        if len(colors) > max_colors:
            continue
        yield combination


elements = dict()
elements['A'] = 'red'
elements['B'] = 'red'
elements['C'] = 'blue'
elements['D'] = 'blue'
elements['E'] = 'green'
elements['F'] = 'green'
elements['G'] = 'green'
elements['H'] = 'yellow'
elements['I'] = 'white'
elements['J'] = 'white'
elements['K'] = 'black'

combinations = allowed_combinations(elements)

for c in combinations:
    for element in c:
        print("%s-%s" % (element, elements[element]))
    print "\n"

the output is like:

A-red
C-blue
B-red
E-green


A-red
C-blue
B-red
D-blue


A-red
C-blue
B-red
G-green


A-red
C-blue
B-red
F-green

...

The problem is that generating all possible combinations can be computationally very expensive. In my case, for instance, L is often 6 and the number of elements N is around 50, so it gives us Bin(50,6) = 15890700 possible combinations. If maximum number of colors allowed in a comination is small, most of combinations are "useless" and so they are discarded in the filter step. My intuition is that I should put the filtering step inside/before the combinatory step, to avoid the explotion of combinations, but I don't see how.

Was it helpful?

Solution 2

Combinatorial problems are notorious for being easy to state but possibly difficult to solve. For this one, I wouldn't use itertools at all, but instead write a custom generator. For example,

def combs(elt2color, combination_size=4, max_colors=3):

    def inner(needed, index):
        if needed == 0:
            yield result
            return
        if n - index < needed:
            # not enough elements remain to reach
            # combination_size
            return
        # first all results that don't contain elts[index]
        for _ in inner(needed, index + 1):
            yield result
        # and then all results that do contain elts[index]
        needed -= 1
        elt = elts[index]
        color = elt2color[elt]
        color_added = color not in colors_seen
        colors_seen.add(color)
        if len(colors_seen) <= max_colors:
            result[needed] = elt
            for _ in inner(needed, index + 1):
                yield result
        if color_added:
            colors_seen.remove(color)

    elts = tuple(elt2color)
    n = len(elts)
    colors_seen = set()
    result = [None] * combination_size
    for _ in inner(combination_size, 0):
        yield tuple(result)

Then:

elt2color = dict([('A', 'red'), ('B', 'red'), ('C', 'blue'),
                  ('D', 'blue'), ('E', 'green'), ('F', 'green'),
                  ('G', 'green'), ('H', 'yellow'), ('I', 'white'),
                  ('J', 'white'), ('K', 'black')])
for c in combs(elt2color):
    for element in c:
        print("%s-%s" % (element, elements[element]))
    print "\n"

produces the same 188 combinations as your post-processing code, but internally abandons a partial combination as soon as it would span more than max_colors colors. There's no way to change what itertools functions do internally, so when you want control over that, you need to roll your own.

Using itertools

Here's another approach, generating first all solutions with exactly 1 color, then exactly 2 colors, and so on. itertools can be used directly for much of this, but at the lowest level still needs a custom generator. I find this harder to understand than a fully custom generator, but it may be clearer to you:

def combs2(elt2color, combination_size=4, max_colors=3):
    from collections import defaultdict
    from itertools import combinations
    color2elts = defaultdict(list)
    for elt, color in elt2color.items():
        color2elts[color].append(elt)

    def at_least_one_from_each(iterables, n):
        if n < len(iterables):
            return # impossible
        if not n or not iterables:
            if not n and not iterables:
                yield ()
            return
        # Must have n - num_from_first >= len(iterables) - 1,
        # so num_from_first <= n - len(iterables) + 1
        for num_from_first in range(1, min(len(iterables[0]) + 1,
                                           n - len(iterables) + 2)):
            for from_first in combinations(iterables[0],
                                           num_from_first):
                for rest in at_least_one_from_each(iterables[1:],
                                             n - num_from_first):
                    yield from_first + rest

    for numcolors in range(1, max_colors + 1):
        for colors in combinations(color2elts, numcolors):
            # Now this gets tricky.  We need to pick
            # combination_size elements across all the colors, but
            # must pick at least one from each color.
            for elements in at_least_one_from_each(
                    [color2elts[color] for color in colors],
                    combination_size):
                yield elements

I haven't timed these, because I don't care ;-) The fully custom generator's single result list is reused for building each output, which slashes the rate of dynamic memory turnover. The second way creates a lot of memory churn by pasting together multiple levels of from_first and rest tuples - and that's mostly unavoidable because it uses itertools to generate the from_first tuples at each level.

Internally, itertools functions almost always work in a way more similar to the first code sample, and for the same reasons, reusing an internal buffer as much as possible.

AND ONE MORE

This is more to illustrate some subtleties. I thought about what I'd do if I were to implement this functionality in C as an itertools function. All the itertools functions were first prototyped in Python, but in a semi-low-level way, reduced to working with vectors of little integers (no "inner loop" usage of sets, dicts, sequence slicing, or pasting together partial result sequences - sticking as far as possible to O(1) worst-case time operations on dirt simple native C types after initialization).

At a higher level, an itertools function for this would accept any iterable as its primary argument, and almost certainly guarantee to return combinations from that in lexicographic index order. So here's code that does all that. In addition to the iterable argument, it also requires an elt2ec mapping, which maps each element from the iterable to its equivalence class (for you, those are strings naming colors, but any objects usable as dict keys could be used as equivalence classes):

def combs3(iterable, elt2ec, k, maxec):
    # Generate all k-combinations from `iterable` spanning no
    # more than `maxec` equivalence classes.
    elts = tuple(iterable)
    n = len(elts)
    ec = [None] * n  # ec[i] is equiv class ordinal of elts[i]
    ec2j = {} # map equiv class to its ordinal
    for i, elt in enumerate(elts):
        thisec = elt2ec[elt]
        j = ec2j.get(thisec)
        if j is None:
            j = len(ec2j)
            ec2j[thisec] = j
        ec[i] = j
    countec = [0] * len(ec2j)
    del ec2j

    def inner(i, j, totalec):
        if i == k:
            yield result
            return
        for j in range(j, jbound[i]):
            thisec = ec[j]
            thiscount = countec[thisec]
            newtotalec = totalec + (thiscount == 0)
            if newtotalec <= maxec:
                countec[thisec] = thiscount + 1
                result[i] = j
                yield from inner(i+1, j+1, newtotalec)
                countec[thisec] = thiscount

    jbound = list(range(n-k+1, n+1))
    result = [None] * k
    for _ in inner(0, 0, 0):
         yield (elts[i] for i in result)

(Note that this is Python 3 code.) As advertised, nothing in inner() is fancier than indexing a vector with a little integer. The only thing remaining to make it directly translatable to C is removing the recursive generation. That's tedious, and since it wouldn't illustrate anything particularly interesting here I'm going to ignore that.

Anyway, the interesting thing is timing it. As noted in a comment, timing results are strongly influenced by the test cases you use. combs3() here is sometimes fastest, but not often! It's almost always faster than my original combs(), but usually slower than my combs2() or @GarethRees's lovely constrained_combinations().

So how can that be when combs3() has been optimized "almost all the way down to mindless ;-) C-level operations"? Easy! It's still written in Python. combs2() and constrained_combinations() use the C-coded itertools.combinations() to do much of their work, and that makes a world of difference. combs3() would run circles around them if it were coded in C.

Of course any of these can run unboundedly faster than the allowed_combinations() in the original post - but that one can be fastest too (for example, pick just about any inputs where max_colors is so large that no combinations are excluded - then allowed_combinations() wastes little effort, while all these others add extra substantial extra overheads to "optimize" pruning that never occurs).

OTHER TIPS

Here's an implementation that's a bit simpler than the other answers posted so far. The basic approach is to:

  1. Pick a value ("colour" in your terminology) that has not been picked so far;
  2. Loop over i, the number of keys ("elements") associated with that value that will be included in the output;
  3.   Loop over c, the combinations of those keys of length i;
  4.     Recurse to pick the next value.
from collections import defaultdict, deque
from itertools import combinations

def constrained_combinations(elements, r, s):
    """Generate distinct combinations of 'r' keys from the dictionary
    'elements' using at most 's' different values. The values must be
    hashable.

        >>> from collections import OrderedDict
        >>> elements = OrderedDict(enumerate('aabbc'))
        >>> cc = constrained_combinations
        >>> list(cc(elements, 2, 1))
        [(0, 1), (2, 3)]
        >>> list(cc(elements, 3, 2))
        [(0, 1, 2), (0, 1, 3), (0, 1, 4), (0, 2, 3), (1, 2, 3), (2, 3, 4)]
        >>> list(cc(elements, 3, 3)) == list(combinations(range(5), 3))
        True
        >>> sum(1 for _ in cc(OrderedDict(enumerate('aabbcccdeef')), 4, 3))
        188

    """
    # 'value_keys' is a map from value to a list of keys associated
    # with that value; 'values' is a list of values in reverse order of
    # first appearance.
    value_keys = defaultdict(list)
    values = deque()
    for k, v in elements.items():
        if v not in value_keys:
            values.appendleft(v)
        value_keys[v].append(k)

    def helper(current, r, s):
        if r == 0:
            yield current
            return
        if s == 0 or not values:
            return
        value = values.pop()
        keys = value_keys[value]
        for i in range(min(r, len(keys)), -1, -1):
            for c in combinations(keys, i):
                for result in helper(current + c, r - i, s - min(i, 1)):
                    yield result
        values.append(value)

    return helper((), r, s)

Notes

  1. In Python 3.3 or later, you could use the yield from statement to simplify the recursive call:

    yield from helper(current + c, r - i, s - min(i, 1))
    
  2. If you're wondering why the doctests use collections.OrderedDict, it's so that the combinations can be returned in a predictable order, which is necessary for the tests to work.

  3. The code reverses the list values, and iterates downwards over i so that if the caller passes in an OrderedDict, the combinations are returned in a sensible order (with values that appear early in the input also appearing early in the output).

  4. Given the slight awkwardness in getting predictable output from this function, it would, I think, be worth considering changing the interface so that instead of taking a dictionary mapping keys to values, it would take an iterable of (key, value) pairs.

Performance

This is broadly similar in speed to Tim Peter's combs2:

>>> from timeit import timeit
>>> elements = dict(enumerate('abcde' * 10))
>>> test = lambda f:timeit(lambda:sum(1 for _ in f(elements, 6, 3)), number=1)
>>> test(combs2)
11.403807007009163
>>> test(constrained_combinations)
11.38378801709041

Rough outline.

You have in total C different colors. For each k, 1 <= k <= M, choose k colors in Bin(C,k) ways. (I'm using your notation here assuming Bin mean binomial coefficient).

For each of the above choices, collect all the elements with the chosen colors. Let's say it gives P distinct elements. Then choose L from these P elements in Bin(P, L) different ways.

All of the above subject to obvious checks, M <= C, L <= P, etc.

The advantage of this approach is that it will generate only valid combinations and every valid combination will be generated exactly once. (edit: and as pointed out in a comment, this is not true duplicate, combination can be generated).

PS. And here's an implementation of the above algorithm, with the fix for duplicated combinations:

from itertools import combinations


elts  = { 'A' : 'red', 'B' : 'red', 'C' : 'blue', 'D' : 'blue',
          'E': 'green', 'F' : 'green', 'G' : 'green', 'H' : 'yellow',
          'I' : 'white', 'J' : 'white', 'K' : 'black' }

def combs (elts, size = 4, max_colors = 3):
    # Count different colors
    colors = {}
    for e in elts.values():
        colors [e] = 1
    ncolors = len(colors)

    # for each different number of colors between 1 and 'max_colors' 
    for k in range (1, max_colors + 1):
        # Choose 'k' different colors
        for selected_colors in combinations (colors, k):
            # Select ell the elements with these colors
            selected_elts = []
            for e, c in elts.items():
                if c in selected_colors:
                    selected_elts.append (e)
            # Choose 'size' of these elements
            for chosen_elts in combinations (selected_elts, size):
                # Check the chosen elements are of exactly 'k' different colors
                t = {}
                for e in chosen_elts:
                    t[elts[e]] = 1
                if len(t) == k:
                    yield chosen_elts


#for e in combs (elts):
#    print (e)

print (len (list (combs (elts))))

PS. I also timed Tim's comb2, my own comb and Gareth's constrained_combinations with the program here with these results:

combs2 =  5.214529
constr combs = 5.290079
combs = 4.952063
combs2 = 5165700
constr combs = 5165700
combs = 5165700
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top