Python : DIY generalize this "all_subsets" function to any size subsets

Question 1

You mention in a comment that the code here is opaque to you. But it's probably the best way of implementing a combinations function of the kind you're aiming for, and it's worth understanding, so I'll try to explain it in detail.

The basic idea is that given a sequence and a number of items to choose, we can represent each combination as a sequence of indices into the given sequence. So for example, say we have a list ['a', 'b', 'c', 'd', 'e'], and we want to generate all combinations of two values from that list.

Our first combination looks like this...

['a', 'b', 'c', 'd', 'e']
  ^    ^

...and is represented by the list of indices [0, 1]. Our next combination looks like this:

['a', 'b', 'c', 'd', 'e']
  ^         ^

And is represented by the list of indices [0, 2].

We keep moving the second caret forward, keeping the first in place, until the second caret reaches the end. Then we move the first caret to index 1 and "reset" the process by moving the second caret back to index 2.

['a', 'b', 'c', 'd', 'e']
       ^    ^

Then we repeat the process, moving the second caret forward until it reaches the end, and then moving the first forward by one and resetting the second.

Now we have to figure out how to do this by manipulating the list of indices. It turns out that this is quite simple. The final combination will look like this:

['a', 'b', 'c', 'd', 'e']
                 ^    ^

And the index representation of this will be [3, 4]. These are the maximum possible values for the indices, and are equal to i + n - r, where i is the position in the list, n is the number of values (5 in this case), and r is the number of choices (2 in this case). So as soon as a particular index reaches this value, it can go no higher, and will need to be "reset".

So with that in mind, here's a step-by-step analysis of the code:

def combinations(iterable, r):
    pool = tuple(iterable)
    n = len(pool)

First, given input based on the above example, pool would be is our list of characters above converted into a tuple, and n is simply the number of items in the pool.

if r > n:
    return

We can't select more than n items from an n item list without replacement, so we simply return in that case.

indices = range(r)

Now we have our indices, initialized to the first combination ([0, 1]). So we yield it:

yield tuple(pool[i] for i in indices)

Then we generate the remaining combinations, using an infinite loop.

while True:

Inside the loop, we first step backwards through the list of indices searching for an index that hasn't reached it's maximum value yet. We use the formula described above (i + n - r) to determine the maximum value for a given index. If we find an index that hasn't reached it's maximum value, then we break out of the loop.

    for i in reversed(range(r)):
        if indices[i] != i + n - r:
            break

If we don't find one, then that means that all the indices are at their maximum value, and so we're done iterating. (This uses the little-known for-else construct; the else block is executed only if the for loop terminates normally.)

    else:
        return

So now we know that index i needs to be incremented:

    indices[i] += 1

Additionally, the indices after i are all at their maximum values, and so need to be reset.

    for j in range(i+1, r):
        indices[j] = indices[j-1] + 1

Now we have the next set of indices, so we yield another combination.

    yield tuple(pool[i] for i in indices)

There are several variations on this approach; in another, instead of stepping backwards through the indices, you step forward, incrementing the first index that has a "gap" between it and the following index, and resetting the lower indices.

Finally, you could also define this recursively, although pragmatically, the recursive definition probably won't be as efficient.

Question 2

Instead of rolling out your own, you could use itertools.combinations().

Question 3

Then you are not doing Apriori.

In Apriori, you never enumerate all subsets of size k, except for k=1.

In any larger size, you construct the combinations according to Apriori-Gen.

That is much more efficient, and actually at least as easy as manually building all combinations.

Here is an example. Assuming the following itemsets were found frequent:

 ABCD
 ABCF
 ABEF
 ABDF
 ACDF
 BCDF

Then apriori will construct only one single candidate (by prefix rule!):

 ABC + D   - ABC + D + F
 ABC + F   /

And then it will next check whether the other subsets were also found frequent, i.e.

 BCDF
 ACDF
 ABDF

Since all of them were in the previous round, this candidate survives and will be tested in the next linear scan over the data set.

Apriori is all about not having to check all subsets of size k, but only those that have a chance to be frequent, given the previous knowledge.

Question 4

Okay here is my rolled own version :

def all_subsets(source,size):
        index = len(source)
        index_sets = [()]
        for sz in xrange(size):
                next_list = []
                for s in index_sets:
                        si = s[len(s)-1] if len(s) > 0 else -1
                        next_list += [s+(i,) for i in xrange(si+1,index)]
                index_sets = next_list
        subsets = []
        for index_set in index_sets:
                rev = [source[i] for i in index_set]
                subsets.append(rev)
        return subsets

Yields:

>>> Apriori.all_subsets(['c','r','i','s'],2)
[['c', 'r'], ['c', 'i'], ['c', 's'], ['r', 'i'], ['r', 's'], ['i', 's']]