Are there any cleverly efficient algorithms to perform a calculation over the space of partitionings of a string?

https://stackoverflow.com/questions/1223007

11-07-2019
|

Question

I'm working on a statistical project that involves iterating over every possible way to partition a collection of strings and running a simple calculation on each. Specifically, each possible substring has a probability associated with it, and I'm trying to get the sum across all partitions of the product of the substring probability in the partition.

For example, if the string is 'abc', then there would be probabilities for 'a', 'b', 'c', 'ab, 'bc' and 'abc'. There are four possible partitionings of the string: 'abc', 'ab|c', 'a|bc' and 'a|b|c'. The algorithm needs to find the product of the component probabilities for each partitioning, then sum the four resultant numbers.

Currently, I've written a python iterator that uses binary representations of integers for the partitions (eg 00, 01, 10, 11 for the example above) and simply runs through the integers. Unfortunately, this is immensely slow for strings longer than 20 or so characters.

Can anybody think of a clever way to perform this operation without simply running through every partition one at a time? I've been stuck on this for days now.

In response to some comments here is some more information:
The string can be just about anything, eg "foobar(foo2)" -- our alphabet is lowercase alphanumeric plus all three type of braces ("(","[","{"), hyphens and spaces.
The goal is to get the likelihood of the string given individual 'word' likelihoods. So L(S='abc')=P('abc') + P('ab')P('c') + P('a')P('bc') + P('a')P('b')P('c') (Here "P('abc')" indicates the probability of the 'word' 'abc', while "L(S='abc')" is the statistical likelihood of observing the string 'abc').

Solution

A Dynamic Programming solution (if I understood the question right):

def dynProgSolution(text, probs):
  probUpTo = [1]
  for i in range(1, len(text)+1):
    cur = sum(v*probs[text[k:i]] for k, v in enumerate(probUpTo))
    probUpTo.append(cur)
  return probUpTo[-1]

print dynProgSolution(
  'abc',
  {'a': 0.1, 'b': 0.2, 'c': 0.3,
   'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
  )

The complexity is O(N²) so it will easily solve the problem for N=20.

How why does this work:

Everything you will multiply by probs['a']*probs['b'] you will also multiply by probs['ab']
Thanks to the Distributive Property of multiplication and addition, you can sum those two together and multiply this single sum by all of its continuations.
For every possible last substring, it adds the sum of all splits ending with that by adding its probability multiplied by the sum of all probabilities of previous paths. (alternative phrasing would be appreciated. my python is better than my english..)

OTHER TIPS

First, profile to find the bottleneck.

If the bottleneck is simply the massive number of possible partitions, I recommend parallelization, possibly via multiprocessing. If that's still not enough, you might look into a Beowulf cluster.

If the bottleneck is just that the calculation is slow, try shelling out to C. It's pretty easy to do via ctypes.

Also, I'm not really sure how you're storing the partitions, but you could probably squash memory consumption a pretty good bit by using one string and a suffix array. If your bottleneck is swapping and/or cache misses, that might be a big win.

Your substrings are going to be reused over and over again by the longer strings, so caching the values using a memoizing technique seems like an obvious thing to try. This is just a time-space trade off. The simplest implementation is to use a dictionary to cache values as you calculate them. Do a dictionary lookup for every string calculation; if it's not in the dictionary, calculate and add it. Subsequent calls will make use of the pre-computed value. If the dictionary lookup is faster than the calculation, you're in luck.

I realise you are using Python, but... as a side note that may be of interest, if you do this in Perl, you don't even have to write any code; the built in Memoize module will do the caching for you!

You may get a minor reduction of the amount of computation by a small refactoring based on associative properties of arithmetic (and string concatenation) though I'm not sure it will be a life-changer. The core idea would be as follows:

consider a longish string e.g. 'abcdefghik', 10-long, for definiteness w/o loss of generality. In a naive approach you'd be multiplying p(a) by the many partitions of the 9-tail, p(ab) by the many partitions of the 8-tail, etc; in particular p(a) and p(b) will be multiplying exactly the same partitions of the 8-tail (all of them) as p(ab) will -- 3 multiplications and two sums among them. So factor that out:

(p(ab) + p(a) * p(b)) * (partitions of the 8-tail)

and we're down to 2 multiplications and 1 sum for this part, having saved 1 product and 1 sum. to cover all partitions with a split point just right of 'b'. When it comes to partitions with a split just right of 'c',

(p(abc) + p(ab) * p(c) + p(a) * (p(b)*p(c)+p(bc)) * (partitions of the 7-tail)

the savings mount, partly thanks to the internal refactoring -- though of course one must be careful about double-counting. I'm thinking that this approach may be generalized -- start with the midpoint and consider all partitions that have a split there, separately (and recursively) for the left and right part, multiplying and summing; then add all partitions that DON'T have a split there, e.g. in the example, the halves being 'abcde' on the left and 'fghik' on the right, the second part is about all partitions where 'ef' are together rather than apart -- so "collapse" all probabilities by considering that 'ef' as a new 'superletter' X, and you're left with a string one shorter, 'abcdXghik' (of course the probabilities for the substrings of THAT map directly to the originals, e.g. the p(cdXg) in the new string is just exactly the p(cdefg) in the original).

You should look into the itertools module. It can create a generator for you that is very fast. Given your input string, it will provide you with all possible permutations. Depending on what you need, there is also a combinations() generator. I'm not quite sure if you're looking at "b|ca" when you're looking at "abc," but either way, this module may prove useful to you.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow