String analysis

https://stackoverflow.com/questions/5975840

12-11-2019
|

Question

Given a sequence of operations:

a*b*a*b*a*a*b*a*b

is there a way to get the optimal subdivision to enable reusage of substring.

making

a*b*a*b*a*a*b*a*b => c*a*c, where c = a*b*a*b

and then seeing that

a*b*a*b => d*d, where d = a*b

all in all reducing the 8 initial operations into the 4 described here?

(c = (d = a*b)*d)*a*c

The goal of course is to minimize the number of operations

I'm considering a suffixtree of sorts.

I'm especially interested in linear time heuristics or solutions. The '*' operations are actually matrix multiplications.

Solution

This whole problem is known as "Common Subexpression Elimination" or CSE. It is a slightly smaller version of the problem called "Graph Reduction" faced by the implementer of compilers for functional programming languages. Googling "Common Subexpression elimination algorithm" gives lots of solutions, though none that I can see especially for the constraints given by matrix multiplication.

The pages linked to give a lot of references.

My old answer is below. However, having researched a bit more, the solution is simply building a suffix tree. This can be done in O(N) time (lots of references on the wikipedia page). Having done this, the sub-expressions (c, d etc. in your question) are just nodes in the suffix tree - just pull them out.

However, I think MarcoS is on to something with the suggestion of Longest repeating Substring, as graph reduction precedence might not allow optimisations that can be allowed here.

sketch of algorithm:

optimise(s):
    let sub = longestRepeatingSubstring(s).
    optimisedSub = optimise(sub)
    return s with sub replaced by optimisedSub

Each run of longest repeating substring takes time N. You can probably re-use the suffix tree you build to solve the whole thing in time N.

OTHER TIPS

edit: The orders-of-growth in this answer are needed in addition to the accepted answer in order to run CSE or matrix-chain multiplication

Interestingly, a compression algorithm may be what you want: a compression algorithm seeks to reduce the size of what it's compressing, and if the only way it can do that is substitution, you can trace it and obtain the necessary subcomponents for your algorithm. This may not give nice results though for small inputs.

What subsets of your operations are commutative will be an important consideration in choosing such an algorithm. [edit: OP says no operations are commutative in his/her situation]

We can also define an optimal solution, if we ignore effects such as caching:

input: [some product of matrices to compute]

given that multiplying two NxN matrices is O(N^2.376)
given we can visualize the product as follows:
    [[AxB][BxC][CxD][DxE]...]
we must for example perform O(max(A,B,C)^2.376) or so operations in order to combine
    [AxB][BxC] -> [AxC]

The max(...) is an estimate based on how fast it is to multiply two square matrices;
a better estimate of cost(A,B,C) for multiplying an AxB * BxC matrix can be gotten 
from actually looking at the algorithm, or running benchmarks if you don't know the
algorithm used.

However note that multiplying the same matrix with itself, i.e. calculating
a power, can be much more efficient, and we also need to take that into account.
At worst, it takes log_2(power) multiplies each of O(N^2.376), but this could be
made more efficient by diagonalizing the matrix first.

There is the question about whether a greedy approach is feasible for not: whether one SHOULD compress repeating substrings at each step. This may not be the case, e.g.

aaaaabaab
compressing 'aa' results in ccabcb and compressing 'aab' is now impossible

However I have a hunch that, if we try all orders of compressing substrings, we will probably not run into this issue too often.

Thus having written down what we want (the costs) and considered possibly issues, we already have a brute-force algorithm which can do this, and it will run for very small numbers of matrices:

# pseudocode

def compress(problem, substring)
    x = new Problem(problem)
    x.string.replaceall(substring, newsymbol)
    x.subcomputations += Subcomputation(newsymbol=substring)

def bestCompression(problem)
    candidateCompressions = [compress(problem,substring) for each substring in problem.string]
    # etc., recursively return problem with minimum cost
    # dynamic programming may help make this more efficient, but one must watch
    #   out for the note above, how it may be hard to be greedy

Note: according to another answer by Asgeir, this is known as the Matrix Chain Multiplication optimization problem. Nick Fortescue notes this is also known more generally as http://en.wikipedia.org/wiki/Common_subexpression_elimination -- thus one could find any generic CSE or Matrix-Chain-Multiplication algorithm/library from the literature, and plug in the cost orders-of-magnitude I mentioned earlier (you will need those nomatter which solution you use). Note that the cost of the above calculations (multiplication, exponentiation, etc.) assume that they are being done efficiently with state-of-the-art algorithms; if this is not the case, replace the exponents with appropriate values which correspond to the way the operations will be carried out.

If you want to use the fewest arithmetic operations then you should have a look at matrix chain multiplication which can be reduced to O(n log n)

From the top of the head the problem seems in NP for me. Depending on the substitutions you are doing other substitions will be possible or impossible for example for the string d*e*a*b*c*d*e*a*b*c*d*e*a there are several possibilities.

If you take the longest common string it will be: f = d*e*a*b*c and you could substitute f*f*e*a leaving you with three multiplications in the end and four intermediate ones (total seven).

If you instead substitute the following way: f = d*e*a you get f*b*c*f*b*c*f which you can further substitute using g = f*b*c to g*g*f for a total of six multiplication.

There are other possible substitutions in this problem, but I do not have the time to count them all right now.

I am guessing for a complete minimal substitution it is not only necessary to figure out the longest common substring but also the number of times each substring repeats, which probably means you have to track all substitutions so far and do backtracking. Still it might be faster than the actual multiplications.

Isn't this the Longest repeated substring problem?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow