Question

I am looking for a high performance algorithm to check whether I can reconstruct a given string using a given set of substrings. More details:

Let's say our strings are over the alphabet $\Sigma$.

Inputs:

  • A string $S \in \Sigma^*$
  • A finite set of strings $A = \{a_1, a_2, ..., a_n\} \subset \Sigma^*$.

Output:

  • Whether $\exists m: \exists b_1, b_2, \ldots, b_m \in A: b_1+b_2+\cdots+b_m = S$

where $+$ is string concatenation.

For example, if $S={}$"$abcd$" and $A = \{$"$ab$"$,$"$cd$"$,$"$ac$"$\}$, the answer is True. For the purposes of this question, assume that strings in $A$ can be reused multiple times if necessary.

Was it helpful?

Solution

If reusing strings in $A$ is allowed you can solve it with dynamic programming: First, store strings in $A$ in a prefix tree (just a reversed suffix tree link), and recursively detrmine if $S[i:end]$ can be constructed from $A$: Let $\ell$ be the length of the string, and wlog assume $S$ is appended by $\$$ and $A\gets A\cup \{\$\}$. Moroever, let $m[i]\in\{0,1\}$ indicate if $S[i:\ell]$ can be constructed by elements in $A$. Initialize $m[i]\gets \infty$ for $i=0,\dots, \ell-1$ and $m[\ell]=1$ (because $S[\ell]=\$$). Now, suppose $f(i)$ is the function that recursively computes $m[i]$ if it hasn't been computed already. When $f(i)$ is called, start at the root of the suffix tree of $A$, and traverse $S[i], S[i+1], \dots$ on the suffix tree until index $j$ , such that the corresponding node is in $A$, and recursively call $f(j)$. If returns $1$, set $m[i]=1$ and return $1$. Otherwise, repeat the process with traversal of $S[j+1], S[j+2], \dots$ until another member in $A$, or a leaf is reached. If there was no success, set $m[i]=0$ and return $0$. To decide the problem, call $f(1)$. The worst case complexity is the complexity is $\ell d$, with $d$ being depth of the suffix tree (longest sequence in $A$), plus the cost of construction of the prefix tree.

The worst-case scenario in many cases will be avoided because the recursive call search stops as soon as one solution is found. So in practice, the $\ell d$ part might be much less. The construction of the suffix tree can be expensive depending on $A$. But if $A$ is used for several strings $S$, the cost is amortized. If $A$ is really big and $S$ is short, it might be better to just sort elements of $A$ and do a binary search instead of traversal.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top