Given a list of strings, find every pair $(x,y)$ where $x$ is a subsequence of $y$. Possible to do better than $O(n^2)$?

https://cs.stackexchange.com/questions/121670

29-09-2020
|

Question

Consider the following algorithmic problem: Given a list of strings $L = [s_1, s_2, \dots, s_n]$, we want to know all pairs $(x,y)$ where $x$ is a subsequence of $y$. We can assume all strings are of length at maximum $m$, where $m << n$ and are all over a finite alphabet $\Sigma$ with $|\Sigma| << n$. We may also assume that the number of pairs $(x,y)$ where $x$ is a subsequence of $y$ is much smaller than $n$.

A trivial algorithm would be this:

1. foreach x in L:
2.   foreach y in L:
3.      if x is subsequence of y:
4.         OUTPUT x,y

However, this has complexity $O(n^2 \cdot m)$ - I am curious to know whether there is a faster algorithm (Faster given that the number of pairs $(x,y)$ is much smaller than $n$, so for example an algorithm with complexity depending on the number of output pairs).

Note that this question is a follow up to this question, which is about the same problem but for substrings (not subsequences). There, the Aho-Corasick algorithm solved my problem perfectly - is there maybe somethign like this but for subsequences?

Solution

No, it is not possible to do better unless the Strong Exponential Time Hypothesis (SETH) fails. If we could solve this problem substantially faster than $O(n^2)$ we would immediately obtain a much faster algorithm for solving the NP-complete problem Satisfiability. This is true even for $m$ slightly more than $\log(n)$ and the case in which we want to decide whether such a pair $(x,y)$ exists at all.

See, e.g., these lecture notes under section 3 "Tight Lower Bounds for Orthogonal Vectors". The proof is analogous to the proof of Theorem 2 in these lecture notes.

First, we consider the more general problem of given two sets of strings $X,Y$, finding whether some string in $X$ is a subsequence of a string in $Y$.

Given a SAT formula, we split its $n$ variables into two equal sets of $n/2$ variables. In $\Sigma$ we take a character corresponding to every clause. In $X$ we add a string for every possible assignment to the first half of the variables, with a character corresponding to every clause not satisfied by those variables. Meanwhile, in $Y$, we add a string for every assignment to the second half of the variables, with a character for every clause that is satisfied by those variables. Clearly, the formula is satisfiable if and only if some string in $X$ is a subsequence of some string in $Y$.

If this problem can be solved substantially faster than $O(n^2)$, then this gives a substantially faster algorithm for Satisfiability than $2^n$. Suppose the problem could be solved in $O(n^{1.99})$ time, then Satisfiability could be solved in $(2^{n/2})^{1.99}=O(2^{0.996n})$ which contradicts SETH.

In your problem, there is only a single set of strings, all of which may be a subsequence of each other. This is however not a problem, as we can simply modify the strings in our instance such that no string $Y$ is a subsequence of any other string (for instance by padding all strings in $Y$ to have the same length), and similarly padding every string in $X$ to the same length as other strings in $X$ (but substantially shorter than strings in $Y$).

This can probably also be done with a constant-size (likely even binary) alphabet but this requires more clever encoding.

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange