Pergunta

Consider the following problem. We are given a set of patterns (strings) $\Pi = \{\pi_i\}$, a text $s$, and a window length $k$. We want a list of all shifts $0 \le i \le |s|-k$ such that every pattern in $\Pi$ is contained in the substring $s[i:i+k]$.

Can this be solved in linear- or near-linear-time? It can of course be solved in quadratic time $O(|s| |\Pi| + \sum |\pi_i|)$ using KMP or Aho-Corasick plus post-processing.

The motivation for this problem is finding matches for a topic (represented by the set of patterns) in a text. In that context it actually makes sense to require the matches to be non-overlapping so I'm also interested in that case, but it might be easier to start with the relaxed version.

I would also be interested in generalizations of the problem that allow for approximate matches of some kind, eg, only requiring a threshold on the size of the subset of matching patterns, allowing matches within given edit distance, or something using hidden Markov models or general probabilistic graphical models. I would be surprised if any such generalization can be solved in subquadratic time though.

Nenhuma solução correta

Licenciado em: CC-BY-SA com atribuição
Não afiliado a cs.stackexchange
scroll top