Time complexity for calculating the $LPS$ array (failure function) in KMP algorithm preprocessing

https://cs.stackexchange.com/questions/118277

28-09-2020
|

Pergunta

I'm studying the Knuth Morris Pratt pattern searching algorithm from here. I want to know the order of computing the $lps$ array in this algorithm - that is the array of longest proper prefixes that are also suffix. I studied the Wikipedia page for this algorithm and it says this part is of $O(m)$ - $m$ is the length of the pattern being searched. I just don't understand why.

Can anyone help?

Solução

I've copied below the wikipedia algorithm for computing the LPS array (which is usually called the failure function because in the main KMP algorithm you refer to this table whenver there's a failed match). They denote $T[]$ where you use $LPS[]$

algorithm kmp_table: input: an array of characters, W (the word to be analyzed) an array of integers, T (the table to be filled) output: nothing (but during operation, it populates the table)

define variables:
    an integer, pos ← 1 (the current position we are computing in T)
    an integer, cnd ← 0 (the zero-based index in W of the next character of the current candidate substring)

let T[0] ← -1

while pos < length(W) do
    if W[pos] = W[cnd] then
        let T[pos] ← T[cnd]
    else
        let T[pos] ← cnd
        let cnd ← T[cnd] (to increase performance)
        while cnd >= 0 and W[pos] <> W[cnd] do
            let cnd ← T[cnd]
    let pos ← pos + 1, cnd ← cnd + 1

let T[pos] ← cnd (only need when all word occurrences searched)

The running time will be dictated by how many times "cnd" gets assigned to. The while loop inside the for loop ($pos$) might make you think the algorithm is $O(m^2)$ but as you learned it is $O(m)$.

Inspect how $cnd$ may change during one step of the for loop; it can either increase by +1, or it can decrease inside the while loop. And when it decreases, it decreases by at least 1, since $T[]$ satisfies $T[x]<x$ for all $x$ ("proper" in the definition of LPS). Since $cnd$ can increase only by +1 and there are $m$ steps in the for loop, this means the overall increase is bounded by $+m$. You might visualize each change to $cnd$ as a stair of integer height, with upstairs and downstairs representing increases and decreases to $cnd$; the upstairs are all height 1, and the downstairs are each height at least 1. Because the overall height never falls below 0 (that is the termination condition for the while loop), you can bound the total number of stairs by $2m$; there are at most $m$ "up" stairs and there can't be more "down" stairs than up stairs since their height is $\geq 1$.

To put another way, this is basically a twist on the saying "what goes up most come down"; here we have "what comes down must have gone up before"

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange