Number of distinct substrings in a string

https://cs.stackexchange.com/questions/13140

16-10-2019

Question

From what I have come to understand, the best way to count the distinct substrings of a string $w$ is to use its suffix array $S$ and its LCP (longest common prefix) array $L$.

The answer can be obtained by

$$ \sum_{i=1}^{|w|} \left( |S[i]| -L[i-1] \right).$$

What I don't understand is how and why this works.

I would be very grateful if someone explained this.

Answer

Instead of a formal proof, I want to give some intuition behind the formula. The suffix array contains all the suffixes of the string $w$, and a substring is nothing other than a prefix of a suffix. So if you compute $\sum_i |S[i]|$, you count every substring, but you count it once for each suffix it is a prefix of, so you overcount the number of distinct substrings.
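To make the overcounting concrete, here is a minimal Python sketch. The helper names `suffixes` and `prefixes_of_suffixes` and the sample string `"xyxyz"` are chosen only for illustration. It lists every prefix of every suffix and compares that total with the number of distinct substrings.

```python
def suffixes(w):
    # All non-empty suffixes of w, in order of starting position.
    return [w[i:] for i in range(len(w))]

def prefixes_of_suffixes(w):
    # Every non-empty prefix of every suffix: each substring occurrence once.
    return [s[:k] for s in suffixes(w) for k in range(1, len(s) + 1)]

w = "xyxyz"  # illustrative input
all_prefixes = prefixes_of_suffixes(w)
print(len(all_prefixes))       # sum of the suffix lengths
print(len(set(all_prefixes)))  # number of distinct substrings
```

For this input the first count is 15 and the second is 12, so three substring occurrences were counted more than once.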

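Let's take a closer look. Assume $S[i-1]=xyxyz$ and $S[i]=xyz$, which are adjacent in sorted order. With the counting method above, the entry $S[i-1]$ counted the substrings $x,xy,xyx,xyxy,xyxyz$ and the entry $S[i]$ counted $x,xy,xyz$. Since the two entries share a prefix of length 2, we have counted $x$ and $xy$ twice. The length of this longest common prefix is exactly what the LCP array stores for the adjacent pair (the term $L[i-1]$ in the formula), so we subtract it to compensate for the overcounting. Because the suffix array is sorted lexicographically, any prefix that $S[i]$ shares with an earlier suffix is also shared with $S[i-1]$, so comparing adjacent entries is enough.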
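The whole argument can be checked with a short Python sketch. It is only an illustration under stated assumptions: the suffix array is built naively by sorting the suffixes themselves (not with an efficient construction algorithm), the first suffix is taken to share nothing with a predecessor (that is, $L[0]=0$), and the function names are hypothetical.

```python
def suffix_array(w):
    # Naive construction: sort the suffixes themselves (fine for small inputs).
    return sorted(w[i:] for i in range(len(w)))

def lcp(a, b):
    # Length of the longest common prefix of a and b.
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def count_distinct_substrings(w):
    S = suffix_array(w)
    total = 0
    for i, suffix in enumerate(S):
        # Prefixes shared with the previous sorted suffix were already counted.
        shared = lcp(S[i - 1], suffix) if i > 0 else 0
        total += len(suffix) - shared
    return total

def brute_force(w):
    # Reference answer: collect every substring in a set.
    return len({w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)})

print(count_distinct_substrings("xyxyz"), brute_force("xyxyz"))
```

Both functions should agree on any input. The formula-based count needs only adjacent comparisons in the sorted order, whereas the brute-force version stores every substring explicitly.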

Licensed under: CC-BY-SA with attribution

Not affiliated with cs.stackexchange