Number of distinct substrings in a string
16-10-2019
Question
From what I have come to understand, the best way to count the distinct substrings of a string $w$ is to use its suffix array $S$ and its LCP array (Longest Common Prefix) $L$.
The answer can be obtained by
$$ \sum_{i=1}^{|w|} \left( |S[i]| -L[i-1] \right).$$
What I don't get is how and why this works.
I would be very grateful if someone explained this.
Solution
Instead of a formal proof, I want to give some intuition behind the formula. The suffix array contains all the suffixes of the string $w$ in sorted order. A substring is nothing other than a prefix of a suffix. So if you compute $\sum_i |S[i]|$, you count every substring occurrence, but of course you overcount the number of distinct substrings, because a substring that occurs several times is a prefix of several suffixes.
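Let's have a closer look. Assume that $S[i-1]=xyz$ and $S[i]=xyxyz$. With the counting method above, the entry $S[i-1]$ contributes the substrings $x, xy$ and $xyz$, and the entry $S[i]$ contributes $x, xy, xyx, xyxy, xyxyz$. Since the prefixes of length 2 of both entries are the same, we have double counted $x$ and $xy$. The length of the longest common prefix of $S[i-1]$ and $S[i]$ is exactly what the LCP array stores (the term $L[i-1]$ in the formula, taking $L[0]=0$ because the first suffix has no predecessor), so we subtract it to compensate for the overcounting. Comparing only with the immediate predecessor is enough because the suffixes are sorted: any prefix that $S[i]$ shares with an earlier entry is also shared with $S[i-1]$.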
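To make the counting concrete, here is a minimal Python sketch of the idea. It is an illustration under simplifying assumptions, not an efficient implementation: the suffix array is built by naively sorting all suffixes, and each LCP value is computed by direct comparison with the previous suffix. The function names `count_distinct_substrings` and `brute_force` are made up for this example.

```python
def count_distinct_substrings(w: str) -> int:
    """Sum of (suffix length - LCP with the previous sorted suffix)."""
    # Naive suffix array: materialize every suffix and sort them.
    # Fine for a demonstration, too slow and memory hungry for long strings.
    suffixes = sorted(w[i:] for i in range(len(w)))

    total = 0
    prev = ""  # the first suffix has no predecessor, so its LCP is 0
    for s in suffixes:
        # Length of the longest common prefix of s and its predecessor.
        lcp = 0
        while lcp < min(len(s), len(prev)) and s[lcp] == prev[lcp]:
            lcp += 1
        # s has len(s) prefixes; the first lcp of them were already counted.
        total += len(s) - lcp
        prev = s
    return total


def brute_force(w: str) -> int:
    """Reference answer: collect every substring in a set."""
    return len({w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)})


if __name__ == "__main__":
    # Sorted suffixes of "xyxyz": xyxyz, xyz, yxyz, yz, z
    # Lengths 5 + 3 + 4 + 2 + 1 = 15, LCPs 0 + 2 + 0 + 1 + 0 = 3.
    print(count_distinct_substrings("xyxyz"))  # 12
    print(brute_force("xyxyz"))                # 12
```

For $w = xyxyz$ the five suffixes have total length 15 and the LCP values sum to 3, which gives 12 distinct substrings, matching the brute-force count. A real implementation would replace the sorting and the comparison loop with a proper suffix array and LCP array construction.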