Question

I am reading about Suffix Arrays and the code to build one is simple. But all the resources I have found usually use a trivial example text, which usually is banana, to explain the concept.
So although the example text is trivial and the suffix array is presented (a,ana,anana,banana,na,nana) as I understand this can be applied to any kind of text.
But I don't understand how, as a text has spaces, new line characters, punctuantion marks etc.
So how does this apply to any kind of text?

Était-ce utile?

La solution

For a longer string your suffix array would look like so:

[01] banana split, yum!
[02] anana split, yum!
[03] nana split, yum!
[04] ana split, yum!
[05] na split, yum!
[06] a split, yum!
[07]  split, yum!
[08] split, yum!
[09] plit, yum!
[10] lit, yum!
[11] it, yum!
[12] t, yum!
[13] , yum!
[14]  yum!
[15] yum!
[16] um!
[17] m!
[18] !

You can then sort it in alphabetical order to find the longest repeated substring, a common use for suffix arrays.

I also remember doing something similar to find repeating patterns of words in a long text, and I used the space character as a separator, instead of going through each character:

[01] if it is true it is true
[02] it is true it is true
[03] is true it is true
[04] true it is true
[05] it is true
[06] is true
[07] true

Although this not a suffix array, once sorted alphabetically, one can find repeated patterns of words:

[01] if it is true it is true
[06] is true
[03] is true it is true
[05] it is true
[02] it is true it is true
[07] true
[04] true it is true

By comparing each line to the line above it, as long as the characters match, we find that 'is true' and 'it is true' are repeated patterns of words.

This method is inspired by a common DNA research problem called the longest repeated substring problem: http://en.wikipedia.org/wiki/Longest_repeated_substring_problem

Of course, it does arise in other fields than genetic science.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top