Pergunta

I'm implementing algorithms in http://portal.acm.org/citation.cfm?id=1813708 that utilize suffix arrays to find longest common substrings. The algorithms involve constructing a suffix array for a string that is the concatenation of a set of given strings with string separators called sentinels. So for example, if we are given the strings a, b and c, a new string d is created which is a$1b$2c$3 where $1, $2, $3 are the sentinel characters marking the ends of each string. The sentinel characters must be unique and lexicographically less than all other characters in a, b and c.

My question revolves around the representation of the sentinel characters in Python. If a, b and c are ASCII strings, I'm thinking I might need to convert those strings to UTF-8 and shift their range from 0-127 to a higher range such that there are characters available that are lexicographically less than those in the strings. If that seems reasonable, what is the most efficient mechanism for remapping the characters in Python such that their range is N-127+N where N is the number of strings provided?

Nenhuma solução correta

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top