Pergunta

There are 4 DNA nucleotides, each represented by one of the letters A, C, G, or T.

Assume that they can be arranged in a string e.g. "GATTACA...". I want to figure out the minimum number of nucleotides I would have to encounter before I'm pigeonholed into seeing a duplicate substring of length 3.

Observe that a string of 4 nucleotides could produce a duplicate substring of length 3 (although I'm not forced into a pigeonhole since there are other strings of length 4 that don't produce a duplicate).

AAAA

Substrings [0-3) and [1-4) are both "AAA", so a duplicate occurrence of a length 3 substring has appeared.

I originally tried concatenating together every combination of 3 nucleotides.

[AAA][CAA][GAA][TAA][ACA][CCA]
AAACAAGAATAAACACCA...

But this algorithm produces a duplicate rather quickly: substrings [0-3) and [10-13).

I know there are 4^3 = 64 different substrings of length 3, so an upper bound to force pigeonholing is a string of length (64 * 3) + 1 = 195 -> one of the 64 substrings must appear twice. I've convinced myself that this can be bounded tighter to length 67 since this could contain 64 different substrings that start at indexes 0-63 and one duplicate that starts at 64. The interleaving of the substrings with each other is what makes them hard to analyze. How much tighter can this bound go, and what is an algorithm for producing a sequence that forces a pigeonhole?

Nenhuma solução correta

Licenciado em: CC-BY-SA com atribuição
Não afiliado a cs.stackexchange
scroll top