Question

Let's say I have a string of DNA 'GAAGGAGCGGCGCCCAAGCTGAGATAGCGGCTAGAGGCGGGTAACCGGCA'

Consider the first 5 letters: GAAGG

I want to replace each overlapping bi-gram 'GA','AA','AG','GG' with a number that corresponds to its likelihood of occurrence, and sum those numbers. Say 'GA' = 1, 'AA' = 2, 'AG' = .7, 'GG' = .5. Then for GAAGG I would have sumAnswer = 1 + 2 + .7 + .5.

So in pseudocode, I want to:
- iterate over each overlapping bi-gram in my DNA string
- find the value corresponding to each bi-gram
- sum the values as I go

I'm not entirely sure how to iterate over each pair. I thought a for loop would work, but that doesn't account for the overlap: it prints every non-overlapping pair (GAGC = GA,GC), not every overlapping pair (GAGC = GA,AG,GC)

for i in range(0, len(input), 2):
    print input[i:i+2]

Any tips?


Solution 2

Just leave out the ,2 in your range and stop one position short of the end of the string, so that the last slice still has two characters:

for i in range(0, len(input)-1):
    print input[i:i+2]

The ,2 tells Python to step forward two on every iteration. By leaving it out, you default to stepping forward one.
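To go from printing to the sum asked for in the question, the same loop can look each slice up in a dictionary. This is only a sketch in Python 3 syntax: the weights are the four example values from the question, and the function name sum_bigrams is made up for illustration. Any bigram missing from the dictionary would raise a KeyError here.

```python
# Example weights copied from the question; a real table would cover all 16 bigrams.
likelihood = {'GA': 1, 'AA': 2, 'AG': 0.7, 'GG': 0.5}

def sum_bigrams(dna, weights):
    """Sum the weight of every overlapping two-letter slice of dna."""
    total = 0
    # Stop at len(dna) - 1 so the final slice still contains two letters.
    for i in range(len(dna) - 1):
        total += weights[dna[i:i + 2]]
    return total

total = sum_bigrams('GAAGG', likelihood)
print(total)
```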

OTHER TIPS

Forget playing with range and index arithmetic; iterating over pairs is exactly what zip is for:

>>> dna = 'GAAGG'
>>> for bigram in zip(dna, dna[1:]):
...    print(bigram)
... 
('G', 'A')
('A', 'A')
('A', 'G')
('G', 'G')

If you have the corresponding likelihoods stored in a dictionary, like so:

likelihood = {
   'GA': 1, 
   'AA': 2,
   'AG': .7, 
   'GG': .5
}

then you can sum them quite easily with the unsurprisingly named sum:

>>> sum(likelihood[''.join(bigram)] for bigram in zip(dna,dna[1:]))
4.2
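Note that the full DNA string from the question contains bigrams such as 'GC' that are not in the four-entry dictionary, so the lookup above would fail with a KeyError on it. One possible way around that, sketched below, is dict.get with a fallback value. Treating unknown bigrams as 0.0 is purely an assumption made for this example, and bigram_score is a hypothetical helper name; with a complete 16-entry table the fallback would never be used.

```python
# The four example weights from the question (deliberately incomplete).
likelihood = {'GA': 1, 'AA': 2, 'AG': 0.7, 'GG': 0.5}

def bigram_score(dna, weights, default=0.0):
    """Sum bigram weights, using `default` for bigrams missing from `weights`.

    The default of 0.0 is an assumption for illustration, not part of the
    original answer.
    """
    return sum(weights.get(a + b, default) for a, b in zip(dna, dna[1:]))

full_dna = 'GAAGGAGCGGCGCCCAAGCTGAGATAGCGGCTAGAGGCGGGTAACCGGCA'
print(bigram_score('GAAGG', likelihood))   # same lookup as above, no missing keys
print(bigram_score(full_dna, likelihood))  # unknown bigrams contribute 0.0
```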

I'd use the pairwise function from more_itertools.
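If installing a third-party package is not an option, pairwise can be written in a few lines with the standard library. The sketch below follows the well-known recipe built on itertools.tee (Python 3.10 and later also ship an itertools.pairwise that behaves the same way); it is written in Python 3 syntax and is not copied from more_itertools itself.

```python
from itertools import tee

def pairwise(iterable):
    """Return an iterator over overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second copy by one item
    return zip(a, b)

bigrams = [''.join(pair) for pair in pairwise('GAAGG')]
print(bigrams)  # ['GA', 'AA', 'AG', 'GG']
```

Unlike slicing, this also works on iterables that cannot be indexed, such as a file read character by character.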

The other answer should do it.

If you really want an iterator:

# define the iterator
def dnaiter(input): 
    for i in xrange(0, len(input) - 1): 
        yield input[i:i+2]

# then use the iterator
for s in dnaiter(input): 
    print s

You'll only ever need this if you have a really long sequence that you're iterating over, though.
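For readers on Python 3, where xrange and the print statement no longer exist, the same generator might look like the sketch below. The name dna_bigrams is made up for this example. Calling next on it shows the lazy behaviour: each bigram is produced only when it is asked for.

```python
def dna_bigrams(seq):
    """Lazily yield each overlapping two-letter slice of seq."""
    for i in range(len(seq) - 1):
        yield seq[i:i + 2]

gen = dna_bigrams('GAAGG')
first = next(gen)  # only the first slice has been computed so far
rest = list(gen)   # consume the remaining bigrams
print(first, rest)
```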

I wrote a small utility library that has a function named paired which does almost exactly what you want. The library is available here.

import iterlib

sequence = 'GAAGG'
bigrams = [''.join(bigram_tuple) for bigram_tuple in iterlib.paired(sequence)]

print(bigrams)
Licensed under: CC-BY-SA with attribution