Finding various string repeats in Python in the next 10 characters

https://stackoverflow.com/questions/8813264

15-04-2021

Question

I'm working on a problem where I have to find various string repeats that follow an initial string. Say the initial string is ACTGAC, and the data file has sequences that look like:

AAACTGACACCATCGATCAGAACCTGA

Once ACTGAC is found in that string, I need to analyze the next 10 characters for string repeats, which follow some rules. I have the rules coded, but can anyone show me how, once I find the string I need, to make a substring of the next ten characters to analyze? I know that the str.partition function can split the string once the match is found, and then [1:10] can get the next ten characters.

Thanks!

Solution

You almost have it already (but note that indexes start counting from zero in Python).

The partition method splits a string into head, separator, and tail, based on the first occurrence of the separator.

So you just need to take a slice of the first ten characters of the tail:

>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'

Python allows you to leave out the start index in slices (it defaults to zero, the start of the string), and also the end index (it defaults to the length of the string).
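To make the slice defaults concrete, here is a small sketch using the tail from the example above. The variable names are only for illustration. It also shows why the [1:10] from the question would come up one character short.

```python
# The tail produced by data.partition('ACTGAC') in the example above.
tail = 'ACCATCGATCAGAACCTGA'

first_ten = tail[:10]        # start omitted: same as tail[0:10]
rest = tail[10:]             # end omitted: same as tail[10:len(tail)]
question_slice = tail[1:10]  # skips index 0, so only nine characters

# Slicing past the end of a string does not raise; it just returns what is there.
short = 'ACG'[:10]

print(first_ten)            # ACCATCGATC
print(rest)                 # AGAACCTGA
print(len(question_slice))  # 9
print(short)                # ACG
```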

Note that you could also do the whole operation in one line, like this:

>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
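One thing to watch for: if the separator is not found, partition returns the whole string as the head and empty strings for the separator and tail, so the one-liner silently gives ''. A minimal sketch of a wrapper that makes that case explicit follows. The function name next_chars_after and its width parameter are made up for this example.

```python
def next_chars_after(data, marker, width=10):
    """Return up to `width` characters following the first `marker`.

    Returns None when `marker` does not occur in `data`.
    (Hypothetical helper, not part of the original answer.)
    """
    head, sep, tail = data.partition(marker)
    if not sep:
        # partition gives ('<data>', '', '') when the marker is missing
        return None
    return tail[:width]


data = 'AAACTGACACCATCGATCAGAACCTGA'
print(next_chars_after(data, 'ACTGAC'))  # ACCATCGATC
print(next_chars_after(data, 'GGGGGG'))  # None
```

Returning None rather than '' lets the caller tell "marker not found" apart from "marker found at the very end of the string".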

OTHER TIPS

So, based on marcog's answer in Find all occurrences of a substring in Python, I propose:

>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']
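The same idea can be wrapped in a function, sketched below. The name windows_after_all and the width parameter are assumptions for this example. The lookahead (?=...) matches without consuming characters, which is why overlapping occurrences (such as ACTGACTGAC) are all reported. re.escape is added here as a precaution in case the marker ever contains regex metacharacters; it makes no difference for plain DNA letters.

```python
import re


def windows_after_all(data, marker, width=10):
    """Return the `width` characters after every occurrence of `marker`,
    including overlapping occurrences. Windows near the end of `data`
    may be shorter than `width`. (Hypothetical helper.)
    """
    # Zero-width lookahead: each match has length 0 and starts where marker starts.
    pattern = '(?=%s)' % re.escape(marker)
    size = len(marker)
    return [
        data[m.start() + size : m.start() + size + width]
        for m in re.finditer(pattern, data)
    ]


data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
windows = windows_after_all(data, 'ACTGAC')
print(windows)  # ['ACCATCGATC', 'TGACAAA', 'AAA']

# If the rules need a full ten characters, drop the truncated windows:
full = [w for w in windows if len(w) == 10]
print(full)     # ['ACCATCGATC']
```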

Licensed under: CC-BY-SA with attribution
