Finding string repeats in the next 10 characters in Python
15-04-2021
Question
I'm working on a problem where I have to find various string repeats that follow an initial string, say ACTGAC. The data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
Once ACTGAC is found in that string, I need to analyze the next 10 characters for string repeats that follow some rules. I have the rules coded, but can anyone show me how, once I find the string I need, I can make a substring of the next ten characters to analyze? I know that the str.partition function can do that once I find the string, and then [1:10] can get the next ten characters.
Thanks!
Solution
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurrence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (it defaults to zero - the start of the string), and also the end-index (it defaults to the length of the string).
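To see how these defaults differ from the [1:10] slice mentioned in the question, a small check may help. This is only an illustrative sketch that reuses the sequence and motif from above; the variable names first_ten, off_by_one and rest are made up for the example:

```python
data = 'AAACTGACACCATCGATCAGAACCTGA'
head, sep, tail = data.partition('ACTGAC')

# Omitting the start index means "from position 0".
first_ten = tail[:10]       # same as tail[0:10]

# Starting at 1 skips the first character, so only nine come back.
off_by_one = tail[1:10]

# Omitting the end index means "up to the end of the string".
rest = tail[10:]

print(first_ten)    # ACCATCGATC
print(off_by_one)   # CCATCGATC
print(rest)         # AGAACCTGA
```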
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
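One thing to watch for: when the separator is not found, partition returns the whole string as the head and empty strings for the separator and tail, so the slice silently gives ''. A minimal sketch of a wrapper that makes this case explicit might look like the following; the function name window_after and its width parameter are hypothetical, not part of any library:

```python
def window_after(data, motif, width=10):
    """Return up to `width` characters after the first `motif`, or None if absent."""
    head, sep, tail = data.partition(motif)
    if not sep:
        # partition returns (data, '', '') when the separator is missing
        return None
    # The slice may be shorter than `width` if the motif sits near the end.
    return tail[:width]


print(window_after('AAACTGACACCATCGATCAGAACCTGA', 'ACTGAC'))  # ACCATCGATC
print(window_after('AAAA', 'ACTGAC'))                         # None
```

Returning None rather than '' lets the caller tell "motif missing" apart from "motif found at the very end of the sequence".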
OTHER TIPS
Based on marcog's answer in "Find all occurrences of a substring in Python", I propose the following, which uses a lookahead so that overlapping occurrences are found too:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']
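Note that the last two results are shorter than ten characters because those matches sit near the end of the sequence. If only full-length windows are useful for the repeat rules, they can be filtered out. The sketch below wraps the same lookahead idea in a function (windows_after_all is a hypothetical name) and adds re.escape, which is an assumption on my part that the motif should always be matched literally:

```python
import re


def windows_after_all(data, motif, width=10):
    """Collect up to `width` characters after every (possibly overlapping) motif."""
    # The lookahead (?=...) matches without consuming characters,
    # so overlapping occurrences are each reported.
    pattern = re.compile('(?=%s)' % re.escape(motif))
    windows = []
    for m in pattern.finditer(data):
        start = m.start() + len(motif)
        windows.append(data[start:start + width])
    return windows


data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
all_windows = windows_after_all(data, 'ACTGAC')
# Keep only the windows that really contain ten characters.
full_windows = [w for w in all_windows if len(w) == 10]

print(all_windows)   # ['ACCATCGATC', 'TGACAAA', 'AAA']
print(full_windows)  # ['ACCATCGATC']
```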