Looking for elegant glob-like DNA string expansion

https://stackoverflow.com/questions/1098461

11-09-2019
|

Question

I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.

The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.

For example, say I have the string:

ATMM

I would like to take this string as input and output the four possible matching strings:

ATAA ATAC ATCA ATCC

Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.

Thank you for any advice.

Edit, thanks cortex for the product operator. This is my solution:

Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.

import sys
from itertools import product

baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                  Y=['C','T'],K=['G','T'],V=['A','C','G'],
                  H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])
def glob(str):
    strings = [str]

    ## this loop visits very possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base 
                                 else [x] for x in string]))
    return strings

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x

Solution

Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):

from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))

Full solution with input handling:

import sys
from itertools import product

base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],

              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }

def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))

for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)

OTHER TIPS

You probably could do something like this in python using the yield operator

def glob(str):
      if str=='':           
          yield ''
          return      

      if str[0]!='M':
          for tail in glob(str[1:]): 
              yield str[0] + tail                  
      else:
         for c in ['A','G','C','T']:
             for tail in glob(str[1:]):
                 yield c + tail                 
      return

EDIT: As correctly pointed out I was making a few mistakes. Here is a version which I tried out and works.

This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.

I believe what you're looking for is "how to generate permutations".

You could for example do this recursively. Pseudo-code:

printSequences(sequence s)
  switch "first special character in sequence"
    case ...
    case M:
      s1 = s, but first M replaced with A
      printSequences(s1)
      s2 = s, but first M replaced with C
      printSequences(s2)
    case none:
      print s;

Regexps match strings, they're not intended to be turned into every string they might match.

Also, you're looking at a lot of strings being output from this - for instance:

MMMMMMMMMMMMMMMM (16 M's)

produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.

Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.

Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow