Generating random sequences of DNA

Question 1

You return too quickly:

from random import choice
def String(length):

   DNA=""
   for count in range(length):
      DNA+=choice("CGTA")
      return DNA

If your return statement is inside the for loop, the function exits during the first iteration, so you always get back a single character.

From the Python Documentation on return statements: "return leaves the current function call with the expression list (or None) as return value."

So, put the return at the end of your function:

def String(length):
    DNA = ""
    for count in range(length):
        DNA += choice("CGTA")
    return DNA

EDIT: Here's a weighted choice method. It currently works only for single-character strings with whole-number weights, since it relies on string repetition.

def weightedchoice(items): # this doesn't require the numbers to add up to 100
    return choice("".join(x * y for x, y in items))

Then, you want to call weightedchoice instead of choice in your loop:

DNA+=weightedchoice([("C", 10), ("G", 20), ("A", 40), ("T", 30)])
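
On Python 3.6 or later, the same weighting can also be sketched with the standard library's random.choices, which takes a weights argument, so no string repetition is needed. The name weighted_dna and the example weights below are illustrative only and are not part of the original code.

```python
import random

def weighted_dna(length, weights):
    """Build a random DNA string; `weights` maps each base to a relative weight."""
    bases = list(weights)
    # random.choices draws k items with replacement, proportional to the weights
    picks = random.choices(bases, weights=[weights[b] for b in bases], k=length)
    return "".join(picks)

print(weighted_dna(20, {"C": 10, "G": 20, "A": 40, "T": 30}))
```

As with weightedchoice, the weights are relative and do not need to add up to 100.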

Question 2

I'd generate the string all in one go, rather than build it up. Unless Python's being clever and optimising the string additions, it'll reduce the runtime complexity from quadratic to linear.

import random

def DNA(length):
    return ''.join(random.choice('CGTA') for _ in range(length))

print(DNA(5))
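
To see whether the difference matters on your interpreter, a rough timing sketch along these lines may help. The function names dna_concat and dna_join and the 20000-base length are arbitrary choices for illustration, and CPython can optimise in-place string appends in some cases, so the measured gap may be small.

```python
import random
import timeit

def dna_concat(length):
    # Builds the string one character at a time
    dna = ""
    for _ in range(length):
        dna += random.choice("CGTA")
    return dna

def dna_join(length):
    # Builds the string in a single join over a generator
    return "".join(random.choice("CGTA") for _ in range(length))

for func in (dna_concat, dna_join):
    seconds = timeit.timeit(lambda: func(20000), number=5)
    print(f"{func.__name__}: {seconds:.3f}s")
```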

Question 3

I've extended the code to support any GC percentage from 0 to 100%. The code above always averages 50% GC, because all four bases are equally likely.

The actg_distribution string can be an existing DNA sequence of any length with a known GC percent. Bases are drawn from it at random, so the output matches its composition on average. Generating sequences within a particular GC range is a common use case.


import random

# Return random CGTA sequences, set minimum = maximum to get a specified length.
def random_length_dnasequence(minimum=25, maximum=10000, actg_distribution=None):
    if minimum == maximum:
        length = minimum
    else:
        length = random.randint(minimum, maximum)
    if actg_distribution is None:
        # No distribution given: draw from a random pool of 7 bases
        actg_distribution = ''.join(random.choice('cgta') for _x in range(7))

    return ''.join(random.choice(actg_distribution) for _x in range(length))


def random_dnasequence(length, actg_distribution=None):
    return random_length_dnasequence(length, length, actg_distribution)
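
One way to check that a pool string gives the GC content you expect is to measure it directly. This is only a sketch: gc_fraction and the template string are hypothetical names introduced here, and the sampling line simply mirrors how the functions above draw bases from actg_distribution.

```python
import random

def gc_fraction(sequence):
    """Return the fraction of G and C bases in a sequence (case-insensitive)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)

# A template pool in which 3 of 10 bases are G or C, i.e. 30% GC
template = "gcgaaaattt"
sample = "".join(random.choice(template) for _ in range(10000))
print(f"template GC: {gc_fraction(template):.2f}, sample GC: {gc_fraction(sample):.2f}")
```

The sample value will drift slightly around the template value from run to run, since each base is drawn independently.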

Question 4

A fast function for Python 3.6+ using random.choices:

import random

def string(length=0, letters="CGTA"):
    # slower: 0.05s for 20000 nt
    # dna = ""
    # for count in range(length):
    #     dna += random.choice(letters)
    # return dna

    # 0.013s for 20000 nt
    return ''.join(random.choices(letters, k=length))

Question 5

Perhaps NumPy is faster thanks to vectorization:

import numpy as np
seq_length = 100
my_seq = ''.join(np.random.choice(('C','G','T','A'), seq_length ))