Biopython Large Sequence splitting

https://stackoverflow.com//questions/23011570

21-12-2019
|

Question

I'm a newbie in the field of python programming. As I was trying to do some analysis,(I've tried to find the answer on other posts, but nothing) I decided to post my first and probably very foolish question. Why does this create only one output file although in this example there were supposed to be at least 8 (sequence is more than 8000 characters). Thank you for your answer upfront.

def batch_iterator(iterator, batch_size) :
    entry = True
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch


from Bio import SeqIO
record_iter = SeqIO.parse(open("some.fasta"),"fasta")
for i, batch in enumerate(batch_iterator(record_iter, 1000)) :   #I think sth is wrong here?
    filename = "group_%i.fasta" % (i+1)
    handle = open(filename, "w")
    count = SeqIO.write(batch, handle, "fasta")
    handle.close()
    print "Wrote %i records to %s" % (count, filename)

Solution

Sequence chunks

After a long discussion with the OP, here is my very restructured proposal, using the generator function defined in this other SO thread

# file: main.py
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == '__main__':
    handle = open('long.fasta', 'r')
    records = list(SeqIO.parse(handle, "fasta"))
    record = records[0]

    for pos, chunk in enumerate(chunks(record.seq.tostring(), 1000)):
        chunk_record = SeqRecord(Seq(
            chunk, record.seq.alphabet),
            id=record.id, name=record.name,
            description=record.description)
        outfile = "group_%d.fasta" % pos
        SeqIO.write(chunk_record, open(outfile, 'w'), "fasta")

Note that your original code does something very different: it takes new records from the generator provided by the SeqIO.parse function, and tries to store them in different files. If you want to split a single record in smaller sub-sequences, you have to access the record's internal data, which is done by record.seq.tostring(). The chunks generator function, as described in the other thread linked above, returns as many chunks as is possible to build from the passed in sequence. Each of them is stored as a new fasta record in a different file (if you want to keep just the sequence, write the chunk directly to the opened outfile).

Check that it works

Consider the following code:

# file: generate.py
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio import SeqIO

long_string = "A" * 8000
outfile = open('long.fasta', 'w')

record = SeqRecord(Seq(
    long_string,
    IUPAC.protein),
    id="YP_025292.1", name="HokC",
    description="toxic membrane protein, small")

SeqIO.write(record, outfile, "fasta")

It writes a single record to a file named "long.fasta". This single record has a Sequence inside that is 8000 characters long, as generated in long_string.

How to use it:

$ python generate.py
$ wc -c long.fasta
8177 long.fasta

The overhead over 8000 characters is the file header.

How to split that file in chunks of 1000 length each, with the code snippet above:

$ python main.py
$ ls
generate.py   group_1.fasta group_3.fasta group_5.fasta group_7.fasta main.py
group_0.fasta group_2.fasta group_4.fasta group_6.fasta long.fasta
$ wc -c group_*
1060 group_0.fasta
1060 group_1.fasta
1060 group_2.fasta
1060 group_3.fasta
1060 group_4.fasta
1060 group_5.fasta
1060 group_6.fasta
1060 group_7.fasta
8480 total

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow