Biopython — reading a fixed number of seq_records at a time

https://stackoverflow.com/questions/11348880

19-06-2021

Question

I built some code that retrieves PHRED scores from a fastq file, puts them all into a single list, and then passes the list on to another function. It looks like so:

import itertools
from Bio import SeqIO

def PHRED_get():
    all_scores = []
    # fastq_location (path to the FASTQ file) is assumed to be defined elsewhere
    print("Building PHRED score bins...")
    for seq_record in SeqIO.parse(fastq_location, "fastq"):
        # Per-base PHRED qualities for this read
        temp_scores = seq_record.letter_annotations['phred_quality']
        all_scores.append(temp_scores)
    # Flatten the list of per-read lists into one list of scores
    all_scores = list(itertools.chain(*all_scores))
    score_bin_maker(all_scores)

The problem is that this loop will continue until all seq_records have been searched and corresponding PHRED scores retrieved. In order to be more RAM conservative, I'd like to have some code that reads a smaller number of seq_records at a time (say, 100), and then pops their respective quality scores onto my ongoing uberlist. It would then go grab info from the next 100 seq_records and do the loop again. I'm having trouble understanding how to get this done. Any ideas?

Solution

Simple: Keep a counter and when it reaches 100, break from the loop. Or some other early halt condition like if len(temp_scores) > 1000: break would work too.
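A rough sketch of the counter approach, in case it helps. The function name first_n_scores and the limit parameter are made up for illustration; records can be any iterator of records that carry letter_annotations['phred_quality'], for example what SeqIO.parse(fastq_location, "fastq") yields.

```python
def first_n_scores(records, limit=100):
    """Collect PHRED scores from at most `limit` records using a plain counter.

    `records` is assumed to be an iterable of objects exposing
    letter_annotations["phred_quality"], such as Biopython SeqRecords.
    """
    all_scores = []
    for count, seq_record in enumerate(records):
        if count >= limit:
            # Early halt: stop once `limit` records have been processed
            break
        all_scores.extend(seq_record.letter_annotations["phred_quality"])
    return all_scores
```

Note that the check happens after the next record has already been pulled from the iterator, so this form suits "take the first N and stop" rather than resuming from the same iterator later.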

Elegant: Use itertools.islice to take just the first 100 records from the iterator:

import itertools
from Bio import SeqIO

def PHRED_get():
    all_scores = []
    # fastq_location (path to the FASTQ file) is assumed to be defined elsewhere
    print("Building PHRED score bins...")
    # islice stops the parser after the first 100 records
    for seq_record in itertools.islice(SeqIO.parse(fastq_location, "fastq"), 100):
        temp_scores = seq_record.letter_annotations['phred_quality']
        all_scores.append(temp_scores)
    # Flatten the list of per-read lists into one list of scores
    all_scores = list(itertools.chain(*all_scores))
    score_bin_maker(all_scores)
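The code above only takes the first 100 records. If the goal is to keep going in chunks of 100, one possible extension is to call islice repeatedly on the same iterator until it runs dry. This is only a sketch: batch_iterator and scores_from_batch are hypothetical helper names, and calling score_bin_maker once per batch is an assumption about how that function could be used.

```python
import itertools

def batch_iterator(records, batch_size=100):
    """Yield successive lists of up to `batch_size` items from any iterator."""
    iterator = iter(records)
    while True:
        # Each islice call resumes where the previous one stopped
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def scores_from_batch(batch):
    """Flatten the per-base PHRED scores of every record in one batch."""
    return list(itertools.chain.from_iterable(
        rec.letter_annotations["phred_quality"] for rec in batch))

# Hypothetical usage with Biopython (needs `from Bio import SeqIO`):
# for batch in batch_iterator(SeqIO.parse(fastq_location, "fastq"), 100):
#     score_bin_maker(scores_from_batch(batch))
```

Memory only stays low if each batch is processed and then dropped. Appending every batch to one big list ends up holding all the scores in RAM again.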

Licensed under: CC-BY-SA with attribution
