Question

I have a huge (61 GB) FASTQ file that I want to take a random subset of, but it is too large to load into memory. The problem with FASTQ files is that every four lines belong together as one record; otherwise I would just create a list of random integers and write only the lines at those indices to my subset file.

So far, I have this:

import random
num = []    
while len(num) < 50000000:
    ran = random.randint(0,27000000)
    if (ran%4 == 0) and (ran not in num):
        num.append(ran)
num = sorted(num)

fastq = open("all.fastq", "r", 4)
subset = open("sub.fastq", "w")
for i,line in enumerate(fastq):
    for ran in num:
        if ran == i:
            subset.append(line)

I have no idea how to reach the next three lines in the file before going to the next random integer. Can someone help me?

Solution 2

You could try this:

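import random

# Line counts are taken from the question; adjust them to the real file.
n_records = 27000000 // 4    # 4-line records in all.fastq
n_wanted = 50000000 // 4     # number of random draws

# First-line index of each chosen record. The set drops repeated draws,
# so the subset can contain fewer than n_wanted records.
num = sorted({random.randrange(n_records) * 4 for i in range(n_wanted)})

pos = 0               # index in num of the next record to copy
lines_to_write = 0    # lines of the current record still to be written
with open("all.fastq", "r") as fastq:
    with open("sub.fastq", "w") as subset:
        for i, line in enumerate(fastq):
            if pos < len(num) and i == num[pos]:
                pos += 1
                lines_to_write = 4
            if lines_to_write > 0:
                lines_to_write -= 1
                subset.write(line)
            elif pos == len(num):
                break    # every chosen record has been written in full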

OTHER TIPS

  1. Iterate over the file in chunks of four lines.
  2. Take a random sample from that iterator.

The idea is that you can sample from a generator without random access, by iterating through it and choosing (or not) each element in turn.
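One way to sketch these two steps is with reservoir sampling, a standard technique for drawing a fixed-size uniform sample from a stream of unknown length. The answer above does not name a specific algorithm, so this is an assumption about what was meant, and the helper names (fastq_records, reservoir_sample, write_subset) are made up for illustration.

```python
import random
from itertools import islice


def fastq_records(handle):
    """Yield successive 4-line chunks from an open file as tuples of lines.

    A truncated final record (fewer than 4 lines) is yielded as a short tuple.
    """
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        yield record


def reservoir_sample(items, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable."""
    reservoir = []
    for n, item in enumerate(items):
        if n < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (n + 1),
            # overwriting a uniformly chosen slot.
            j = rng.randint(0, n)
            if j < k:
                reservoir[j] = item
    return reservoir


def write_subset(in_path, out_path, k):
    """Write k randomly chosen records from in_path to out_path."""
    with open(in_path) as fastq, open(out_path, "w") as subset:
        for record in reservoir_sample(fastq_records(fastq), k):
            subset.writelines(record)
```

Calling write_subset("all.fastq", "sub.fastq", k) reads the input once and never holds more than k records in memory, so it only suits subsets that fit in RAM. The sampled records are not written in their original file order.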

Licensed under: CC-BY-SA with attribution