Question

I have two files A and B in FASTQ format, each of which is basically several hundred million lines of text organized in groups of 4 lines, where each group starts with an @, as follows:

@120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB

I need to compare the

5:1101:1156:2031#0/

part between files A and B, and write the groups of 4 lines in file B that matched to a new file. I have a piece of Python code that does this, but it only works for small files, since it scans through all the @-lines of file B for every @-line in file A, and both files contain hundreds of millions of lines.

Someone suggested that I should create an index for file B; I have googled around without success and would be very grateful if someone could point out how to do this, or point me to a tutorial, so I can learn. Thanks.

==EDIT== In theory, each group of 4 lines should exist only once in each file. Would breaking out of the parsing loop after each match increase the speed enough, or do I need a different algorithm altogether?


Solution

An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon (':') on the @-line and the final slash ('/') near the end - as well as some kind of value.

Since the "value" in this case is the entire contents of the 4-line block, and since our index is going to store a separate entry for each block, we would be storing the entire file in memory if we used the actual value in the index.

Instead, let's use the file position of the beginning of the 4-line block. That way, you can seek to that file position, print 4 lines, and stop. The total cost is the 4 or 8 or however many bytes it takes to store an integer file position, instead of however many bytes of actual genome data.
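As a minimal sketch of the lookup side (read_block is a hypothetical helper name of mine; it assumes pos is a byte offset that was recorded while scanning the file):

def read_block(path, pos):
    # Jump straight to a saved byte offset and return the 4-line block there.
    with open(path) as f:
        f.seek(pos)
        return [f.readline() for _ in range(4)]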

Here is some code that does the full job, but also does a lot of validation and checking. You might want to throw away the parts you don't use.

import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates - use the first occurrence.
            index[key] = pos

    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert s.startswith('@')
    (machine_name, rest) = s.split(':', 1)
    # Per Wikipedia, this part changes between variants of the FASTQ format.
    # Use rsplit so we cut at the final slash, as described above:
    (key, rest) = rest.rsplit('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    i = 0
    # Note: iterating a file is incompatible with fh.tell(), so fake it by
    # summing line lengths. This assumes ASCII data and '\n' line endings,
    # so that character counts equal byte offsets.
    pos = offset = 0
    with open(path) as f:
        for line in f:
            offset += len(line)
            lx = i % 4
            i += 1
            if lx == 0:     # @machine:key line
                key = extract_key(line)
                len1 = len2 = 0
                data = [line]
            elif lx == 1:   # sequence data
                data.append(line)
                len1 = len(line)
            elif lx == 2:   # +machine:key line, or just '+'
                assert line.startswith('+')
                data.append(line)
            else:           # lx == 3: quality data
                data.append(line)
                len2 = len(line)

                if len2 != len1:
                    error("Data length mismatch at line "
                          + str(i - 2)
                          + " (len: " + str(len1) + ") and line "
                          + str(i)
                          + " (len: " + str(len2) + ")")
                yield key, pos, data
                pos = offset

    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i))

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)

    return results

def write_matches(inpath, matches, outpath):
    with open(inpath) as rf, open(outpath, 'w') as wf:
        for m in matches:
            rf.seek(m)
            for _ in range(4):
                wf.write(rf.readline())

index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [index[k] for k in matches]
write_matches('afile.fastq', posns, 'outfile.fastq')

Note that this code goes back to the first file (afile.fastq) to get the blocks of data. If your data is identical between the two files, you could instead copy the block from the second file at the moment a match occurs.
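If it is the blocks from file B you need, as in the original question, a minimal variant might look like the following sketch; it reuses parse_fastq from above and keeps only a set of file A's keys in memory (write_b_matches is an illustrative name of mine):

def write_b_matches(a_path, b_path, outpath):
    # Remember only the keys from file A; the blocks themselves stay on disk.
    keys = {key for key, pos, data in parse_fastq(a_path)}

    with open(outpath, 'w') as wf:
        for key, pos, data in parse_fastq(b_path):
            if key in keys:
                # 'data' already holds this block's 4 lines from file B.
                wf.writelines(data)

write_b_matches('afile.fastq', 'bfile.fastq', 'outfile.fastq')

This variant also preserves file B's original block order and avoids the seek step entirely.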

Note also that, depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to ensure the keys are unique - or, alternatively, allow duplicate keys but keep them in the order they match. That's up to you; I'm not sure what you're doing with the data.

OTHER TIPS

These guys claim to parse files of a few gigabytes using a dedicated library (Biopython's SeqIO); see http://www.biostars.org/p/15113/

from Bio import SeqIO

fastq_parser = SeqIO.parse(fastq_filename, "fastq")
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")
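The elided condition could, for instance, test against a precomputed set of keys. A sketch, assuming Biopython is installed and using the header layout from the question (wanted_keys and key_of are illustrative names, not part of the linked answer):

from Bio import SeqIO

# Hypothetical: keys collected from file A beforehand.
wanted_keys = {'5:1101:1156:2031#0'}

def key_of(rec):
    # rec.id is the header line minus the leading '@'; take the text
    # between the first ':' and the final '/'.
    return rec.id.split(':', 1)[1].rsplit('/', 1)[0]

fastq_parser = SeqIO.parse('bfile.fastq', 'fastq')
wanted = (rec for rec in fastq_parser if key_of(rec) in wanted_keys)
SeqIO.write(wanted, 'outfile.fastq', 'fastq')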

A better approach IMO would be to parse the file once and load the data into a database instead of that output_file (e.g. MySQL), and later run the queries there.
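For example, here is a sketch of the database route using sqlite3 from Python's standard library (substituted for MySQL here just to keep the example self-contained; the table and file names are illustrative, and parse_fastq is reused from the solution above):

import sqlite3

conn = sqlite3.connect('fastq_index.db')
conn.execute('CREATE TABLE IF NOT EXISTS b_index (key TEXT PRIMARY KEY, pos INTEGER)')

# Load file B's keys and block offsets once.
conn.executemany('INSERT OR IGNORE INTO b_index VALUES (?, ?)',
                 ((key, pos) for key, pos, data in parse_fastq('bfile.fastq')))
conn.commit()

# Later, each lookup is an indexed query rather than a scan of file B.
row = conn.execute('SELECT pos FROM b_index WHERE key = ?',
                   ('5:1101:1156:2031#0',)).fetchone()
if row is not None:
    print('matching block in file B starts at byte', row[0])
conn.close()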

Licensed under: CC-BY-SA with attribution