find unique first top and bottom lines of fastq file from fasta file [closed]

Question 1

Unless you have a strong reason to do this yourself, use Biopython.

fasta:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG

fastq (based on yours but not identical because your output was badly formatted):

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########

Code:

from Bio import SeqIO

with open("fasta") as fh:
    fasta = fh.read().splitlines()

seen = set()

for record in SeqIO.parse(open('fastq'), 'fastq'):
    seq = str(record.seq)
    if seq in fasta and seq not in seen:
        seen.add(seq)
        print record.format('fastq')

EDIT: The above prints records in the order of the fastq file, not the fasta file. If order is not important, you should use that method. Otherwise, you can add the records to a dictionary where the key is their index in the FASTA file, and print them all in the end, sorting the dictionary:

from Bio import SeqIO
import sys

with open("fasta") as fh:
    fasta = fh.read().splitlines()

seen = set()
records = {}

for record in SeqIO.parse(open('fastq'), 'fastq'):
    seq = str(record.seq)
    if seq in fasta and seq not in seen:
        seen.add(seq)
        records[fasta.index(seq)] = record

for record in sorted(records):
    sys.stdout.write(records[record].format('fastq'))

(Here I also use sys.stdout.write instead of print, to avoid the extra newlines.)

Question 2

So it sounds like you want a fastq with only unique sequences?

This is a horribly inefficient way to do it but it should work. It stores your fastq file as a list so hopefully its not too big. It will only throw out duplicate sequences, not quality scores or anything.

fastqFile = list(open(fastq))
out = []
output = open('output.fastq', 'at')

for lineNum, line in enumerate(fastqFile):
    if lineNum < 4:
        out.append(line)
        output.write(line)
    else:
        if line not in out and lineNum % 4 != 3:
            output.write(fastqFile[lineNum - 1])
            output.write(line)
            output.write(fastqFile[lineNum + 1])
            output.write(fastqFile[lineNum + 2])
            out.append(fastqFile[lineNum - 1])
            out.append(line)
            out.append(fastqFile[lineNum + 1])
            out.append(fastqFile[lineNum + 2])

Question 3

I think I knew what you're trying to say, so here's my code. It will only take the first occurrence of the fasta sequence, as requested. This probably isn't the best way to do it, but I'm new at python.

# open the file into a list
fasta = open('fasta1.fa', 'r').read().splitlines()
fastq = open('fastq1.fq', 'r').read().splitlines()

# get rid of headers
# if headers important, please indicate in example
fastaseq = [s for s in fasta if not any('>' in t for t in s)]

# get rid of whitespace
fastaseq = filter(None, fastaseq)
fastq = filter(None, fastq)

# new list
newfastq = []

# go through each item in your fasta list
# if it matches, get the line above and below
# put in the new list
for fa in fastaseq:
    if fa in fastq:
        ind = fastq.index(fa)
        printblock = fastq[ind-1:ind+2]
    elif fa not in fastq:
        printblock = []
    if printblock:
        newfastq.append(printblock)

# print everything to file  
with open('fastq2.fq', 'w') as f:
    for block in newfastq:
        for item in block:
            f.writelines(item + '\n')