Вопрос

I have 2 files one is fasta file and other one is fastq file. I want to take the fasta, read each line and search each line in the fastq file and print top line and bottom lines. This is what I have

fasta file

read1

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG

for seq in `cat sequences`;do grep -A 2 -B 1 $seq FAP.1.txt;done

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

+

CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############

@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

+ ;@?DBD<@@FFHHH<

@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

+

@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85

@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG

@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########

From this we can see that AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA appeared twice, but I want to print only once. how can I do that?

Final output file

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC

+

CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############

@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

+

;@?DBD<@@FFHHH<

@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG

+

@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B

Это было полезно?

Решение

Unless you have a strong reason to do this yourself, use Biopython.

fasta:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG

fastq (based on yours but not identical because your output was badly formatted):

@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########

Code:

from Bio import SeqIO

with open("fasta") as fh:
    fasta = fh.read().splitlines()

seen = set()

for record in SeqIO.parse(open('fastq'), 'fastq'):
    seq = str(record.seq)
    if seq in fasta and seq not in seen:
        seen.add(seq)
        print record.format('fastq')

EDIT: The above prints records in the order of the fastq file, not the fasta file. If order is not important, you should use that method. Otherwise, you can add the records to a dictionary where the key is their index in the FASTA file, and print them all in the end, sorting the dictionary:

from Bio import SeqIO
import sys

with open("fasta") as fh:
    fasta = fh.read().splitlines()

seen = set()
records = {}

for record in SeqIO.parse(open('fastq'), 'fastq'):
    seq = str(record.seq)
    if seq in fasta and seq not in seen:
        seen.add(seq)
        records[fasta.index(seq)] = record

for record in sorted(records):
    sys.stdout.write(records[record].format('fastq'))

(Here I also use sys.stdout.write instead of print, to avoid the extra newlines.)

Другие советы

So it sounds like you want a fastq with only unique sequences?

This is a horribly inefficient way to do it but it should work. It stores your fastq file as a list so hopefully its not too big. It will only throw out duplicate sequences, not quality scores or anything.

fastqFile = list(open(fastq))
out = []
output = open('output.fastq', 'at')

for lineNum, line in enumerate(fastqFile):
    if lineNum < 4:
        out.append(line)
        output.write(line)
    else:
        if line not in out and lineNum % 4 != 3:
            output.write(fastqFile[lineNum - 1])
            output.write(line)
            output.write(fastqFile[lineNum + 1])
            output.write(fastqFile[lineNum + 2])
            out.append(fastqFile[lineNum - 1])
            out.append(line)
            out.append(fastqFile[lineNum + 1])
            out.append(fastqFile[lineNum + 2])

I think I knew what you're trying to say, so here's my code. It will only take the first occurrence of the fasta sequence, as requested. This probably isn't the best way to do it, but I'm new at python.

# open the file into a list
fasta = open('fasta1.fa', 'r').read().splitlines()
fastq = open('fastq1.fq', 'r').read().splitlines()

# get rid of headers
# if headers important, please indicate in example
fastaseq = [s for s in fasta if not any('>' in t for t in s)]

# get rid of whitespace
fastaseq = filter(None, fastaseq)
fastq = filter(None, fastq)

# new list
newfastq = []

# go through each item in your fasta list
# if it matches, get the line above and below
# put in the new list
for fa in fastaseq:
    if fa in fastq:
        ind = fastq.index(fa)
        printblock = fastq[ind-1:ind+2]
    elif fa not in fastq:
        printblock = []
    if printblock:
        newfastq.append(printblock)

# print everything to file  
with open('fastq2.fq', 'w') as f:
    for block in newfastq:
        for item in block:
            f.writelines(item + '\n')
Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top