Question

I am trying to extract a subset of sequences from a fasta file based on ID list, so far so good. My issue is that my ID list contains an extra column two (which represents the coding part of the sequence) and I want to keep that in the new fasta file

File 1: Id list
>TCONS_00000004  654:819
>TCONS_00000006  238:367
>TCONS_00000009  956:1555

File 2: fasta file
>TCONS_00000004
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000006
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000009
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG

Expected outcome:
 >TCONS_00000004 654:819
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000006 238:367
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG
>TCONS_00000009 956:1555
AAAATAATAAACTTTGCAAAGAGCAAATTTGAAGAAGCAGTTGATATACGTCGAGAGATTTCTGCAACAG 
CGCGATTATATTACATTCAATTAATTAAAATGCAGTACAGAGACATCCACGATTTCGTCAACATACCAGG

I tried with the following biopython commands but it will only extract from file2 without the additional numbers that I need.

from Bio import SeqIO
id = []
for line in open("test.txt","r"):
    id.append(line.rstrip().strip('\t'))
for rec in SeqIO.parse("mymodified_transcript.fa","fasta"):
    if rec.id in id:
        print rec.format("fasta")

How can I keep the additional numbers and also extract the sequences from file2? Or replace the name in the file 2 by the ones in file 1? Thanks for your help

Was it helpful?

Solution

I got the solution. It worked fine in my Ubuntu. Please try this :)

from Bio import SeqIO
temp = {}
for line in open("test.txt","r"):
    i, c = line.strip().split()
    temp[i] = c

for rec in SeqIO.parse("mymodified_transcript.fa","fasta"):
    if str('>'+rec.id) in temp.keys():
        print str('>'+rec.id), temp['>'+rec.id]
        print str(rec.seq)

OTHER TIPS

Why not use a dictionary for the id lookup rather than a list? For e.g.

from Bio import SeqIO
id = {}
for line in open("test.txt","r"):
    i, c = line.strip().split()
    id[i] = c
for rec in SeqIO.parse("mymodified_transcript.fa","fasta"):
    if rec.id in id:
        print rec.id, id[rec.id]
        print rec.format("fasta")
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top