Question

I have a fasta file like such

">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronY
acgtacgtNNNNa
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt

I need to remove sequences with at least 2 N in a row (because these introns are misannotated).

Here, it would be sequence " >ENS..._intronY " (Line 3, 4, and 5 should be removed)

any suggestions?

Thank you,

Was it helpful?

Solution

With

awk -v RS='">' '!/NN/{printf $0RT}' file
">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt    

OTHER TIPS

Since it appears you're pursuing bioinformatics, consider becoming familiar with Bio::SeqIO, as it'll help with this and many other fasta parsing jobs:

use strict;
use warnings;
use Bio::SeqIO;

my $in = Bio::SeqIO->new( -file => shift, -format => 'Fasta' );

while ( my $seq = $in->next_seq() ) {
    print '>' . $seq->id . ' ' . $seq->desc . "\n" . $seq->seq . "\n"
      if $seq->seq !~ /nn/i;
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>ENS..._intronX 
acgtacgtacgtacgt
>ENS..._intronZ 
acgtacgtacgtacgtacgtacgtacgtacgt

Hope this helps!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top