Question
I have a FASTA file like this:
">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronY
acgtacgtNNNNa
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt
I need to remove every sequence that contains at least two N in a row
(because these introns are misannotated).
Here, that would be the sequence ">ENS..._intronY"
(lines 3, 4, and 5 should be removed).
Any suggestions?
Thank you,
Solution
With gawk (RT, the text that matched the record separator, is a gawk extension):
awk -v RS='">' '!/NN/{printf "%s", $0 RT}' file
Output:
">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt
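The one-liner relies on gawk's RT variable, tests the header text as well as the sequence, and cannot see a run of N that is split across a line break. A more portable sketch follows, assuming header lines start with > (optionally preceded by a double quote, as in the sample). The names filter_nn and emit are hypothetical and not part of the original answer:

```shell
# Hypothetical helper: print only the FASTA records whose sequence has no "NN".
# Assumption: a header line starts with '>' or with '">' as in the sample data.
filter_nn() {
    awk '
        # Print the buffered record unless its joined sequence contains NN.
        function emit() { if (rec != "" && seq !~ /NN/) printf "%s", rec }

        # Header line: flush the previous record, then start a new one.
        /^"?>/ { emit(); rec = $0 "\n"; seq = ""; next }

        # Sequence line: keep the original text and build the joined sequence,
        # so a run of N that wraps onto the next line is still detected.
        # Lines that appear before the first header are ignored.
        rec != "" { rec = rec $0 "\n"; seq = seq $0 }

        END { emit() }
    ' "$@"
}
```

Usage: filter_nn file > filtered.fa
The match is case-sensitive, like the gawk one-liner, so only uppercase N counts; kept records are printed with their original line wrapping.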
OTHER TIPS
Since it appears you're pursuing bioinformatics, consider becoming familiar with Bio::SeqIO, as it'll help with this and many other fasta parsing jobs:
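use strict;
use warnings;
use Bio::SeqIO;

# Read the FASTA file named by the first command-line argument.
my $in = Bio::SeqIO->new( -file => shift, -format => 'Fasta' );

while ( my $seq = $in->next_seq() ) {
    # Skip any record whose sequence contains two or more N in a row (case-insensitive).
    # The sequence is printed on a single line; the original wrapping is not preserved.
    print '>' . $seq->id . ' ' . $seq->desc . "\n" . $seq->seq . "\n"
        if $seq->seq !~ /nn/i;
}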
Usage: perl script.pl inFile [>outFile]
The last, optional parameter directs output to a file.
Output on your dataset:
>ENS..._intronX
acgtacgtacgtacgt
>ENS..._intronZ
acgtacgtacgtacgtacgtacgtacgtacgt
Hope this helps!