How to remove fasta formatted sequences that contain Ns

https://stackoverflow.com/questions/19734119

02-07-2022

Question

I have a FASTA file like this:

">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronY
acgtacgtNNNNa
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt

I need to remove every sequence that contains at least two Ns in a row (these introns are misannotated).

Here, that would be the sequence ">ENS..._intronY" (lines 3, 4, and 5 should be removed).

Any suggestions?

Thank you,

Solution

With gawk (RT, the text that matched the record separator, is a gawk extension):

awk -v RS='">' '!/NN/{printf "%s%s", $0, RT}' file
">ENS..._intronX
acgtacgtacgtacgt
">ENS..._intronZ
acgtacgtacgtacgt
acgtacgtacgtacgt
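
Note that /NN/ is tested against the whole record, so it is case-sensitive, it also matches a header that happens to contain "NN", and it misses a run of N that is wrapped across two sequence lines. If any of that matters for your data, here is a minimal Python sketch of the same filter that checks only the joined sequence. The names read_records and filter_records are made up for illustration, and treating both '>' and '">' as header prefixes is an assumption based on the sample above.

```python
import re
import sys

# Assumption: a header line starts with '>' or, as in the sample above, '">'.
HEADER = re.compile(r'^"?>')
# Two consecutive N characters, upper or lower case.
N_RUN = re.compile(r"nn", re.IGNORECASE)


def read_records(lines):
    """Yield (header, sequence_lines) pairs from an iterable of lines."""
    header, seq_lines = None, []
    for line in lines:
        line = line.rstrip("\n")
        if HEADER.match(line):
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line, []
        elif header is not None and line:
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def filter_records(lines):
    """Return the lines of every record whose sequence has no run of N."""
    kept = []
    for header, seq_lines in read_records(lines):
        # Join the wrapped lines so a run split across a line break is found.
        if not N_RUN.search("".join(seq_lines)):
            kept.append(header)
            kept.extend(seq_lines)
    return kept


# Hypothetical usage: python filter_fasta.py inFile > outFile
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for out_line in filter_records(handle):
            print(out_line)
```

Unlike the awk one-liner, this keeps the original line wrapping of each kept record and never looks at the header text when deciding what to drop.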

OTHER TIPS

Since it appears you're pursuing bioinformatics, consider becoming familiar with Bio::SeqIO (part of BioPerl), as it'll help with this and many other FASTA parsing jobs:

use strict;
use warnings;
use Bio::SeqIO;

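# Read FASTA records from the file named by the first command-line argument.
my $in = Bio::SeqIO->new( -file => shift, -format => 'Fasta' );

while ( my $seq = $in->next_seq() ) {
    # Print the record unless its sequence contains two Ns in a row (any case).
    print '>' . $seq->id . ' ' . $seq->desc . "\n" . $seq->seq . "\n"
      if $seq->seq !~ /nn/i;
}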

Usage: perl script.pl inFile [>outFile]

The optional >outFile part is a shell redirection that writes the output to a file instead of the terminal.

Output on your dataset (each sequence is printed on a single line, so wrapped lines are joined):

>ENS..._intronX 
acgtacgtacgtacgt
>ENS..._intronZ 
acgtacgtacgtacgtacgtacgtacgtacgt

Hope this helps!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow