Get a specific sequence from a FASTA file with regex

https://stackoverflow.com/questions/17225019

01-06-2022

Question

I would like to retrieve the n^th sequence (or preferably the n^th to m^th sequences) from an input FASTA file, ideally with a Unix "one-liner".

I know I could read the file with Perl (or any other scripting language), count the records, and then print the sequence, but I'm looking for something faster and more compact.

For those unaware, a sample FASTA file looks like the following:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Solution

One way with awk:

awk -v RS='>' -v start="$n" -v end="$m" 'NR>=(start+1)&&NR<=(end+1){printf ">%s", $0}' fasta_file
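
The offset of one in the command above comes from the record separator: with RS='>' the text before the first '>' counts as an empty record 1, so sequence k is record k+1. A minimal sketch of this on a three-record sample file follows. The function name fasta_range and the sample data are made up for illustration, and printf is used so that no extra blank line follows each record.

```shell
# Hypothetical helper (not from the original answer): print records $1..$2 of file $3.
fasta_range() {
    # With RS='>' record 1 is the empty text before the first '>',
    # so sequence k is awk record k+1.
    awk -v RS='>' -v start="$1" -v end="$2" \
        'NR >= start + 1 && NR <= end + 1 { printf ">%s", $0 }' "$3"
}

# Made-up sample file with three records; the first is wrapped over two lines.
sample=$(mktemp)
printf '>A\nAAAA\nAA\n>B\nCCCC\n>C\nGGGG\n' > "$sample"

# Print the 2nd to 3rd sequences.
fasta_range 2 3 "$sample"
```

On the sample file this should print the B and C records with their headers, and nothing from record A.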

OTHER TIPS

Here are two ways using awk.

If each record is exactly two lines (one header line followed by the whole sequence on a single line), this would work:

awk -v n=5 -v m=8 'NR == n * 2 - 1, NR == m * 2' file.fa
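
The line arithmetic above (record k occupies lines 2k-1 and 2k) only holds when the file strictly alternates header and sequence lines. A rough sketch of a check for that precondition follows. The function name is_two_line_fasta and the two sample files are hypothetical.

```shell
# Hypothetical check: succeed only if odd lines are headers, even lines are not,
# and the file has an even number of lines.
is_two_line_fasta() {
    awk 'NR % 2 == 1 && !/^>/ { bad = 1; exit }
         NR % 2 == 0 &&  /^>/ { bad = 1; exit }
         END { exit (bad || NR % 2) }' "$1"
}

# Made-up sample files: one with single-line sequences, one with a wrapped sequence.
flat=$(mktemp)
wrapped=$(mktemp)
printf '>A\nAAAA\n>B\nCCCC\n' > "$flat"
printf '>A\nAAAA\nAA\n>B\nCCCC\n' > "$wrapped"

is_two_line_fasta "$flat" && echo "flat: line arithmetic is safe"
is_two_line_fasta "$wrapped" || echo "wrapped: count headers instead"
```

If the check fails, a header-counting approach is the safer choice.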

If your sequences span multiple lines, then counting header lines is more appropriate:

awk -v n=5 -v m=8 '/^>/ { c++ } c == n { f=1 } c == m + 1 { f=0 } f' file.fa
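
Since the question asks for something fast, one possible variation on the header-counting idea is to stop reading as soon as record m+1 begins, which avoids scanning the rest of a large file. This is a sketch under that assumption, not part of the original answer. The function name fasta_slice and the sample data are invented for the example.

```shell
# Hypothetical helper: print records $1..$2 of file $3, exiting early after the range.
fasta_slice() {
    awk -v n="$1" -v m="$2" '
        /^>/ { c++ }      # each header line starts a new record
        c > m { exit }    # past the requested range: stop reading
        c >= n            # inside the range: print the current line
    ' "$3"
}

# Made-up sample file with four records; the first is wrapped over two lines.
demo=$(mktemp)
printf '>A\nAAAA\nAA\n>B\nCCCC\n>C\nGGGG\n>D\nTTTT\n' > "$demo"

# Print the 2nd to 3rd sequences; record D is never processed.
fasta_slice 2 3 "$demo"
```

Like the command above, this makes no assumption about how the headers are named or how many lines each sequence spans.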

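With sed (this relies on the headers being named SEQUENCE_1, SEQUENCE_2, and so on, as in the sample, and on sequence m+1 existing, since its header marks the end of the range):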

sed -n '/SEQUENCE_'$n'/,/SEQUENCE_'$(($m + 1))'/p' input | sed '$d'

A sed one-liner (no pipe needed), with the same assumption about header names:

sed '/>SEQUENCE_'$n'/, />SEQUENCE_'$(($m + 1))'/!d;{/>SEQUENCE_'$(($m + 1))'/d}' file

Licensed under: CC-BY-SA with attribution
