Extract rows and substrings from one file conditional on information of another file

Question 1

As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here is a script that shows how to handle it with awk:

Content of script.awk:

## Process first file from arguments.
FNR == NR {
        ## Save ID and the range of characters to remove from sequence.
        blast[ $1 ] = $(NF-1) " " $NF
        next
}

## Process second file. For each FASTA id...
$1 ~ /^>/ {
        ## Get the ID: the header without the leading ">".
        id = substr( $1, 2 )

        ## Read next line (the sequence).
        getline sequence

        ## If the ID is one found in the other file, get the range and
        ## remove those characters from the sequence.
        if ( id in blast ) {
                split( blast[id], ranges )
                sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )
                ## Print both lines with the shortened sequence.
                printf "%s\n%s\n", $0, sequence
        }

}

Assuming the 1.blast file from your question and a customized 1.fasta to test it:

>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA

Run the script like:

awk -f script.awk 1.blast 1.fasta

That yields:

>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA

Of course I'm assuming a few things, the most important being that FASTA sequences are not longer than one line.
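
If your real FASTA files wrap sequences over several lines, that one-line assumption no longer holds and each record has to be joined before any positions are applied. Here is a minimal sketch of the joining step, in Python rather than awk. `read_fasta` is a name I made up for illustration, and it assumes the ID is the first word after the ">":

```python
def read_fasta(lines):
    """Yield (id, sequence) pairs, joining sequences wrapped over several lines."""
    seq_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Assumption: the ID is the first word of the header line.
            seq_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    # Emit the last record once the input is exhausted.
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Sequence 1 from the test file, wrapped over two lines on purpose.
example = """>1
TCGACTAGCTACGACTCGGA
CTGACGAGCTACGACTACGG
>2
GCATCTGGGC
""".splitlines()

records = dict(read_fasta(example))
print(records["1"])
```

Once the records are joined like this, the ranges from the BLAST file refer to positions in the complete sequence instead of a single wrapped line.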

Question 2

If you do bioinformatics and work with DNA sequences (or even more complicated things like sequence annotations), I would recommend having a look at BioPerl. This obviously requires knowledge of Perl, but it offers quite a lot of functionality.

In your case you would want to generate Bio::Seq objects from your FASTA file using the Bio::SeqIO module.

Then you would need to read the wanted FASTA entry names and positions into a hash, with the FASTA name as the key and an array of two values (start and end) as the value for each subsequence you want to extract. If there can be more than one such subsequence per FASTA entry, the value for each key would have to be an array of arrays.

With this data structure, you could then go ahead and extract the sequences using the subseq method from Bio::Seq.
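
To make the data structure concrete, here is a rough sketch in plain Python; it is not BioPerl. The rows are made-up values, and the `subseq` helper is my own stand-in that takes 1-based, inclusive coordinates, which is how I understand the BioPerl method of the same name to count:

```python
from collections import defaultdict


def subseq(sequence, start, end):
    """Return bases start..end, counting from 1 and including both ends."""
    return sequence[start - 1:end]


# Hypothetical (id, start, end) rows; real values would come from the BLAST table.
rows = [("1", 1, 3), ("1", 10, 14), ("27620", 8, 12)]

# ID -> list of (start, end) pairs, i.e. the "array of arrays" per key.
wanted = defaultdict(list)
for seq_id, start, end in rows:
    wanted[seq_id].append((start, end))

sequences = {
    "1": "TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG",
    "27620": "TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA",
}

# Extract every wanted subsequence for each ID that has a sequence.
extracted = {
    seq_id: [subseq(sequences[seq_id], start, end) for start, end in ranges]
    for seq_id, ranges in wanted.items()
    if seq_id in sequences
}

for seq_id, parts in extracted.items():
    print(seq_id, parts)
```

Keeping a list per ID means an entry with one range and an entry with several ranges are handled by the same loop.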

I hope this is a way to go for you, although I'm sure that this is also feasible with pure bash.

Question 3

This isn't an answer; it is an attempt to clarify your problem. Please let me know if I have understood the nature of your task correctly.

foreach row in blast:
    get the proper (blast[$1]) sequence from fasta
    drop bases (blast[$7..$8]) from sequence
    print blast[$1], shortened_sequence

If I've got your task correct, you are being hobbled by your programming language (bash) and the peculiar format of your data (a record split across rows). Perl or Python would be far more suitable to the task; indeed Perl was written in part because multiple file access in awk of the time was really difficult if not impossible.
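
For what it's worth, the pseudocode above might look roughly like this in Python. This is only a sketch under my reading of the task: the function names are made up, the column numbers 7 and 8 are taken from the pseudocode, the positions are assumed to be 1-based and inclusive, and the two BLAST rows are invented for the example:

```python
def drop_range(sequence, start, end):
    """Remove bases start..end (1-based, inclusive) from sequence."""
    return sequence[:start - 1] + sequence[end:]


def shorten(blast_lines, sequences, start_col=7, end_col=8):
    """Yield (id, shortened_sequence) for every BLAST row with a known ID."""
    for line in blast_lines:
        fields = line.split()
        if len(fields) < max(start_col, end_col):
            continue  # skip blank or short rows
        seq_id = fields[0]
        if seq_id not in sequences:
            continue  # no FASTA entry for this row
        start = int(fields[start_col - 1])
        end = int(fields[end_col - 1])
        yield seq_id, drop_range(sequences[seq_id], start, end)


# Invented rows: only columns 1, 7 and 8 matter for this sketch.
blast = [
    "1 x x x x x 1 3",
    "27620 x x x x x 8 53",
]

sequences = {
    "1": "TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG",
    "2": "GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC",
    "27620": "TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA",
}

results = list(shorten(blast, sequences))
for seq_id, sequence in results:
    print(f">{seq_id}\n{sequence}")
```

Both files are simply held in memory here, so the second one is looked up by ID and the order of records in either file does not matter.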

You've come pretty far with the tools you know, but it looks like you are hitting the limits of their convenient expressibility.

Question 4

Updated the answer:

awk  '
NR==FNR && NF { 
    id=substr($1,2)
    getline seq
    a[id]=seq
    next 
} 
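($1 in a) && NF {
    # Drop bases $7..$8 (1-based, inclusive) by position. substr() takes a
    # start and a length, not an end, and sub() would treat the text as a regex.
    a[$1]=substr(a[$1],1,$7-1) substr(a[$1],$8+1)
    print ">"$1"\n"a[$1]
} ' 1.fasta 1.blast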