Question

I am using a column in one file to look up values in another file. The second file is very large and I would like to find all the values in a single pass with awk. I have tried doing this with an associative array, but am stumped as to how to get the output I want. I want to take F1, use $2 to look up values in F2, and get the output I show below: $0 from F1 as the header, followed by each unique $10 string from F2 sorted and counted (i.e. piped through sort | uniq -c).

F1

+ID=dnaK.p01 12121 TTGGGCAGTTGAAACCAGACGTTTCGCCCCTATTACAGAC[T]CACAACCACATGATGACCG

F2

solid309_20110930_FRAG_BC_bcSample12273_1541_657_F3     0       NC_012759       12121   42      35M     *       0       0       ACACAACCACATGATGACCGAATATATAGTGGCTC     BBBBBBA@BBBAB@?B@BBBB<5BBBAA@:>>&B7
solid309_20110930_FRAG_BC_bcSample12295_323_1714_F3     0       NC_012759       12121   42      35M     *       0       0       ACACAACCACATGATGACCGAATATATAGTGGAGA     BB@@A@@A@@@?@<=?@@=><6*7=?9993>4&7,
solid309_20110930_FRAG_BC_bcSample12325_1148_609_F3     0       NC_012759       12121   42      35M     *       0       0       ACACAACCACATGATGACCGAATATATAGTGGAGA     BBBB@B@?@B@@A@??BBBA@<.<==:6:1>9(<-
solid309_20110930_FRAG_BC_bcSample11796_1531_1170_F3    0       NC_012759       12122   42      35M     *       0       0       CACAACCACATGATGACCGAATATATAGTGGAGCA     '&&+&&)&')&0(.,',(.3+&&&+,&&&&&&&&&
solid309_20110930_FRAG_BC_bcSample12110_1166_1149_F3    0       NC_012759       12122   42      35M     *       0       0       CACAACCACATGATGACCGAATATATAGTGGAGAC     -(:18)538;,9277*'8:<)&,0-+)//3&'1+'
solid309_20110930_FRAG_BC_bcSample183_686_962_F3        0       NC_012759       12123   42      35M     *       0       0       ACAACCACATGATGACCGAATATATAGTGGAGTGC     BB?BBBB;BBBBBB@ABB;@7AA@@A@*>?+B8@9
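
For example, with the samples above the desired output for key 12121 would be the F1 line as a header, followed by each unique $10 string and its count (derived here from the F2 rows whose $4 is 12121):

+ID=dnaK.p01 12121 TTGGGCAGTTGAAACCAGACGTTTCGCCCCTATTACAGAC[T]CACAACCACATGATGACCG
2 ACACAACCACATGATGACCGAATATATAGTGGAGA
1 ACACAACCACATGATGACCGAATATATAGTGGCTC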

I am currently doing this with the following script:

for line in `awk '{if ($1~"-") print ($2-34); else print $2}' $1`
do
    awk -v l=$line '{if ($1~"-") l=l+34; if ($2==l) print}' $1 >> f2
    awk -v l=$line '{if ($4==l) print $10}' URA2.sam | sort | uniq -c | awk '{if ($1>15) print}' >> f2
done

This requires separate awk passes over both files for every line in F1. I was thinking I could use an associative array made from F1 to do it in one pass. F2 is sorted by $4. I used the following script to try to get the output I wanted, but it produces a single global count across all keys instead of grouping the counts under each F1 line:

awk 'FNR==NR{a[$2]=$0;next}$4 in a{print $10}' f1 f2 | sort | uniq -c

Solution

Here's a solution using GNU awk for asorti() and delete array:

$ cat tst.awk
function prtCounts(     val,sorted,idx) {
    # Flush the counts for the key that just ended, but only
    # if that key actually appeared in F1.
    if (prev in f1) {
        print f1[prev]
        asorti(count,sorted)
        for (idx=1; idx in sorted; idx++) {
            val = sorted[idx]
            print count[val], val
        }
    }
    delete count
}

# First file: index each full line by its second field.
NR==FNR { f1[$2] = $0; next }

# Second file: whenever $4 changes, print the previous key's counts.
{
    if ( (FNR>1) && ($4!=prev) )
        prtCounts()
    count[$10]++
    prev = $4
}

END { prtCounts() }

$ gawk -f tst.awk file1 file2
a 1 b c d
1 BALH
2 BLAH
b 2 b c d
1 HAHA
2 ZAHA
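
Note that file1 and file2 in the run above are simplified test files, not the F1/F2 from the question. Run against the actual F1 and F2 shown there, the same script would print something like:

$ gawk -f tst.awk F1 F2
+ID=dnaK.p01 12121 TTGGGCAGTTGAAACCAGACGTTTCGCCCCTATTACAGAC[T]CACAACCACATGATGACCG
2 ACACAACCACATGATGACCGAATATATAGTGGAGA
1 ACACAACCACATGATGACCGAATATATAGTGGCTC

Keys 12122 and 12123 never appear in F1, so they are silently skipped by the "prev in f1" test.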

OTHER TIPS

I would process F1 first and use an associative array to save the second field as the key and the whole line as the value. Then you will only have to process F2 once: each time the fourth field changes, print the count of each distinct value seen in the tenth field, like:

awk '
    ## Process F1.
    FNR == NR {
        f1[$2] = $0
        next
    }

    ## Process F2.
    {
        if (FNR == 1 || prev_key == $4) {
            f2[$10]++
        }
        else {
            printf "%s\n", f1[prev_key]
            for (key in f2) {
                printf "%d %s\n", f2[key], key
            }
            delete f2
            f2[$10]++
        }
        prev_key = $4
    }

    END {
        printf "%s\n", f1[prev_key]
        for (key in f2) {
            printf "%d %s\n", f2[key], key
        }
    }
' F1 F2

It yields:

a 1 b c d
1 BALH
2 BLAH
b 2 b c d
2 ZAHA
1 HAHA
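
Note that a plain for (key in f2) loop visits keys in no particular order, which is why 2 ZAHA prints before 1 HAHA here. If sorted output matters and you are using GNU awk, one option (an extra tweak, not part of the answer above) is to fix the traversal order in a BEGIN block:

## GNU awk extension: make "for (key in array)" visit keys
## in ascending string order of the index.
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }

With that added at the top of the script, the loops over f2 would print in the same sorted order as the asorti() solution.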
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow