Question

Let's say I have a tab-delimited file, lookup.txt:

070-031 070-291 030-031
1   2   X
2   3   1
3   4   2
4   5   3
5   6   4
6   7   5
7   8   6
8   9   7

And I have the following files containing the values to look up:

$ cat 030-031.txt
Line1   070-291 4
Line2   070-031 3

$ cat 070-031.txt
Line1   030-031 5
Line2   070-291 8

I would like script.awk to return:

$ script.awk 030-031.txt lookup.txt
Line1   070-291 4   2
Line2   070-031 3   2

and

$ script.awk 070-031.txt lookup.txt
Line1   030-031 5   6
Line2   070-291 8   7

The only thing I can think to do is to create a separate, expanded lookup file for each data file, e.g.

$ cat lookup_030-031.txt
070-031:1   X
070-031:2   1
070-031:3   2
070-031:4   3
070-031:5   4
070-031:6   5
070-031:7   6
070-031:8   7
070-291:2   X
070-291:3   1
070-291:4   2
070-291:5   3
070-291:6   4
070-291:7   5
070-291:8   6
070-291:9   7

and then

awk 'NR==FNR { a[$1]=$2;next}{print $0,a[$2":"$3]}' lookup_030-031.txt 030-031.txt

This works, but I have many more columns and approximately 10000 rows, so I'd rather not have to generate a lookup file for each. Many thanks.

AMENDED

Glenn Jackman's answer is a perfect solution to the initial question, and his second answer is more efficient. However, I forgot to stipulate that the script should handle duplicates. For instance, it should be able to handle:

$ cat 030-031
070-031 3
070-031 6

and return BOTH corresponding numbers for the respective file (2 and 5 respectively). Only Glenn's first answer handles repeated lookups; his second returns only the last value found for a repeated label.


Solution

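OK, I see now. You have to read the lookup file into one big data structure; after that, looking up values from the individual files is easy. (The data files used here have two columns, label and value, without the leading Line1/Line2 field.)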

$ cat script.awk 
BEGIN {OFS = "\t"}
NR==1 {
    for (i=1; i<=NF; i++) 
        label[i] = $i
    next
}
NR==FNR {
    for (i=1; i<=NF; i++) 
        for (j=1; j<=NF; j++) 
            if (i != j) 
                value[label[i],$i,label[j]] = $j
    next
}
FNR==1 {
    split(FILENAME, a, /\./)
    j = a[1]
}
{
    $(NF+1) = value[$1,$2,j]
    print
}

$ awk -f script.awk lookup.txt 030-031.txt
070-291 4   2
070-031 3   2

$ awk -f script.awk lookup.txt 070-031.txt 
030-031 5   6
070-291 8   7

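This version is a bit more compact, and takes the filenames in your preferred order. Note that src is keyed by label, so a label repeated in the data file keeps only its last value:

$ cat script.awk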
BEGIN {OFS = "\t"}

NR==1 {
    split(FILENAME, a, /\./)
    dest = a[1]
}
NR==FNR {
    src[$1]=$2
    next
}
FNR==1 {
    for (i=1; i<=NF; i++)
        col[$i]=i
    next
}

{
    for (from in src)
        if ($col[from] == src[from])
            print from, src[from], $col[dest]
}

$ awk -f script.awk 030-031.txt lookup.txt
070-031 3   2
070-291 4   2

$ awk -f script.awk 070-031.txt lookup.txt
030-031 5   6
070-291 8   7
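
If the data files can repeat a label, as in the amended question, one option is to keep every data row in input order instead of keying by label. What follows is only a sketch, not part of the original answer: the file name dup.awk and the arrays from, want and value are made up for illustration. It assumes two-column data files (label, value) and that each value occurs only once in its lookup column.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample lookup table from the question (tab-delimited, header row first).
printf '%s\t%s\t%s\n' \
    070-031 070-291 030-031 \
    1 2 X \
    2 3 1 \
    3 4 2 \
    4 5 3 \
    5 6 4 \
    6 7 5 \
    7 8 6 \
    8 9 7 > lookup.txt

# Data file with a repeated label, as in the amended question.
printf '%s\t%s\n' 070-031 3 070-031 6 > 030-031.txt

# dup.awk is a hypothetical name; the script is a variant of the second answer.
cat > dup.awk <<'EOF'
BEGIN { OFS = "\t" }

# First file (the data file): its base name is the column we want back.
NR == 1 { split(FILENAME, a, /\./); dest = a[1] }

# Keep every data row in input order, so repeated labels are not overwritten.
NR == FNR { n++; from[n] = $1; want[n] = $2; next }

# Header of lookup.txt: remember each column's label and position.
FNR == 1 {
    for (i = 1; i <= NF; i++) {
        label[i] = $i
        col[$i] = i
    }
    next
}

# Remaining lookup rows: index the destination value by (label, value).
{
    for (i = 1; i <= NF; i++)
        if (i != col[dest])
            value[label[i], $i] = $(col[dest])
}

# Replay the data rows in their original order with the looked-up value appended.
END {
    for (k = 1; k <= n; k++)
        print from[k], want[k], value[from[k], want[k]]
}
EOF

out=$(awk -f dup.awk 030-031.txt lookup.txt)
printf '%s\n' "$out"
```

With the sample data this prints one line per data row, in input order: 2 for the value 3 and 5 for the value 6. The trade-off is that the output only appears once the whole lookup file has been read.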

OTHER TIPS

"This works but I have many more columns and approximately 10000 rows, so I'd rather not have to generate a lookup file for each."

Your dataset is small enough that you can keep the lookups in memory.

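In a BEGIN section, read "lookup.txt" into a two-dimensional (nested) array so that:

lookup["070-031"][4] = 3
lookup["070-291"][5] = 3

(True nested arrays need gawk 4.0 or later; in other awks, a composite key such as lookup["070-031", 4] does the same job.)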

Then run through all the data files at once:

script.awk 070-031.txt 070-291.txt
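
A rough sketch of that idea, using composite keys so that it does not depend on gawk's nested arrays. The names all.awk, rowof, cell and the "NA" placeholder are invented for illustration; the sketch assumes lookup.txt sits in the current directory, that the label and value are the last two fields of each data line, and that each value occurs only once in its lookup column.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample lookup table and data files from the question (tab-delimited).
printf '%s\t%s\t%s\n' \
    070-031 070-291 030-031 \
    1 2 X  2 3 1  3 4 2  4 5 3 \
    5 6 4  6 7 5  7 8 6  8 9 7 > lookup.txt
printf '%s\t%s\t%s\n' Line1 070-291 4 Line2 070-031 3 > 030-031.txt
printf '%s\t%s\t%s\n' Line1 030-031 5 Line2 070-291 8 > 070-031.txt

# all.awk is a hypothetical name for the combined script.
cat > all.awk <<'EOF'
BEGIN {
    OFS = "\t"
    file = "lookup.txt"            # assumed location of the lookup table
    if ((getline line < file) <= 0)
        exit 1                     # lookup table missing or empty
    ncol = split(line, label, "\t")
    while ((getline line < file) > 0) {
        r++
        split(line, f, "\t")
        for (i = 1; i <= ncol; i++) {
            rowof[label[i], f[i]] = r   # (column label, value) -> row number
            cell[r, label[i]] = f[i]    # (row number, column label) -> value
        }
    }
    close(file)
}

# Each data file names its own destination column.
FNR == 1 { split(FILENAME, a, /\./); dest = a[1] }

{
    res = "NA"                     # placeholder when the pair is not in the table
    if (($(NF-1), $NF) in rowof)
        res = cell[rowof[$(NF-1), $NF], dest]
    print $0, res
}
EOF

out=$(awk -f all.awk 030-031.txt 070-031.txt)
printf '%s\n' "$out"
```

Storing a row number per (label, value) pair keeps memory at roughly two entries per lookup cell, rather than one entry per pair of columns, and every data line is looked up independently, so repeated labels are not a problem.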
Licensed under: CC-BY-SA with attribution