Bash/Awk script for appending columns to single file based on searching multiple outside files

StackOverflow https://stackoverflow.com/questions/23663057

  •  22-07-2023

Question

I am trying to write a script that will take two files as input:

1) An annotated, tab-delimited file ("inFile") and
2) a file of variable length containing other annotated, tab-delimited files (identical formatting) to search with set_ids for each...

file1  set1
file2  set2
file3  set3

I want to output inFile, but with columns appended indicating whether each line of inFile is found in each of the sets to be searched.

This is my code so far

#!/bin/bash

inFile=$1
inSets=$2

set_filter () {
   set_name=$3
   awk -F"\t" ' BEGIN {OFS="\t"};
      {
         FNR == NR
            {
               idx=($1"."$2"."$3)
               keys[$idx]=$set_name
               next
            }
         {
            idx=($1"."$2"."$3)
            print $0, keys[$idx]
         }
      } ' $2 $1
   }

IFS=$'\n'
for line in $(cat $inSets); do

   set_file=$(echo $line | cut -f 1)
   set_id=$(echo $line | cut -f 2)

   ??? set_filter $inFile $set_file $set_id

done

My basic idea is to define a function that will perform the lookup for a single file and use that in a loop over all of the files to be searched, adding a column with each iteration. I'm having trouble with the loop, however, and was hoping somebody could point me in the right direction. Thanks!

EDIT

The annotated files look like

# inFile:
day  start  stop
1    100    102
1    300    350
2    100    200
3    200    400

So I'm looking for instances (rows) where the same day.start.stop appears in one of the sets being searched. If set1 is:

day  start  stop
1    100    102
1    700    750
2    800    900
3    900    950

and set 2 is:

day  start  stop
3    200    400
1    100    102
2    800    880
1    300    350

Then the output should look like:

day  start  stop
1    100    102  set1  set2
1    300    350        set2 
2    100    200
3    200    400        set2

Solution

Here is one way using awk:

awk '
FILENAME != "infile" {
    line[FILENAME,$0] = FILENAME
    next
}
FNR > 1 {
    printf "%s", $0
    for (x in line) {
        split (x, t, SUBSEP)
        if (t[2] == $0) {
            sep = FS
            printf "%s%s", sep, line[x]
        }
    }
    print "";
    next
}1' set1 set2 infile

Output:

day  start  stop
1    100    102 set2 set1 
1    300    350 set2 
2    100    200 
3    200    400 set2 

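You can keep adding sets; just make sure infile is the last argument. The matching set names are appended in whatever order awk's for (x in line) loop visits the array, which is why set2 comes before set1 in the output above and why the names are not aligned into one column per set.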
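
If the sets should instead come from the list file described in the question (one "file<TAB>set_id" pair per line) and each set should get its own aligned column, something along these lines might work. This is only a sketch: annotate is a hypothetical function name, and it assumes that every file is tab-delimited, that the first three columns (day, start, stop) form the key, and that inFile has a single header line.

```shell
#!/bin/bash
# Sketch only: annotate is a hypothetical helper, not part of the answer above.
# $1 = inFile, $2 = list file with "set_file<TAB>set_id" on each line.
annotate () {
    awk -F'\t' -v OFS='\t' -v list="$2" '
    BEGIN {
        # Read the list file: column 1 is a set file, column 2 its set_id.
        while ((getline entry < list) > 0) {
            split(entry, part, FS)
            id[++n] = part[2]
            # Remember every day/start/stop key found in this set file.
            while ((getline row < part[1]) > 0) {
                split(row, col, FS)
                seen[n, col[1], col[2], col[3]] = 1
            }
            close(part[1])
        }
        close(list)
    }
    FNR == 1 { print; next }          # pass the header through unchanged
    {
        out = $0
        for (i = 1; i <= n; i++)      # one column per set, in list order
            out = out OFS (((i, $1, $2, $3) in seen) ? id[i] : "")
        print out
    }' "$1"
}
```

It could be called as annotate inFile inSets > annotated.tsv (the output name is arbitrary). Every data row gets exactly one extra field per set, left empty when there is no match, so the set columns stay aligned.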

OTHER TIPS

Here's another all awk answer. Create the following executable awk file:

#!/usr/bin/awk -f

BEGIN {DELIM=","; OFS="\t"} # DELIM should just be different than FS/data

# reformat input, set up some arrays
NR==FNR {
    line = $1 OFS $2 OFS $3   # replace with $0 if first file is tab delimited
    key = $1 SUBSEP $2 SUBSEP $3   # day, start and stop together identify a row
    if(FNR==1) header=line
    else { a[key]=line; order[FNR-1]=key; cnt++ }
    next
}

FILENAME!=last_filename { f[FILENAME]=++fcnt; last_filename=FILENAME }

{ key = $1 SUBSEP $2 SUBSEP $3 }
key in a { a[key]=a[key] DELIM FILENAME }

# loop over lines in input file, adjusting formatting of lines in a[] with f[]  
END {
    print header
    for(i=1;i<=cnt;i++) { 
        split(a[order[i]], oarr, DELIM)
        printf( "%s", oarr[1] )
        k=2
        for(j=1;j<=fcnt;j++) {
            fname=oarr[k]
            if( f[fname]==j ) {o=fname; k++}
            else o=""
            printf( "%s%s", OFS, o )
        }
        print ""
    }
}

When put into an executable file called awko, it can be run like awko infile set* and produces:

day     start   stop
1       100     102     set1    set2
1       300     350             set2
2       100     200             
3       200     400             set2

The generic breakdown:

  • store the first file in some arrays, variables
  • create an array of files being tested in argument order - used for alignment
  • append any matched file names to the matched line in a[]
  • at the end, print out each line in a[] in order, reformatting to align matches

The line variable exists because the data in the question lost its tabs in translation.
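
To try either answer on real tab-delimited input, the sample data from the question can be recreated with printf, which expands \t into actual tabs. This is a minimal sketch: make_samples is a hypothetical helper name, and the file names infile, set1 and set2 simply follow the ones used in the answers.

```shell
# Sketch only: make_samples is a hypothetical helper, not from the answers.
# It writes the question's sample data into directory $1 with real tabs.
make_samples () {
    printf 'day\tstart\tstop\n1\t100\t102\n1\t300\t350\n2\t100\t200\n3\t200\t400\n' > "$1/infile"
    printf 'day\tstart\tstop\n1\t100\t102\n1\t700\t750\n2\t800\t900\n3\t900\t950\n' > "$1/set1"
    printf 'day\tstart\tstop\n3\t200\t400\n1\t100\t102\n2\t800\t880\n1\t300\t350\n' > "$1/set2"
}
```

Running make_samples . in an empty directory and then awko infile set1 set2 should exercise the script on tab-separated input.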

Licensed under: CC-BY-SA with attribution