Merge specific rows from two files if the number in a row of file 1 is between two numbers in a row of file 2

StackOverflow https://stackoverflow.com/questions/16522052


Question

I've been searching for a couple of hours now (two days, actually), but I can't find an answer to my problem. I've tried sed and awk, but I can't get the parameters right.

Essentially, this is what I'm looking for

FOR every line in file_1
IF [value in column 2 in file_1]
   IS EQUAL TO [value in column 4 in some row in file_2]
   OR IS EQUAL TO [value in column 5 in some row in file_2]
   OR IS BETWEEN [value in column 4 and value in column 5 in some row in file_2]
THEN
    ADD columns 3, 6 and 7 of that row of file_2 to columns 3, 4 and 5 of file_1

NB: the values that need to be compared are INTs; the values in columns 3, 6 and 7 (which only need to be copied) are STRINGs.
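
[Editor's note] A literal, unoptimised translation of that pseudocode into awk could look like the sketch below. This is an illustration rather than anything from the original post: it assumes tab-separated input, numeric begin/end values, and the column layout of the sample files further down (3 = name, 4 = begin, 5 = end, 6 and 7 = extra annotation).

awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                                  # first file on the command line: file_2
    lo[NR] = $4; hi[NR] = $5                 # interval bounds
    a[NR] = $3; b[NR] = $6; c[NR] = $7       # annotation columns to copy
    n = NR
    next
}
{                                            # second file: file_1
    for (i = 1; i <= n; i++)                 # test every interval: O(n*m), fine for small files
        if ($2 + 0 >= lo[i] + 0 && $2 + 0 <= hi[i] + 0)
            print $2, a[i], b[i], c[i]
}
' file_2 file_1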

And this is the context, but probably not necessary to read:


I have two files with genome data that I want to merge in a specific way (the columns are tab-separated):

  • The first file contains variants (only SNPs, for those interested) of which, effectively, only the second column is relevant. This column is a list of numbers (the position of each variant on the chromosome).
  • I have a structural annotation file that contains the following data:
    • Column 4 holds the begin position of a specific structure and column 5 the end position.
    • Columns 3, 7 and 9 contain information that describes the specific structure (the name of a gene etc.).

I would like to annotate the variants in the first file with the data in the annotation file. So, if the number in column 2 of the variants file is equal to the value in column 4 or 5 of a specific row, or lies between those values, then columns 3, 7 and 9 of that row of the annotation file need to be appended.


Sample File 1

SOME_NON_RELEVANT_STRING    142
SOME_NON_RELEVANT_STRING    182
SOME_NON_RELEVANT_STRING    320
SOME_NON_RELEVANT_STRING    321
SOME_NON_RELEVANT_STRING    322
SOME_NON_RELEVANT_STRING    471
SOME_NON_RELEVANT_STRING    488
SOME_NON_RELEVANT_STRING    497
SOME_NON_RELEVANT_STRING    541
SOME_NON_RELEVANT_STRING    545
SOME_NON_RELEVANT_STRING    548
SOME_NON_RELEVANT_STRING    4105
SOME_NON_RELEVANT_STRING    15879
SOME_NON_RELEVANT_STRING    26534
SOME_NON_RELEVANT_STRING    30000
SOME_NON_RELEVANT_STRING    30001
SOME_NON_RELEVANT_STRING    40001
SOME_NON_RELEVANT_STRING    44752
SOME_NON_RELEVANT_STRING    50587
SOME_NON_RELEVANT_STRING    87512
SOME_NON_RELEVANT_STRING    96541
SOME_NON_RELEVANT_STRING    99541
SOME_NON_RELEVANT_STRING    99871

Sample File 2

SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A1  0   38  B1  C1
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A2  40  2100    B2  C2
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A3  2101    9999    B3  C3
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A4  10000   15000   B4  C4
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A5  15001   30000   B5  C5
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A6  30001   40000   B6  C6
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A7  40001   50001   B7  C7
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A8  50001   50587   B8  C8
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A9  50588   83054   B9  C9
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A10 83055   98421   B10 C10
SOME_NON_RELEVANT_STRING    SOME_NON_RELEVANT_STRING    A11 98422   99999   B11 C11

Sample output file

142 A2  B2  C2
182 A2  B2  C2
320 A2  B2  C2
321 A2  B2  C2
322 A2  B2  C2
471 A2  B2  C2
488 A2  B2  C2
497 A2  B2  C2
541 A2  B2  C2
545 A2  B2  C2
548 A2  B2  C2
4105    A3  B3  C3
15879   A5  B5  C5
26534   A5  B5  C5
30000   A5  B5  C5
30001   A6  B6  C6
40001   A7  B7  C7
44752   A7  B7  C7
50587   A8  B8  C8
87512   A10 B10 C10
96541   A10 B10 C10
99541   A11 B11 C11
99871   A11 B11 C11


Solution

As a start, here's how to write the algorithm you posted in awk, assuming that by "ADD" you meant "append" and that all lines in file1 have unique values in the 2nd field (run against the sample input provided):

awk '
BEGIN{ FS=OFS="\t"; startIdx=1 }
NR==FNR {                                   # first pass: file1 (the positions)
    if ($2 in seen) {                       # warn about duplicate positions on stderr
         printf "%s on line %d, first seen on line %d\n", $2, NR, seen[$2] | "cat>&2"
    }
    else {
         f2s[++endIdx] = $2                 # store the positions in input order
         seen[$2] = NR
    }
    next
}
{                                           # second pass: file2 (the intervals)
    inBounds = 1
    for (idx=startIdx; (idx<=endIdx) && inBounds; idx++) {
        f2 = f2s[idx]
        if (f2 >= $4) {
            if (f2 <= $5) {                 # $4 <= position <= $5: inside this interval
                print f2, $3, $6, $7
            }
            else {                          # position is past this interval: stop scanning
                inBounds = 0
            }
        }
        else {                              # position precedes this interval: slide the
            startIdx = idx                  # window forward for later intervals
        }
    }
}
' file1 file2
142     A2      B2      C2
182     A2      B2      C2
320     A2      B2      C2
321     A2      B2      C2
322     A2      B2      C2
471     A2      B2      C2
488     A2      B2      C2
497     A2      B2      C2
541     A2      B2      C2
545     A2      B2      C2
548     A2      B2      C2
4105    A3      B3      C3
15879   A5      B5      C5
26534   A5      B5      C5
30000   A5      B5      C5
30001   A6      B6      C6
40001   A7      B7      C7
44752   A7      B7      C7
50587   A8      B8      C8
87512   A10     B10     C10
96541   A10     B10     C10
99541   A11     B11     C11
99871   A11     B11     C11
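
[Editor's note] One assumption baked into that script is worth making explicit: the startIdx sliding window only skips work correctly when the positions in file1 and the begin/end values in file2 are both sorted in ascending order, as they are in the samples. If that ordering isn't guaranteed, a quick pre-check along these lines will flag the problem before you trust the output:

awk -F'\t' '
$2 + 0 < prev { printf "file1 out of order at line %d: %s\n", NR, $2; exit 1 }
{ prev = $2 + 0 }
' file1
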
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow