Symbol Replacement in files based on matching strings

https://stackoverflow.com/questions/23513630

regex
awk

16-07-2023

Question

With the help of members of this group, I have managed to write a simple awk script that matches the first column of "subfile" (Approved Symbol) against both columns of "file" and replaces unmatched elements in "file" with "NA".

Besides the first column, I also need to match against the other two columns of "subfile" (Previous Symbols and Synonyms).

Overall, the problem is straightforward: if any element in "file" matches any element in any of the three columns of "subfile", the matched element in "file" should be replaced by the first-column element (i.e. the Approved Symbol) from the same row of "subfile".

The script I have written:

awk 'FNR==NR {a[$1]=$1;next}
{
for (i=1;i<=NF;i++)
{
$i = ($i in a) ? a[$i] : "NA"
}
}
1' subfile file

subfile

Approved Symbol     Previous Symbols       Synonyms
A1BG
A1CF                                       ACF, ASP, ACF64, ACF65, APOBEC1CF
A2ML1               CPAMD9                 FLJ25179
AAAS    
AAR2                C20orf4                bA234K24.2
MAP2K4              SERK1                  MEK4, JNKK1, PRKMK4, MKK4  
FLNC                FLN2                   ABP-280, ABPL
MYPN                                       MYOP
ACTN2

file

MAP2K4  FLNC
MYPN    ACTN2
EIF2C2  MIRLET7B
EIF2C2  MIRLET7I

Any suggestions, please?

Solution

I realize you are looking for an awk solution, but your question struck me as one that could benefit from Python dictionaries. Below is a Python script that does what you describe: it matches every element of file against the entries in subfile and outputs the corresponding Approved Symbol from subfile, or NA if there is no match.

Please note that this is written for Python 3.x, but it is not hard to adapt for Python 2.x.

# Build dictionary mapping every approved symbol, previous symbol, and synonym
# to its approved symbol
approved_symbols = {}
with open("subfile") as subfile:
  subfile.readline() # skip header line
  for line in subfile:
    # treat commas as separators too, then split into elements on whitespace
    elements = line.replace(',', ' ').split()
    if not elements:
      continue # skip blank lines
    approved = elements[0]

    # Add each element (including the approved symbol itself) to dictionary
    for syn in elements:
      approved_symbols[syn] = approved

# Process file
with open("file") as file:
  for line in file:
    # If symbol found, output its approved form, otherwise output "NA";
    # joining with tabs avoids a trailing separator.
    print('\t'.join(approved_symbols.get(element, "NA") for element in line.split()))

Output:

MAP2K4 FLNC
MYPN   ACTN2
NA     NA
NA     NA
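
One case the sample data does not exercise is a symbol that is an Approved Symbol in one row and also listed as a previous symbol or synonym in another row. The following is a minimal sketch, not part of the original answer, of one way to handle it, assuming approved symbols should always map to themselves. The function names `build_lookup` and `replace_symbols` and the in-memory sample lines are hypothetical, chosen only for illustration.

```python
def build_lookup(subfile_lines):
    """Map every approved symbol, previous symbol, and synonym to an approved symbol."""
    rows = []
    for line in subfile_lines[1:]:  # skip the header line
        # Treat commas as separators too, so "ACF, ASP" yields ["ACF", "ASP"]
        tokens = line.replace(",", " ").split()
        if tokens:  # ignore blank lines
            rows.append(tokens)

    lookup = {}
    # First pass: aliases (previous symbols and synonyms); first occurrence wins
    for tokens in rows:
        for alias in tokens[1:]:
            lookup.setdefault(alias, tokens[0])
    # Second pass: approved symbols map to themselves and override any alias
    for tokens in rows:
        lookup[tokens[0]] = tokens[0]
    return lookup


def replace_symbols(file_lines, lookup, missing="NA"):
    """Return each line with its symbols replaced by their approved form."""
    return [
        "\t".join(lookup.get(symbol, missing) for symbol in line.split())
        for line in file_lines
    ]


# Hypothetical in-memory data, shaped like a few rows of subfile and file
subfile_lines = [
    "Approved Symbol     Previous Symbols       Synonyms",
    "A2ML1               CPAMD9                 FLJ25179",
    "MAP2K4              SERK1                  MEK4, JNKK1, PRKMK4, MKK4",
    "FLNC                FLN2                   ABP-280, ABPL",
]
file_lines = ["MEK4  FLN2", "CPAMD9  EIF2C2"]

lookup = build_lookup(subfile_lines)
for line in replace_symbols(file_lines, lookup):
    print(line)
```

To run this against the real files, the lines could be read with something like `open("subfile").read().splitlines()`. Whether approved symbols should win over aliases, and which row wins when an alias appears twice, depends on the data, so the priority rules here are an assumption to adjust as needed.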

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow