Question

i have a list of query and hits gi in one file (file1) . i have another file in which complete name of hits is there(file2), now i want to replace Hits gi from file1 to file2 that have the complete Hit name. i want that gi must be replace with the same gi in front of it's each corresponding Query.

file1

 1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
 2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ 
 3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_  
 4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ 
 5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_ 

file2

1  >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2  >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3  >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4  >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5  >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]

desired output:

1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
Was it helpful?

Solution

The solution is described stepwise;

  1. Extract only Hit GIs from file1;

    cat file1 | awk '{print $3}' | sed 's/Hit=//g' > file1-gi
    
  2. Remove # > from file 2.;

    sed 's/^....//g' file2 > file2_1
    
  3. Remove redundancy in file2, if any;

    cat file2_1 | sort $1 | uniq > file2_2
    
  4. Use system command to grep the names of corresponding GIs;

    cat file1-gi | awk '{system ("grep "$1" file2_2")}' >> file1-gi-name
    
  5. Printing starting 3 columns of file1;

    cut -d" " -f-3 file1 > file1_1
    
  6. Paste two files;

    paste file1_1 file1-gi-name > output
    

OTHER TIPS

One of the fastest (and dummy) solution is to use the search method from pyhton - re to match the patterns in your string. I wrote an example of how you could do it (you must do some checks to see if the results are right...):

import re

file2 = open(f2path, "r")
file1 = open(f1path, "r")
file3 = open(f3path, "w")
namesD = dict()

for lineO in file2:
    strH = re.search(" ", line0)
    idN = line0[1:strH.begin()]
    namesD[idN] = line0[strH.end():]

for lineO in file1:
    strH = re.search("Hit=", line0)
    idN = line0[strH.end():].strip().replace(' ', '_')
    if namesD[idN] : 
        file3.write("Hit=" + idN + namesD[idN])

The idea is to first extract the ids and their names from file 2 and add them in a dict (the id is the key, the name is the value) and afterwards you should read line by line the first file and extract the ID from the hit and try to match it in the dict. If they match, you can write the result in a 3rd file... or do whatever you want with it

If I run:

file1=file1.txt; file2=$(cat file2.txt|sed -e "s/>gi/Query=gi/g"|sed -e "s/_ref_/ ref_/g");IFS='\n';echo $file2| awk  'NR==FNR { _[$2]=$2;  f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 } NR!=FNR { if(_[$2] != "") print $0" "f1_line[key]}'  - $file1 

To explain what it is doing as a script, usage described below, I set the file as file1.rasta in the script so it would require input from me:

./run.sh 
-------------------------------------------------------------------------------
No variables defined settings files as:
fil1=file1.rasta
file2=file2.rasta
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
One of the following: 
 file1=file1.rasta
file2=file2.rasta
 does not exist!
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
usage:
./run.sh file1.fasta file2.fasta is the same as line below
./run.sh ./file1.fasta ./file2.fasta
-- This is if files are elsewhere
./run.sh /path/to/file1.fasta /path/to/file2.fasta
-------------------------------------------------------------------------------

Running it:

./run.sh ./file1.fasta ./file2.fasta 
1  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_ hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
2  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
3  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_   hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
4  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 
5  Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_  hypothetical protei TBFG_10059 [Mycobacterium tuberculosis F11] 

bash script run.sh which is the 1 liner above but broken down with explanation:

#!/bin/bash

 function line() { 
  echo  -e "-------------------------------------------------------------------------------"
 }

 function usage() { 
  line;
  echo "usage:"
  echo $0 file1.fasta file2.fasta is the same as  line below
  echo $0 ./file1.fasta ./file2.fasta
  echo -- This is if files are elsewhere
  echo $0 /path/to/file1.fasta /path/to/file2.fasta
  line;
 } 


 file1=$1;
 file2=$2;

 if [ $# -lt 2 ]; then 
    # Set file1 variable as filename file1.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file1=/path/to/file1.fasta
    file1=file1.rasta; 


    # Set file2 variable as filename file2.fasta 
    # ensure this file exists in current path
    # otherwise:
    # file2=/path/to/file1.fasta

    file2=file2.rasta;
    line;
    echo -e "No variables defined settings files as:\nfil1=$file1\nfile2=$file2";
    line;
fi
 # Check we have both files whether its variables or if not variables
 # matches defined files
 if  [ ! -f  $file1 ]  || [ ! -f  $file2 ]; then
  line;
  echo -e "One of the following: \n file1=$file1\nfile2=$file2\n does not exist!"
  line
  usage
  exit 2;
 fi


 # Define file 2 variable which cats file2.fasta again like above ensure 
 # the file2.fasta can be catted from this path, it pipes it into sed and changes:
 # '>gi' to 'Query=gi' and also changes '_ref_'  to ' ref_'
 # this now matches the same pattern as file1

 cfile2=$(sed -e "s/>gi/Query=gi/g" -e "s/_ref_/ ref_/g" $file2);

 # Set the internal field separator to \n which is the output of variable file2 
 IFS='\n';

  # debug enable this if you now want to see manipulated file2
  # echo $cfile2

 # Echo out cfile2 which now with the above ifs makes it like the file 
 #  formatting making \n the separator - pipe into awk command which 
 # matches against both files
 # Set up a key whilst in one which contains pattern match after:
 # .{number}_{space}* where this is what separates file2's content where tag starts.
 # If the values from $2 match on both lines print out $0 which is everything from file1 
 # plus the key which contains the details
 # the echo $cfile2  is then represented as - before $file1  at the end in effect its the first file value which is the call to file1 

echo $cfile2| awk 'NR==FNR { 
  _[$2]=$2; 
  if( match($0, /\.[0-9]\_ /)) { 
    var1=substr($0, RSTART+3);  
   }
  } 
  NR!=FNR { 
     if(_[$2] != "") print $0" "var1
  }' - $file1

## Method used originally - updated to above which is much cleaner
## pattern matches and then from that point it captures entire string which would 
## ensure it captures the entire tag from file2
 ##echo $cfile2| awk 'NR==FNR { 
 ## _[$2]=$2; 
 ## f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10 
 ## } 
 ## NR!=FNR { 
 ##    if(_[$2] != "") print $0" "f1_line[key]
 ## }' - $file1
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top