Question

I have an input file which contains the following columns:

'-'    CT    C>CCT  
'-'    TA    G>GTA  
'-'    TAT    A>ATAT  

Basically, I am trying to test whether the final n characters of the string after the arrow in column 3 are the same as the contents of column 2, where n is the difference in length between the strings after and before the arrow.

It seems that everything I've tried so far has thrown an error. I'm thinking along the following lines:

awk -F"\t" '{split($3,x,">");
{n_base=length(x[2])-length(x[1]);
ins={$x[2]: -$n_base};
if($2 == $ins) {print $0}}'

Any thoughts?

Thanks in advance.


Solution

You didn't show any sample output, so this is a guess, but it sounds like all you want is:

$ awk -F'[\t>]' '$2==substr($4,length($3)+1)' file
'-'     CT      C>CCT
'-'     TA      G>GTA
'-'     TAT     A>ATAT
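To see why that one-liner works, it may help to run it on a sample that also contains a row that should fail. The following is only a sketch: the file name `sample.tsv` and the fourth row are made up for illustration, and it assumes the columns really are tab-separated.

```shell
# Hypothetical sample: three rows from the question plus one row that should NOT match.
printf "'-'\tCT\tC>CCT\n'-'\tTA\tG>GTA\n'-'\tTAT\tA>ATAT\n'-'\tGG\tC>CCT\n" > sample.tsv

# Splitting on tab OR ">" yields four fields:
#   $1 = flag, $2 = expected insert, $3 = text before the arrow, $4 = text after it.
# substr($4, length($3) + 1) skips the first length($3) characters of $4,
# which leaves its final length($4) - length($3) characters.
result=$(awk -F'[\t>]' '$2 == substr($4, length($3) + 1)' sample.tsv)
printf '%s\n' "$result"
```

Only the first three rows are printed; the made-up `GG` row is dropped because the tail of `CCT` is `CT`.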

OTHER TIPS

I think this will do what you want:

awk -F'\t' '
    {
        split($3, parts, ">")                          # parts[2] is the text after the arrow
        fl = length(parts[2])
        check = substr(parts[2], fl - length($2) + 1)  # last length($2) characters of parts[2]
    }

    $2 == check { print }
' file
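Note that the snippet above compares the last `length($2)` characters of the text after the arrow, which is not quite the n described in the question (the length difference across the arrow). The two can disagree when more than one character precedes the arrow. A possible variant that follows the question's wording literally is sketched below; the file name `sample_n.tsv`, the variable names, and the third row are assumptions made up for illustration.

```shell
# Hypothetical sample: the last row has two characters before the arrow,
# so n = 1 and the final character "T" does not equal column 2 ("CT").
printf "'-'\tCT\tC>CCT\n'-'\tTAT\tA>ATAT\n'-'\tCT\tAA>CCT\n" > sample_n.tsv

matches=$(awk -F'\t' '
{
    split($3, parts, ">")                      # parts[1] = before arrow, parts[2] = after arrow
    n = length(parts[2]) - length(parts[1])    # number of inserted characters
    # Final n characters of the after-arrow string (empty when n <= 0).
    tail = (n > 0) ? substr(parts[2], length(parts[2]) - n + 1) : ""
}
$2 == tail
' sample_n.tsv)
printf '%s\n' "$matches"
```

With this sample, the first two rows are printed and the `AA>CCT` row is not, whereas a pure "ends with column 2" test would accept it.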
Licensed under: CC-BY-SA with attribution