Question
I have an input file which contains the following columns:
'-' CT C>CCT
'-' TA G>GTA
'-' TAT A>ATAT
Basically, I am trying to test whether the final n characters after the arrow in column 3 are the same as the contents of column 2, where n is the difference in length between the letters before and after the arrow.
It seems that everything I've tried so far has thrown an error. I'm thinking along the following lines:
awk -F"\t" '{split($3,x,">");
{n_base=length(x[2])-length(x[1]);
ins={$x[2]: -$n_base};
if($2 == $ins) {print $0}}'
Any thoughts?
Thanks in advance.
Solution
You didn't show any sample output, so this is a guess, but it SOUNDS like all you want is:
$ awk -F'[\t>]' '$2==substr($4,length($3)+1)' file
'-' CT C>CCT
'-' TA G>GTA
'-' TAT A>ATAT
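For anyone unsure why that one-liner works, here is a minimal sketch of how it splits each record. It assumes the input really is tab-separated, and the fourth record (`GG`, `C>CTA`) is an invented non-matching line added purely for illustration; it is not from the question.

```shell
# Three records from the question plus one invented non-matching record (C>CTA).
sample=$(printf "'-'\tCT\tC>CCT\n'-'\tTA\tG>GTA\n'-'\tTAT\tA>ATAT\n'-'\tGG\tC>CTA\n")

# -F'[\t>]' splits on a tab OR on '>', so for the first record:
#   $2 = CT   $3 = C (text before the arrow)   $4 = CCT (text after the arrow)
# substr($4, length($3) + 1) skips the first length($3) characters of $4,
# leaving the final n characters, where n = length($4) - length($3).
matches=$(printf '%s\n' "$sample" | awk -F'[\t>]' '$2 == substr($4, length($3) + 1)')

# Only the three records from the question should be printed.
printf '%s\n' "$matches"
```

Because a pattern with no action defaults to printing the record, no explicit `{print}` is needed.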
OTHER TIPS
I think this will do what you want:
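awk -F'\t' '
{
    split($3, parts, ">")    # parts[1] = before the arrow, parts[2] = after it
    fl = length(parts[2])
    # last length($2) characters of the text after the arrow
    check = substr(parts[2], fl - length($2) + 1)
}
$2 == check {print}
' file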
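One thing worth noting: this tip compares column 2 with the last length($2) characters after the arrow, whereas the question defines n as the length difference across the arrow. The two checks agree on the question's sample data but can differ elsewhere. A small sketch, using an invented record that is not from the question and assuming tab-separated input:

```shell
# Invented record: column 2 (T) is a suffix of CCT, but n = 3 - 1 = 2,
# so the final n characters after the arrow are CT, not T.
record=$(printf "'-'\tT\tC>CCT\n")

# Check from the Solution section: drop the first length(before) characters.
strict=$(printf '%s\n' "$record" | awk -F'[\t>]' '$2 == substr($4, length($3) + 1)')

# Check from this tip: take the last length($2) characters after the arrow.
suffix=$(printf '%s\n' "$record" | awk -F'\t' '
{ split($3, parts, ">"); fl = length(parts[2]) }
$2 == substr(parts[2], fl - length($2) + 1)')

# Brackets make an empty result visible.
printf 'strict: [%s]\nsuffix: [%s]\n' "$strict" "$suffix"
```

Here the strict check prints nothing while the suffix check prints the record, so pick whichever matches what you actually mean by "final n characters".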