Question

I would like to remove the fourth occurrence of the character ":" and everything that follows it, in any field that contains it. See the example:

Input:

1 10975     A C    1/1:137,105:245:99:1007,102,0   0/1:219,27:248:20:222,0,20 
1 19938     T TA   ./.                             1/1:0,167:167:99:4432,422,0,12,12
12 20043112 C G    1/2:3,5,0:15:92                 2/2:3,15:20:8

Expected output:

1 10975     A C    1/1:137,105:245:99   0/1:219,27:248:20 
1 19938     T TA   ./.                  1/1:0,167:167:99
12 20043112 C G    1/2:3,5,0:15:92      2/2:3,15:20:8

So basically, in any field that contains ":", the fourth occurrence and everything after it should be removed. Note that nothing changes in the third line, because ":" appears only three times in each of its fields. I found a partial solution, but it works only for the first line and not for the second, which has more commas ",".

Incomplete Solution:

sed 's/:[0-9]*,[0-9]*,[0-9]*//g'

Thanks in advance


Solution 3

On fields 5 through to the last field, this will remove the fourth occurrence of the regexp :[^:]+

< file.txt awk '{ for (i=5; i<=NF; i++) $i = gensub(/:[^:]+/, "", 4, $i) }1' | column -t

On fields 5 through to the last field, this will remove everything after the fourth :

< file awk '{ for (i=5; i<=NF; i++) $i = gensub(/((:[^:]+){3}).*/, "\\1", 1, $i) }1' | column -t

Explanation:

Upon re-reading your question, the second solution is probably what you're looking for. The first solution looks for a colon followed by one or more characters that are not a colon and removes them. The third argument to gensub() describes which match of the regexp to replace, so a 4 tells gensub() to remove the fourth match of the pattern only; if a field had a fifth colon, that part would be left in place.

The second solution looks for three consecutive matches of the regexp used in the first solution, followed by the rest of the field. At this point it's worth mentioning that gensub() provides an additional feature that is not available using sub() or gsub(): the ability to refer to components of a regexp in the replacement text, much like how other languages use parentheses to perform capturing. Here "\\1" stands for the text matched by the first parenthesized group, that is, the three colon-prefixed parts that are kept, while whatever the trailing .* matched (the fourth colon and everything after it) is dropped. gensub() is a very powerful function that is only available in GNU awk. The description and example provided here are very useful. HTH.

Results:

1   10975     A  C   1/1:137,105:245:99  0/1:219,27:248:20
1   19938     T  TA  ./.                 1/1:0,167:167:99
12  20043112  C  G   1/2:3,5,0:15:92     2/2:3,15:20:8
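
Since gensub() is GNU awk only, the same truncation can be sketched with split(), which any POSIX awk provides. This is a minimal sketch rather than a drop-in replacement: the function name trim_fields is hypothetical, and it assumes, like the commands above, that the fields to trim start at the fifth column.

```shell
# Hypothetical helper: keep at most the first four ":"-separated parts of
# every field from the fifth onward, using only POSIX awk features.
trim_fields() {
  awk '{
    for (i = 5; i <= NF; i++) {
      n = split($i, part, ":")     # n - 1 is the number of colons in the field
      if (n > 4)                   # four or more colons: drop the tail
        $i = part[1] ":" part[2] ":" part[3] ":" part[4]
    }
    $1 = $1                        # force the record to be rebuilt with single spaces
    print
  }' "$@"
}

# Example run on two of the sample lines from the question.
trim_fields <<'EOF'
1 10975 A C 1/1:137,105:245:99:1007,102,0 0/1:219,27:248:20:222,0,20
12 20043112 C G 1/2:3,5,0:15:92 2/2:3,15:20:8
EOF
```

As with the gawk versions, the output is separated by single spaces, so it can be piped through column -t to realign the columns. Fields with three or fewer colons, such as ./., are left untouched.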

OTHER TIPS

Sed:

sed -r 's/((:[^: \t]*){3}):[^ \t]*/\1/g' file | column -t

Perl:

perl -pe 's/((:[^:\s]*){3}):\S*/$1/g' file | column -t

Using sed

sed -r 's/((:[^: ]*){3}):[^ ]*/\1/g' file

Output:

1 10975     A C    1/1:137,105:245:99   0/1:219,27:248:20 
1 19938     T TA   ./.                             1/1:0,167:167:99
12 20043112 C G    1/2:3,5,0:15:92                 2/2:3,15:20:8
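
A possible variant of the sed command, sketched under two assumptions: that your sed accepts -E (the spelling of the extended-regexp switch shared by GNU and BSD sed, where -r is the older GNU one), and that fields are separated by blanks or tabs. It uses the POSIX class [:space:] instead of \t and excludes ":" inside the repeated group, so a field with more than four colons is still cut right at the fourth one. The wrapper name trim_sed is hypothetical.

```shell
# Hypothetical wrapper: cut every whitespace-delimited field at its fourth ":".
# Group 1 is the first three ":"-prefixed parts; the rest of the field,
# starting at the fourth ":", is matched and discarded.
trim_sed() {
  sed -E 's/((:[^:[:space:]]*){3}):[^[:space:]]*/\1/g' "$@"
}

# Example run; the last field has five colons and is still cut at the fourth.
trim_sed <<'EOF'
12 20043112 C G    1/2:3,5,0:15:92   2/2:3,15:20:8:9,9:7
EOF
```

Unlike the awk approach, this leaves the original spacing between fields as it is, so the columns may look ragged until the output is passed through column -t.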

Using perl

perl -pe 's/((:[^:\s]*){3}):\S*/$1/g' file
perl -lane 's/(.*?:.*?:.*?:.*?):.*/$1/ for @F; print "@F"' your_file
Licensed under: CC-BY-SA with attribution