Question

I am trying to remove the three-letter code at the end of a sequence (replacing it with nothing) using sed, but it is not working when I combine multiple regex patterns. Here are some example sequences:

GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA

When I use each regex individually with sed, it works:

echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG" | sed 's/TAG$//'
echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA" | sed 's/TAA$//'
echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA" | sed 's/TGA$//'

However, when I try to combine the patterns into one expression, it doesn't work:

echo "GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG" |
sed 's/(TAG$|TAA$|TGA$)//'

Could somebody point out what I am doing wrong?

Solution

You need to use the extended regex switch in sed. With GNU sed:

sed -r 's/(TAG|TAA|TGA)$//'

Or on OS X (BSD sed):

sed -E 's/(TAG|TAA|TGA)$//'

Or use this form without extended regex (it relies on the GNU \| extension, so it doesn't work on OS X):

sed 's/\(TAG\|TAA\|TGA\)$//'
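
As a rough sketch of how the extended-regex form behaves on the three sequences from the question, the snippet below wraps it in a small helper. The function name strip_stop is made up for this illustration, and -E is assumed to be available (it is accepted by both GNU and BSD/OS X sed).

```shell
#!/bin/sh
# strip_stop is a hypothetical helper name, not part of sed.
# It reads sequences on stdin (one per line) and removes a trailing
# TAG, TAA or TGA from each line using extended regular expressions.
strip_stop() {
    sed -E 's/(TAG|TAA|TGA)$//'
}

printf '%s\n' \
    GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG \
    GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA \
    GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA |
    strip_stop
```

Because the group is anchored with $, the TAG occurrences in the middle of each sequence are left alone and only the final three letters are removed.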

Other tips

In basic regular expressions you need to escape the metacharacters | and the parentheses:

sed 's/\(TAG$\|TAA$\|TGA$\)//'

Or you can use the portable option -E to avoid the escaping. -E enables extended regular expressions, so your original command will run as written.
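
To see why the original command appeared to do nothing, it may help to run the same expression in both modes. This is a small sketch using a shortened, made-up sequence (ACGTTAG): in basic regex mode the unescaped (, | and ) are matched as literal characters, so the sequence passes through unchanged.

```shell
#!/bin/sh
seq=ACGTTAG   # shortened example sequence, not taken from the question

# Basic regex (default): ( | ) are literal characters, so nothing matches.
bre_out=$(printf '%s\n' "$seq" | sed 's/(TAG$|TAA$|TGA$)//')

# Extended regex (-E): the same expression now means grouping and alternation.
ere_out=$(printf '%s\n' "$seq" | sed -E 's/(TAG$|TAA$|TGA$)//')

printf 'BRE: %s\nERE: %s\n' "$bre_out" "$ere_out"
```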

For non-GNU sed (or GNU sed with the --posix option), where \| is not available:

If TGG does not occur, or may be removed as well:

sed 's/T[AG][AG]$//' YourFile

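If not, handle TAA and TGA with one substitution and TAG with another (t skips the second substitution when the first has already matched, so only one codon is removed):

sed -e 's/T[AG]A$//' -e t -e 's/TAG$//' YourFile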
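
A quick way to check the difference between the single bracket expression and a two-step form is to run both over all four endings that T[AG][AG] can match. The sketch below uses short made-up sequences; the two-step command uses t (branch to the end of the script after a successful substitution) so that at most one codon is removed per line.

```shell
#!/bin/sh
# Short made-up sequences ending in TAA, TAG, TGA and TGG.
seqs='ACGTAA
ACGTAG
ACGTGA
ACGTGG'

# One bracket expression: also strips TGG.
loose=$(printf '%s\n' "$seqs" | sed 's/T[AG][AG]$//')

# Two substitutions: TAA/TGA first, then TAG; TGG is left alone.
# "t" skips the second substitution when the first one already matched.
strict=$(printf '%s\n' "$seqs" | sed -e 's/T[AG]A$//' -e t -e 's/TAG$//')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```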

By default, sed uses Basic Regular Expressions, which require escaping parentheses and pipes (the \| alternation is a GNU extension):

sed 's/\(TAG\|TAA\|TGA\)$//'

GNU sed also supports the -r option to use Extended Regular Expressions (BSD/OS X sed uses -E instead):

sed -r 's/(TAG|TAA|TGA)$//'

I don't think this will be that helpful for you, but if you want to remove just the last 3 characters regardless of what they are:

sed 's/...$//'

awk can also be used if you would like to try another solution:

awk '{sub(/(TAG|TAA|TGA)$/,"")}1' file
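
For completeness, here is how the awk variant might be tried on the question's sequences without a separate file. In this sketch the here-document stands in for the file argument; sub edits each line in place, and the trailing 1 is an always-true pattern that prints the (possibly modified) line.

```shell
#!/bin/sh
# The here-document replaces the "file" argument for this illustration.
result=$(awk '{ sub(/(TAG|TAA|TGA)$/, "") } 1' <<'EOF'
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAG
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTAA
GCAAAAAGTTGTATAGTCACACAACCTAGACTTATATCGTCTGCTATTCATTGA
EOF
)
printf '%s\n' "$result"
```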
Licensed under: CC-BY-SA with attribution