Question

I'm running sed on Mac OS X Snow Leopard.

sed is (or should be) BSD sed; the man page is dated 2005-05-10 and states:

The sed utility is expected to be a superset of 
the IEEE Std 1003.2 (``POSIX.2'') specification.

When I try a substitution and the input stream contains bytes with values above 127 (outside ASCII), the dot does not match them.

e.g.

echo -e "a001\0001a - a127\0177a - a128\0200a - a255\0377a - a061\0075a" \
| sed -e 's/a[0-9]\{3\}.a/match/g;' ;
echo "result: $?";

results in output:

match - match - a128?a - a255?a - match
result: 0

On OS X Mavericks (which ships the same manual page), the same command fails with an error:

sed: RE error: illegal byte sequence
result: 1

On a Linux Mint 13 system, the same command returns what I expect:

match - match - match - match - match
result: 0

According to http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html#tag_09_03 the '.' should match

"any character in the supported character set except NUL".

Running a similar command with GNU sed (gsed version 4.2.1) on OS X Snow Leopard:

echo -e "a001\0001a - a127\0177a - a128\0200a - a255\0377a - a061\0075a"\
| gsed -e 's/a[0-9]\{3\}.a/match/g;';
echo "result: $?";

I get the same (for me unexpected) result:

match - match - a128?a - a255?a - match
result: 0

  1. Does anybody else see the same behaviour?
  2. Can anybody explain why (is it a bug in BSD sed?) and/or how to circumvent or fix it? I can only guess that it is related to the "supported character set", which would then differ between the systems, especially since on the Snow Leopard system both BSD sed and GNU sed behave the same. I did, however, already check and alter my environment. On the Snow Leopard system:

    $> env | grep '^L'
    LANG=en_US.UTF-8
    LANGUAGE=en_US:en
    LC_CTYPE=UTF-8
    

    And on the Mint system:

    $user@Mint > env | grep '^L'
    LANG=en_US.UTF-8
    LANGUAGE=en_US:en
    LC_CTYPE=UTF-8
    

Solution

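Your locale is UTF-8, but the byte sequence you are echoing is not valid UTF-8, because of the bytes \0200 and \0377: \0200 is a continuation byte with no lead byte before it, and \0377 (0xFF) never occurs in UTF-8 at all. Neither is a character in the current character set, so '.' has nothing to match. If you run sed with LC_ALL=en_US.ISO8859-1 (ISO Latin-1) in its environment, for example via export LC_ALL=en_US.ISO8859-1 (a plain "set" does not export the variable to sed), then it works fine, because the output of echo is a valid ISO Latin-1 string.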
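
The locale override can also be limited to the single sed call. The following is a minimal sketch, assuming a POSIX shell. It uses LC_ALL=C rather than the Latin-1 locale named above, on the assumption that the "C" locale is always available and treats every byte as one character. It also uses printf instead of echo -e, since printf handles octal escapes the same way across shells. The function name replace_bytes is a hypothetical helper, not part of sed.

```shell
# Hypothetical helper: run the substitution from the question with the
# locale forced to "C" for this one sed invocation only.
replace_bytes() {
  LC_ALL=C sed -e 's/a[0-9]\{3\}.a/match/g'
}

# Same test bytes as in the question: 0x01, 0x7F, 0x80, 0xFF and '='.
out=$(printf 'a001\001a - a127\177a - a128\200a - a255\377a - a061\075a\n' | replace_bytes)
printf '%s\n' "$out"
```

With a single-byte locale in effect this should print "match - match - match - match - match". Because the assignment is a prefix on the sed command, the rest of the shell session keeps its UTF-8 locale.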

Licensed under: CC-BY-SA with attribution