Question

I decided to post a question after spending quite some time on this without figuring out the problem. I have also read a number of seemingly related posts, but none really fits my simple (?) problem.

So I have a possibly large text file (>1000 lines) that contains Mandarin Chinese chars, with a sample line like:

"ref#2-5-1.jpg#2#一些 <variable> 内容#pic##" (the Chinese just means "some content"). 

All that needs to change is that a space should be inserted between adjacent Chinese characters, if there is not one already:

"ref#2-5-1.jpg#2#一 些 <variable> 内 容#pic##".

I started naively with straightforward stuff like the following, but there is no match at all:

sed -e 's/\([\u4E00-\u9fff]\)/\1 /g' <test_utf_sed.txt > test_out.txt

where U+4E00 to U+9FFF is supposed to be the code point range for Chinese characters. Unsurprisingly, this did not work, so I also wanted to try

sed -e 's/\([一-龻]\)/hello/g' <test_utf_sed.txt > test_out.txt

This failed because my bash cannot display (?) the "一" character.

Then I ran some basic tests, which failed as well:

sed -e 's/\(\u4E00\)/hello/g' <test_utf_sed.txt > test_out.txt   # 一
sed -e 's/\(\u4E9B\)/hello/g' <test_utf_sed.txt > test_out.txt   # 些

Same with another notation for utf encoding (found here on stackoverflow):

sed -e 's/\(\u'U+4E00\)/hello/g' <test_utf_sed.txt > test_out.txt

1) As a tool for dealing with multi-byte characters, is sed the right choice at all?

2) Is sed able to handle unicode at all, or do I need a special switch?

3) I am not looking for a workaround solution like this:

step1: insert a space after each character
  (like 's/\(.\)/\1 /g')
step2: remove the space after each character that is not a Chinese character
  (like 's/\([a-zA-Z0-9]\) /\1/g')

I know how to do this, but it is inelegant and error-prone. This must be possible using UTF-8 in a regex in sed.

4) My environment is bash-3.2 on Mac OS X 10.6.8 (an oldish OS).

5) If you know of any pointers to an open library of regex one-liners for Chinese text or language processing, it would be great if you shared them.

Thanks a lot in advance, your help is much appreciated!


Solution

Perl has pretty good support for dealing with Unicode. That might be a better bet for your task than sed. This one-liner works like your first sed example:

perl -CIOED -p -e 's/\p{Script_Extensions=Han}/$& /g' filename

The -CIOED tells perl to do its I/O in utf8. -p runs the given code once for each line of the input file, then prints the result. -e specifies a line of Perl code to run. See the documentation on command-line arguments for more.

The regular expression uses a Unicode property, \p{...}, to identify the characters to match; here it selects the characters that belong to the Han script.

You might also want to read the Perl Unicode documentation.

OTHER TIPS

sed doesn't understand \u escape sequences (apparently). Bash does inside $'...', but only from version 4.2 on, so this will not help with bash-3.2; with a newer bash you could write

sed $'s/\u4E9B/hello/g'

but you still wouldn't be able to do the range specification.

However, by translating to UTF-8 by hand, you could arrive at the following extended regular expression which will, I believe, match any UTF-8 sequence for a character in the range U+4E00...U+9FFF:

(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])

(But the character ranges will only work if you invoke sed in a single-byte locale, preferably the C locale.)

With GNU sed, you get extended regular expressions if you provide the -r flag. With MacOSX I believe you need the -E flag. So you could try:

LANG=C sed -E \
       $'s/(\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe9][\x80-\xbf][\x80-\xbf])/\\1 /g' \
       <test_utf_sed.txt >test_out.txt

(The above lets bash handle the \x escapes. If you leave out the $, then sed will handle the \x escapes, but you'll have to change the substitution from \\1 to \1. I don't have a Mac, nor do I have the old version of bash, so I really don't know whether your sed does hex escapes or not; I'm pretty sure that your bash will, but I can't guarantee it.)
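If neither the shell nor sed can be trusted with \x escapes, one possible workaround is to let printf produce the raw bytes from octal escapes, which POSIX printf supports. The following is only a sketch under those assumptions: the function name space_han_sed is made up, and it assumes a sed that accepts -E and treats bracket ranges as plain bytes in the C locale. The octal values are the same bytes as in the expression above (\344 is 0xE4, \270 to \277 is 0xB8 to 0xBF, \200 to \277 is 0x80 to 0xBF, \345 to \351 is 0xE5 to 0xE9).

```shell
#!/bin/sh
# Hypothetical helper: add a space after every UTF-8 encoded character
# in U+4E00...U+9FFF, working on raw bytes in the C locale.
# Reads standard input and writes to standard output.
space_han_sed() {
  # printf turns the octal escapes into raw bytes before sed sees them;
  # inside double quotes, \1 reaches sed unchanged.
  LC_ALL=C sed -E "s/($(printf '\344[\270-\277][\200-\277]|[\345-\351][\200-\277][\200-\277]'))/\1 /g"
}

# Usage:
printf '%s\n' '一些x' | space_han_sed
# prints: 一 些 x
```

Like the sed command above, this adds a space after every matching character, including one that is already followed by a space or by another delimiter.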


By the way, it's not that difficult to get the UTF-8 encodings for those characters; I did it with a little copy-and-paste from the original post. E.g.:

$ hd <<<"一些"
00000000  e4 b8 80 e4 ba 9b 0a                              |.......|

It helps to know that the entire range of plane 0 ideographs (U+4E00...U+9FFF) has three-byte codes, so that 一 is E4 B8 80 and 些 is E4 BA 9B. (The 0A is, of course, a line-end.)
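Where hd is not installed, the same lookup can be sketched with od, which POSIX specifies. The helper name utf8_hex is made up for this example, and it assumes the script and terminal both use UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of its argument as lowercase hex,
# without separators and without a trailing line-end byte.
utf8_hex() {
  # -An drops the offset column, -v disables line folding, -tx1 prints one byte per field.
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
  echo
}

utf8_hex '一'   # prints: e4b880
utf8_hex '些'   # prints: e4ba9b
```

Using printf '%s' instead of a here-string keeps the 0A line-end out of the dump.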

Licensed under: CC-BY-SA with attribution