Question

I'm writing a (simple?) JFlex tokenizer whose goal is to take a string and tease apart the chunks that are in Chinese (or rather, in the Han script) from the parts that are in a Latin script. The tokenizer is applied to brand names, and in my use case a brand name may contain both the Latin and the Chinese name, e.g. "Lenovo 联想".

Brand names can further contain numbers (7up), hyphens (Hewlett-Packard), ampersands (P&G), etc. My tokenizer mostly works, except for cases where the names in Chinese and non-Chinese are written together without any space or separation. Specifically, these are examples of successful and unsuccessful parses:

  • "Calvin Klein卡尔文.克莱" - successfully split into "Calvin Klein" and "卡尔文.克莱", and they get tagged as having the expected script (Latin and Han)

  • "圣威廉SAINT WILLIAM" - wrongly split into "圣威廉SAINT" (marked as Han chars) and "WILLIAM" (marked as Latin).

  • "史努比SNOOPY" - wrongly considered a single Han token.

I thought my rules were pretty unambiguous, but the results seem to indicate otherwise. Here's my rule set:

digit      = [0-9]
whitespace = [ \t\r\n] | \r\n

latin = [\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u01bf\u01c4-\u024f]
han   = [\u3400-\u9fff\uf900-\ufaff\u2f800-\u2fa1f]

// Punctuation in the middle or end of string sequences in a particular script
latin_middle  = [&.\-'`‘]
latin_end     = [.]
han_middle    = [.]

// A basic Latin token contains a mixture of Latin characters and possibly digits.
basic_latin_tok    = ({latin} | {digit})+

compound_latin_tok = {basic_latin_tok} (({whitespace}+ | {latin_middle}) {basic_latin_tok})*{latin_end}?

basic_han_tok     = {han}({han} | {digit})* 
                  | ({han} | {digit})*{han}

compound_han_tok  = {basic_han_tok}({han_middle}{basic_han_tok})*

%%

{compound_latin_tok}             { return "Latin"; }
{compound_han_tok}               { return "Han"; }
.                                { /* skip everything else */ }

What am I doing wrong?

Thanks!!

EDIT

I asked the folks on the SourceForge JFlex mailing list, and one of them replied to me - turns out JFlex 1.4.* can't handle Unicode characters that are not representable in 16 bits. Since some of the character ranges I've specified above for Han characters go above 16-bit values, JFlex gets confused. Removing those from the regex made it all work nicely.

For reference: http://jflex.de/manual.html#SECTION000101000000000000000
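
If you're curious what "not representable in 16 bits" means in practice, here's a quick check I ran (just an illustration with a throwaway class name, not something from the mailing list reply):

public class BmpCheck {
    public static void main(String[] args) {
        // Character.charCount() is 1 for BMP code points and 2 for supplementary
        // ones, which need a surrogate pair and don't fit in a single 16-bit char.
        System.out.println(Character.charCount(0x9FFF));   // 1 -> fine for JFlex 1.4.*
        System.out.println(Character.charCount(0x2F800));  // 2 -> the range that confused it
    }
}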


Solution

I asked the folks on the SourceForge JFlex mailing list, and one of them replied to me - turns out JFlex 1.4.* can't handle Unicode characters that are not representable in 16 bits.

Oh, but it can handle those.

First of all, let me offer a correction to \u2f800-\u2fa1f. The last value is indeed not representable in 16 bits, but also, the assigned characters in that Unicode block stop at \u2fa1d, so \u2fa1f is not a valid character to match even in its 32-bit representation.
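
(If you want to check that last point yourself rather than take my word for it, a couple of lines of Java will do; the class name is arbitrary and this is only a sanity check, not part of the lexer spec:)

public class AssignedCheck {
    public static void main(String[] args) {
        // With a reasonably recent JDK's Unicode tables:
        System.out.println(Character.isDefined(0x2FA1D));  // true  - last assigned ideograph there
        System.out.println(Character.isDefined(0x2FA1F));  // false - no character assigned
    }
}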

Now, the trick to convince JFlex to handle your troublesome [\u2f800-\u2fa1d] range: Java source code (and string literals) uses UTF-16, so a supplementary character is written as a 16-bit high surrogate followed by its paired 16-bit low surrogate.

For the range you need, you are lucky: the high surrogate stays the same for the whole range, namely \uD87E, while the low surrogate varies over [\uDC00-\uDE1D]. So your han macro becomes

han   = [\u3400-\u9fff\uf900-\ufaff] | \uD87E[\uDC00-\uDE1D]
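
If you'd rather verify those surrogate values than trust my arithmetic, a few lines of plain Java will print them (again just a sanity check with a throwaway class name; the lexer spec only needs the macro above):

public class SurrogateCheck {
    public static void main(String[] args) {
        // Character.toChars() splits a supplementary code point into its
        // UTF-16 surrogate pair: { high surrogate, low surrogate }.
        char[] first = Character.toChars(0x2F800);
        char[] last  = Character.toChars(0x2FA1D);
        System.out.printf("U+2F800 -> %04X %04X%n", (int) first[0], (int) first[1]);
        System.out.printf("U+2FA1D -> %04X %04X%n", (int) last[0], (int) last[1]);
        // Prints:
        // U+2F800 -> D87E DC00
        // U+2FA1D -> D87E DE1D
    }
}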

A resource I found useful when I was too lazy to convert 32-bit code points to UTF-16/UTF-8 by hand: http://www.fileformat.info. For example, see http://www.fileformat.info/info/unicode/char/2fa1d/index.htm and scroll down to "UTF-16 (hex)" or "C/C++/Java source code".

Licensed under: CC-BY-SA with attribution