ANTLR-generated lexer hangs on a Unicode character from a supplementary plane (ANTLR 3.4)

https://stackoverflow.com/questions/14041826

12-12-2021

Question

I'm parsing PHP code using an ANTLR grammar and the ANTLR Ruby target. One of the source files I have to parse contains translations, some of which make heavy use of Unicode characters. The generated lexer seems to hang on one character from a supplementary plane, namely U+10430.

I had a similar problem in the past because the Ruby ANTLR target is quite old and was not Unicode-compliant (well, neither was Ruby at the time). We had to bump getMaxCharValue in RubyTarget.java from 0xFF (8-bit characters) to 0xFFFF (the Basic Multilingual Plane) to solve it. Now it seems that even this range is insufficient. Unicode states that characters outside this range may be represented in UTF-16 as a pair of 16-bit code units (a surrogate pair), but how does ANTLR handle this? Would bumping getMaxCharValue again help? It did once, but I'm no fan of the trial-and-error approach.

Thanks!

Solution

The reference Java target for ANTLR can only parse characters in the supplementary planes by using a UTF-16 surrogate pair in the grammar and a UTF-16 encoding for the input stream. Other targets are created by members of the community and may or may not support the same range of characters (as you saw with the Ruby target).

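Since there is no way to represent anything past 0xFFFE in the grammar itself, you'll be limited to the UTF-16 encoding even if you modify a target to support characters above 0xFF: a supplementary character has to be matched as its two surrogate code units rather than as a single code point.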
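
To make the surrogate-pair point concrete, here is a minimal Ruby sketch of how a supplementary code point such as U+10430 splits into the two 16-bit code units a UTF-16 based lexer would see. The helper names (utf16_code_units, utf16_code_units_via_encode) are hypothetical and are not part of ANTLR or its Ruby target; the arithmetic is the standard UTF-16 encoding, cross-checked against Ruby's own String#encode.

```ruby
# Hypothetical helper (not part of ANTLR): split one Unicode code point
# into the UTF-16 code units that a 16-bit lexer would have to match.
def utf16_code_units(code_point)
  # Code points in the Basic Multilingual Plane fit in a single unit.
  return [code_point] if code_point <= 0xFFFF

  offset = code_point - 0x10000
  high = 0xD800 + (offset >> 10)   # high (lead) surrogate: top 10 bits
  low  = 0xDC00 + (offset & 0x3FF) # low (trail) surrogate: bottom 10 bits
  [high, low]
end

# Cross-check using Ruby's built-in encoder: encode to big-endian UTF-16
# and read the bytes back as unsigned 16-bit integers.
def utf16_code_units_via_encode(string)
  string.encode("UTF-16BE").unpack("n*")
end

units = utf16_code_units(0x10430)
puts units.map { |u| format("0x%04X", u) }.join(" ") # prints 0xD801 0xDC30
```

Under these assumptions, a grammar restricted to 16-bit characters would need to match U+10430 as the sequence 0xD801 0xDC30, and the input stream would have to hand the lexer UTF-16 code units rather than whole code points.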

Licensed under: CC-BY-SA with attribution
