JFlex and accented characters

https://stackoverflow.com/questions/16571232

29-05-2022

Question

I need to create a parser with JFlex to extract all words from an input file, including those with accented characters like á, é, í, ó, ú, ñ, ...

My problem is that even with all files saved as UTF-8 and the %unicode directive set, the scanner does not recognize those characters.

The .lex file looks like this:

import java_cup.runtime.*;
%%
%class ParserLex
%unicode
%public
%final
%cup

%init{
%init}

%{
    private Symbol sym(int type) {
        return sym(type, yytext());
    }
    private Symbol sym(int type, Object value) {
        return new Symbol(type, yyline, yycolumn, value);
    }
%}

Token       = [áéíóú]
/* [^] matches any character, including line terminators, which "." does not */
ANY         = [^]

%%

{Token}
    { System.out.println(yytext()); }

{ANY}
    {   }

My test class looks like this:

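import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java_cup.runtime.Symbol;

class ParserTest {
    public static void main(String[] args) throws IOException {
        // Decode the input explicitly as UTF-8 instead of relying on the platform default.
        try (InputStreamReader reader = new InputStreamReader(new FileInputStream(args[0]), StandardCharsets.UTF_8)) {
            ParserLex parser = new ParserLex(reader);
            // Drain the scanner; symbol id 0 is CUP's EOF.
            for (Symbol sym = parser.next_token(); sym.sym != 0; sym = parser.next_token()) {
            }
        }
    }
}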

Any ideas or advice about this problem?

Solution

I recently discovered that JFlex prints warnings like

Warning in file "scanner.jflex" (line 42):
Rule can never be matched:
"???"  { return new Symbol(Symbols.CIRCLED_MINUS, 1, yycolumn + 1, null); }

for UTF-8 character literals in my specification, such as

"⊖"  { return new Symbol(Symbols.CIRCLED_MINUS, 1, yycolumn + 1, null); }

Being on Linux, I set the LANG environment variable to a UTF-8 locale (e.g. C.UTF-8), and that removed the warning. Passing -Dfile.encoding=UTF-8 to the JVM that runs JFlex is more portable. I also found feature request 29, which suggests that JFlex reads the specification using the system default encoding.
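
The three question marks in the warning are consistent with "⊖" (U+2296) being stored as three UTF-8 bytes and each byte being decoded as an invalid character under a non-UTF-8 default. Below is a minimal sketch of that effect, not JFlex's actual file-reading code. The class and method names (EncodingDemo, decodeUtf8BytesAs, describe) are made up for illustration, and US-ASCII is only assumed as a stand-in for the default charset a non-UTF-8 LANG setting might give.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode the text as UTF-8 (as the spec file is stored on disk), then decode
    // those bytes with the given charset (as a tool using a default charset would).
    public static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    // Render each code point as U+XXXX so the output does not depend on the console encoding.
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as an escape so this source file itself is encoding-independent.
        String circledMinus = "\u2296";

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("decoded as UTF-8:    "
                + describe(decodeUtf8BytesAs(circledMinus, StandardCharsets.UTF_8)));
        // US-ASCII is an assumed stand-in for a non-UTF-8 platform default.
        System.out.println("decoded as US-ASCII: "
                + describe(decodeUtf8BytesAs(circledMinus, StandardCharsets.US_ASCII)));
    }
}
```

Decoding as UTF-8 gives back the single code point U+2296, while decoding as US-ASCII gives three U+FFFD replacement characters, which many consoles display as "?". Running the same class with and without -Dfile.encoding=UTF-8 is a quick way to see which default charset the JVM picks up in a given environment.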

Licensed under: CC-BY-SA with attribution
