Resolving JavaCC token ambiguity

https://stackoverflow.com/questions/23384013

12-07-2023

Question

I'm trying to parse regular expressions using JavaCC, but I've run into a problem with integers. In some productions I want to interpret a run of digits as individual characters; in something like (ab){1,20}, however, I want the numbers inside the braces to be interpreted as integers. The problem is that JavaCC chooses the first token in the list that matches, regardless of whether that token is expected in the current production.

I have a token DIGIT and a token INTEGER, defined as one or more DIGITs. If I prioritize DIGIT, INTEGER is never chosen. If I prioritize INTEGER, it is chosen even in the productions where I want to interpret the digits one by one.

I also tried writing something like (< DIGIT >)+ in the production that expects an integer, but then I don't know how to assign the result to a Token. Is there a way to assign the whole sequence to a single token, or at least to append each digit to the image of one token, or to store an array of tokens?

Solution

If you want digits to be interpreted as single tokens in some places and as integers in others, you need to use lexical states. See the documentation and the FAQ. You can probably switch states on a { and switch back on a }. Something like this:

<DEFAULT> TOKEN : {
    <DIGIT : ["0"-"9"]>
}
<INBRACES> TOKEN : {
    <NUMBER : (["0"-"9"])+ >
}
<*> TOKEN : {
    <LBRACE : "{" > : INBRACES
|
    <RBRACE : "}" > : DEFAULT
|
    ...other rules apply in all states...
}
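
To show how the parser side might consume these tokens, here is a minimal complete grammar sketch built around the rules above. The parser name BraceDemo, the COMMA token, the productions input, item and repetition, and the sample input are all made up for illustration, and the grammar only accepts digits followed by an optional {n} or {n,m}, so treat it as a starting point rather than a regex parser.

```javacc
options { STATIC = false; }

PARSER_BEGIN(BraceDemo)
import java.io.StringReader;

// Hypothetical demo parser; not part of the original question or answer.
public class BraceDemo {
    public static void main(String[] args) throws ParseException {
        BraceDemo parser = new BraceDemo(new StringReader("12{3,45}"));
        parser.input();
    }
}
PARSER_END(BraceDemo)

// Outside braces every digit is its own token.
<DEFAULT> TOKEN : {
    < DIGIT : ["0"-"9"] >
}

// Inside braces a run of digits is one NUMBER token.
// COMMA is an assumed extra token needed for the {min,max} form.
<INBRACES> TOKEN : {
    < NUMBER : (["0"-"9"])+ >
|
    < COMMA : "," >
}

// The braces are recognized in every state and switch the lexical state.
<*> TOKEN : {
    < LBRACE : "{" > : INBRACES
|
    < RBRACE : "}" > : DEFAULT
}

void input() : {}
{
    ( item() )* <EOF>
}

// One digit treated as a literal character, optionally followed by {n} or {n,m}.
void item() : { Token d; }
{
    d = <DIGIT> { System.out.println("char " + d.image); }
    [ repetition() ]
}

void repetition() : { Token min; Token max = null; }
{
    <LBRACE> min = <NUMBER> [ <COMMA> max = <NUMBER> ] <RBRACE>
    {
        System.out.println("repeat " + min.image
            + (max == null ? "" : ".." + max.image));
    }
}
```

If I have traced it correctly, running the generated parser on 12{3,45} reports 1 and 2 as separate characters and then a single repetition with bounds 3 and 45. Note that the state switch happens in the token manager as soon as the brace is matched, independently of which production the parser is in.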

OTHER TIPS

You're trying to do something in the scanner that should be done in the grammar. The scanner should return numbers as numbers, and in the places in the grammar where you want to allow numbers as well as characters, allow numbers to appear in the production.
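
A rough sketch of what this grammar-level approach could look like, with no lexical states at all. The parser name NumberDemo, the token names CHAR and COMMA, and the productions literalChars and bound are assumptions made for the example, not something from the question.

```javacc
PARSER_BEGIN(NumberDemo)
// Hypothetical demo parser; not part of the original question or answer.
public class NumberDemo {}
PARSER_END(NumberDemo)

// The scanner always returns a run of digits as one NUMBER.
// CHAR is declared last so the more specific tokens win ties of equal length.
TOKEN : {
    < NUMBER : (["0"-"9"])+ >
|
    < LBRACE : "{" >
|
    < RBRACE : "}" >
|
    < COMMA : "," >
|
    < CHAR : ~[] >
}

// Literal context: accept either a single character or a NUMBER.
// The caller treats each character of the returned string as its own literal.
String literalChars() : { Token t; }
{
    ( t = <CHAR> | t = <NUMBER> )
    { return t.image; }
}

// Repetition-bound context: the same NUMBER token is read as an integer.
int bound() : { Token t; }
{
    t = <NUMBER>
    { return Integer.parseInt(t.image); }
}
```

One thing to watch with this approach: in a regex like 12{3} the repetition applies only to the 2, so the code that calls literalChars would have to attach a following quantifier to the last character of the image rather than to the whole number.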

Licensed under: CC-BY-SA with attribution