Question

Is it possible to extract the first and follow sets from a rule using ANTLR4? I played around with this a little bit in ANTLR3 and did not find a satisfactory solution, but if anyone has info for either version, it would be appreciated.

I would like to parse user input up the user's cursor location and then provide a list of possible choices for auto-completion. At the moment, I am not interested in auto-completing tokens which are partially entered. I want to display all possible following tokens at some point mid-parse.

For example:

sentence: 
   subjects verb (adverb)? '.' ;

subjects:
   firstSubject (otherSubjects)* ;

firstSubject:
   'The' (adjective)? noun ;

otherSubjects:
   'and the' (adjective)? noun; 

adjective:
   'small' | 'orange' ;

noun: 
   CAT | DOG ;

verb:
   'slept' | 'ate' | 'walked' ;

adverb:
   'quietly' | 'noisily' ;

CAT : 'cat';
DOG : 'dog';

Given the grammar above...

If the user had not typed anything yet the auto-complete list would be ['The'] (Note that I would have to retrieve the FIRST and not the FOLLOW of rule sentence, since the follow of the base rule is always EOF).

If the input was "The", the auto-complete list would be ['small', 'orange', 'cat', 'dog'].

If the input was "The cat slept, the auto-complete list would be ['quietly', 'noisily', '.'].

So ANTLR3 provides a way to get the set of follows doing this:

BitSet followSet = state.following[state._fsp];

This works well. I can embed some logic into my parser so that when the parser calls the rule at which the user is positioned, it retrieves the follows of that rule and then provides them to the user. However, this does not work as well for nested rules (For instance, the base rule, because the follow set ignores and sub-rule follows, as it should).

I think I need to provide the FIRST set if the user has completed a rule (this could be hard to determine) as well as the FOLLOW set of to cover all valid options. I also think I will need to structure my grammar such that two tokens are never subsequent at the rule level.

I would have break the above "firstSubject" rule into some sub rules...

from

firstSubject:
    'The'(adjective)? CAT | DOG;

to

firstSubject:
     the (adjective)?  CAT | DOG;
the:
     'the'; 

I have yet to find any information on retrieving the FIRST set from a rule.

ANTLR4 appears to have drastically changed the way it works with follows at the level of the generated parser, so at this point I'm not really sure if I should continue with ANTLR3 or make the jump to ANTLR4.

Any suggestions would be greatly appreciated.

Was it helpful?

Solution

ANTLRWorks 2 (AW2) performs a similar operation, which I'll describe here. If you reference the source code for AW2, keep in mind that it is only released under an LGPL license.

  1. Create a special token which represents the location of interest for code completion.

    • In some ways, this token behaves like the EOF. In particular, the ParserATNSimulator never consumes this token; a decision is always made at or before it is reached.
    • In other ways, this token is very unique. In particular, if the token is located at an identifier or keyword, it is treated as though the token type was "fuzzy", and allowed to match any identifier or keyword for the language. For ANTLR 4 grammars, if the caret token is located at a location where the user has typed g, the parser will allow that token to match a rule name or the keyword grammar.
  2. Create a specialized ATN interpreter that can return all possible parse trees which lead to the caret token, without looking past the caret for any decision, and without constraining the exact token type of the caret token.

  3. For each possible parse tree, evaluate your code completion in the context of whatever the caret token matched in a parser rule.

  4. The union of all the results found in step 3 is a superset of the complete set of valid code completion results, and can be presented in the IDE.

The following describes AW2's implementation of the above steps.

  1. In AW2, this is the CaretToken, and it always has the token type CARET_TOKEN_TYPE.
  2. In AW2, this specialized operation is represented by the ForestParser<TParser> interface, with most of the reusable implementation in AbstractForestParser<TParser> and specialized for parsing ANTLR 4 grammars for code completion in GrammarForestParser.
  3. In AW2, this analysis is performed primarily by GrammarCompletionQuery.TaskImpl.runImpl(BaseDocument).
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top