Question

Suppose I need a simple grammar that describes a language like

foo 2
bar 21

but not

foo1

Using JFlex, I wrote something like

WORD=[a-zA-Z]+
NUMBER=[0-9]+
WHITE_SPACE_CHAR=[\ \n\r\t\f]

%state AFTER_WORD
%state AFTER_WORD_SEPARATOR

%%
<YYINITIAL>{WORD}               { yybegin(AFTER_WORD); return TokenType.WORD; }        
<AFTER_WORD>{WHITE_SPACE_CHAR}+ { yybegin(AFTER_WORD_SEPARATOR); return TokenType.WHITE_SPACE; }        
<AFTER_WORD_SEPARATOR>{NUMBER}  { yybegin(YYINITIAL); return TokenType.NUMBER; }        

{WHITE_SPACE_CHAR}+             { return TokenType.WHITE_SPACE; }

But I don't like the extra states, which exist only to say that there must be whitespace between the word and the number. How can I simplify my grammar?


Solution 2

From what I know of JFlex, if you are recognizing whitespace correctly (which seems to be the case), you don't need extra states. Just write one rule for "identifiers" and another for "numbers".

%%
{WORD}    { return TokenType.WORD; }
{NUMBER}  { return TokenType.NUMBER; }

If your language requires each line to consist of exactly one identifier, one space and one number, that should be checked by syntactic analysis (i.e. by a parser), not by lexical analysis.
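
As a rough illustration of pushing that check into the parser, here is a minimal Java sketch. It assumes the lexer has already dropped whitespace and that the whole input is just a sequence of WORD NUMBER pairs; the class PairSyntaxCheck and its isValid method are hypothetical names, not part of JFlex.

```java
import java.util.List;

class PairSyntaxCheck {
    // Hypothetical token types, mirroring the two lexer rules above.
    enum TokenType { WORD, NUMBER }

    /** Returns true if the tokens form zero or more WORD NUMBER pairs. */
    static boolean isValid(List<TokenType> tokens) {
        if (tokens.size() % 2 != 0) {
            return false; // a dangling WORD or NUMBER
        }
        for (int i = 0; i < tokens.size(); i += 2) {
            if (tokens.get(i) != TokenType.WORD
                    || tokens.get(i + 1) != TokenType.NUMBER) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tokens for "foo 2 bar 21"
        System.out.println(isValid(List.of(
                TokenType.WORD, TokenType.NUMBER,
                TokenType.WORD, TokenType.NUMBER)));
        // Tokens for "foo bar"
        System.out.println(isValid(List.of(TokenType.WORD, TokenType.WORD)));
    }
}
```

Note that with only the two lexer rules above, foo1 would still be split into a WORD followed by a NUMBER, so a check like this does not reject it on its own.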

Other tips

You shouldn't need white space tokens when parsing at all.

Get rid of TokenType.WHITE_SPACE, and when you get white space in the lexer, just ignore it instead of returning anything.

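To prevent 'foo1', add another rule for [A-Za-z0-9]+, placed after the WORD and NUMBER rules, with its own token type that doesn't appear in the grammar. The lexer prefers the longest match, so 'foo1' becomes that token and the parser reports a syntax error.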
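
To see why such a catch-all rule rejects foo1 without breaking foo 2, here is a small Java simulation of the matching policy JFlex applies: take the longest match, and on a tie prefer the rule listed first. This is only a sketch using java.util.regex, not generated JFlex code; LongestMatchDemo, tokenize and BAD_TOKEN are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // BAD_TOKEN is a hypothetical type that the grammar never accepts.
    enum TokenType { WORD, NUMBER, BAD_TOKEN }

    // Rules in priority order: on equal match length the earlier rule wins.
    private static final TokenType[] TYPES = {
        TokenType.WORD, TokenType.NUMBER, TokenType.BAD_TOKEN
    };
    private static final Pattern[] PATTERNS = {
        Pattern.compile("[a-zA-Z]+"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[a-zA-Z0-9]+"),
    };
    private static final Pattern WHITE_SPACE = Pattern.compile("[ \\n\\r\\t\\f]+");

    static List<TokenType> tokenize(String input) {
        List<TokenType> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // Whitespace is skipped instead of being returned as a token.
            Matcher ws = WHITE_SPACE.matcher(input).region(pos, input.length());
            if (ws.lookingAt()) {
                pos = ws.end();
                continue;
            }
            int bestLen = 0;
            TokenType best = null;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
                // Strictly longer only, so earlier rules keep ties.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    best = TYPES[i];
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("Unexpected character at " + pos);
            }
            tokens.add(best);
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("foo 2"));  // [WORD, NUMBER]
        System.out.println(tokenize("bar 21")); // [WORD, NUMBER]
        System.out.println(tokenize("foo1"));   // [BAD_TOKEN]
    }
}
```

For foo, the WORD rule and the catch-all both match three characters, so the earlier WORD rule wins. For foo1, only the catch-all matches all four characters, so the parser sees a token it has no production for.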

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow