Keeping hexadecimal numbers, and text that only has A-F separate

https://stackoverflow.com/questions/20447709

30-08-2022

Problem

Among other more specific and non-problematic ones, I have the following lexer rules:

[a-zA-Z][a-zA-Z0-9]+ {
    yylval.string = strdup(yytext);
    return FILENAME;
}

 /* 32-bit numbers */
[a-fA-F0-9]{1,8} {
    std::stringstream ssh;
    ssh << std::hex << yytext;
    ssh >> yylval.u32.hex;
    std::stringstream ssd;
    ssd << std::dec << yytext;
    ssd >> yylval.u32.dec;
    return NUMBER;
}

The "NUMBER" rule is already a kludge, because the grammar I'm implementing gives me no way to distinguish hex from decimal numbers: hex numbers have no prefix, and the base depends entirely on context. So in the parser rules I just pick the field from the struct that I know I need.
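To make the ambiguity concrete, here is a minimal C++ sketch of the dual conversion outside the lexer. It is not the original code: `DualNumber` and `parse_both` are hypothetical names, and `std::strtoul` is swapped in for the string streams.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for the yylval.u32 struct in the question.
struct DualNumber {
    std::uint32_t hex;  // value if the text is read as base 16
    std::uint32_t dec;  // value if the text is read as base 10
};

// Convert the same token text in both bases; the parser later picks one field.
// Assumes text is 1-8 characters from [0-9a-fA-F], as the lexer rule guarantees.
inline DualNumber parse_both(const char* text) {
    DualNumber n;
    n.hex = static_cast<std::uint32_t>(std::strtoul(text, nullptr, 16));
    // strtoul stops at the first character that is not a decimal digit,
    // so a token such as "ff" yields 0 in the dec field.
    n.dec = static_cast<std::uint32_t>(std::strtoul(text, nullptr, 10));
    return n;
}
```

For the text `10`, the two fields come out as 16 and 10, so only the surrounding grammar context can say which one is meant.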

It seems like I need to extend this even further. I have a "filename" type, which is just any alphanumeric string that at the least begins with an alpha character, followed by any alphanumeric (as defined above). Problem is, a filename like fffff is causing incorrect parsing. It feels like the only way I can take care of this is to combine the NUMBER and FILENAME into something like ALPHANUMERIC, where I would do something like:

 /* 32-bit numbers, strings, sigh... */
[a-zA-Z0-9]{1,8} {
    std::stringstream ssh;
    ssh << std::hex << yytext;
    ssh >> yylval.alphanumeric.hex;
    std::stringstream ssd;
    ssd << std::dec << yytext;
    ssd >> yylval.alphanumeric.dec;
    yylval.alphanumeric.string = strdup(yytext);
    return ALPHANUMERIC;
}

Then I would have to be a bit smarter in the parser and check for an initial alpha, and use the right struct field.

Is this a common compromise? It feels wrong: the more liberal the lexing, the more likely I am to create holes I haven't tested, where it will either fail or capture too much. I'll also end up needlessly converting lots of strings like "hello" to hex and dec values.

Solution

The usual way is to use different flex rules for the different classes of tokens that can occur, with A_OR_B tokens for things that might be two different things:

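 /* digits only: read as decimal */
[0-9]+ {
    yylval.u32 = strtoul(yytext, 0, 10);
    return NUMBER; }
 /* starts with a-f, only hex digits: hex number or name, the parser decides */
[a-fA-F][a-fA-F0-9]* {
    yylval.string = strdup(yytext);
    return NUMBER_OR_NAME; }
 /* starts with a digit and contains a-f: can only be a hex number */
[a-fA-F0-9]+ {
    yylval.u32 = strtoul(yytext, 0, 16);
    return NUMBER; }
 /* anything else that starts with a letter: can only be a name */
[a-zA-Z][a-zA-Z0-9]* {
    yylval.string = strdup(yytext);
    return NAME; }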

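Flex always picks the longest match; when several patterns match the same length, the rule listed first wins. That is why the order of the rules above matters: fffff matches the last three patterns at the same length, so it is returned as NUMBER_OR_NAME, while 123 matches the first and third patterns and is returned as a decimal NUMBER.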
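The longest-match, first-rule-wins behaviour can be checked without running flex. The following is only a sketch that imitates the selection with `std::regex`; the `Token` enum and `first_token` function are hypothetical names, and real flex uses a generated DFA rather than trying each pattern in turn.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

enum class Token { NUMBER, NUMBER_OR_NAME, NAME, NONE };

// Return the token the four rules above would produce at the start of
// `input`, and set `len` to the number of characters consumed.
inline Token first_token(const std::string& input, std::size_t& len) {
    struct Rule { std::regex pattern; Token token; };
    // Same patterns, in the same order, as the flex rules above.
    static const std::vector<Rule> rules = {
        {std::regex("[0-9]+"), Token::NUMBER},
        {std::regex("[a-fA-F][a-fA-F0-9]*"), Token::NUMBER_OR_NAME},
        {std::regex("[a-fA-F0-9]+"), Token::NUMBER},
        {std::regex("[a-zA-Z][a-zA-Z0-9]*"), Token::NAME},
    };
    Token best = Token::NONE;
    len = 0;
    for (const Rule& r : rules) {
        std::smatch m;
        // match_continuous anchors the match at the start of the input.
        if (std::regex_search(input, m, r.pattern,
                              std::regex_constants::match_continuous)) {
            const std::size_t n = static_cast<std::size_t>(m.length(0));
            // Strictly longer wins; on a tie the earlier rule is kept.
            if (n > len) { len = n; best = r.token; }
        }
    }
    return best;
}
```

Calling `first_token` on `fffff` selects NUMBER_OR_NAME, on `12ab` the hex NUMBER rule, and on `face1z` the NAME rule, because only the last pattern can consume the trailing `z`.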

In the parser, accept the ambiguous token wherever either kind is valid:

name: NAME | NUMBER_OR_NAME ;

number: NUMBER | NUMBER_OR_NAME { $$ = strtoul($1, 0, 16); free($1); } ;
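The action on the `number` rule does two things: it converts the saved text to a value and releases the copy made by `strdup` in the lexer. A small sketch of that step as a standalone helper follows; `take_hex_value` is a hypothetical name, and `std::strtoul` is used here as an assumption about the desired range, since `strtol` saturates at `LONG_MAX` for `ffffffff` on platforms where `long` is 32 bits.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical helper for the NUMBER_OR_NAME alternative of `number`:
// interpret the token text as hex, then free the heap copy the lexer made.
// The caller must not use `text` after this call.
inline std::uint32_t take_hex_value(char* text) {
    const std::uint32_t value =
        static_cast<std::uint32_t>(std::strtoul(text, nullptr, 16));
    std::free(text);  // text was allocated by strdup(), which uses malloc()
    return value;
}
```

With such a helper the action would shrink to `{ $$ = take_hex_value($1); }`, while the `name` rule keeps ownership of the string and passes it on unchanged.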

Other tips

I would distinguish decimal from hex with separate tokens such as hexnumber and decnumber. Since the distinction here is purely context-based, though, you would have to impose some constraints, for example requiring a filename to be at least 9 characters long, since a string that long cannot be a valid 32-bit hex number.
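The length constraint can be expressed as a simple predicate. This is a sketch only; `could_be_hex32` is a hypothetical name, and it assumes the 1 to 8 hex digit limit from the question's NUMBER rule.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True if `text` could be an unprefixed 32-bit hex number:
// 1 to 8 characters, all of them hexadecimal digits.
inline bool could_be_hex32(const std::string& text) {
    if (text.empty() || text.size() > 8) return false;
    for (unsigned char c : text) {
        if (!std::isxdigit(c)) return false;
    }
    return true;
}
```

Under this rule a 9-character token such as `fffffffff` is unambiguously a filename, whereas `fffff` still needs context to resolve.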

License: CC-BY-SA with attribution
