I'm trying to parse a language which has a certain "escape sequence" using flex/bison. Currently, I'm stuck defining the lexer. It's easiest to explain by example:

Text    --> {if} test {literal} text 43.21 {if} foo {/literal} {if}
            ---- ---- --------- ------------------- ---------- ----   etc.
Desired -->  IF  TEXT (ignore)         TEXT          (ignore)   IF
  Token

As you can see, the language contains some terminal symbols such as IF or TEXT, which are rather straightforward. However, everything between {literal} and {/literal} is a single TEXT, even if it contains strings which would otherwise be special tokens.

The best I could come up with for a lexer so far is this, which uses start conditions to jump between the different states: if it encounters a {literal}, it activates the LITERAL rules.

%{
#include <stdio.h>
#define YY_DECL int yylex()
%}
%x LITERAL
%%
[^{]+                 {printf("TEXT: %s\n", yytext);}
"{if}"                {printf("IF\n");}
"{literal}"           {BEGIN(LITERAL);}
<LITERAL>[^{]+        {printf("TEXT: %s\n", yytext);}
<LITERAL>"{"          {printf("TEXT: %s\n", yytext);}
<LITERAL>"{/literal}" {BEGIN(INITIAL);}
%%
int main(void) { yylex(); return 0; }

But how to leave the LITERAL state? Using this definition with the example above gives

IF
TEXT:  test 
TEXT:  text 43.21 
TEXT: {
TEXT: if} foo 
IF

In other words, the TEXT token within the {literal} tags is split at the {. How can I avoid this?


Solution

The text inside {literal} is split at { because the <LITERAL>"{" rule produces a match of its own. To avoid the split, the rules inside the LITERAL start condition need to extend the current match rather than each starting a new one. This is a fairly common (f)lex idiom, and there is a feature designed specifically for this purpose: yymore:

yymore() tells the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of yytext rather than replacing it.
(from the flex manual.)

Using that handy feature, we can write:

"{literal}"           {BEGIN(LITERAL);}
<LITERAL>[^{]+        {yymore();}
<LITERAL>"{"          {yymore();}
<LITERAL>"{/literal}" {
                        /* Now we have to provide the token, but we've matched
                         * 10 extra characters, the close marker, and so the
                         * token is the text from yytext with length yyleng-10.
                         * Here we just print it out, but normally we'd copy
                         * yytext to a temporary for future processing.
                         * Most compilers will optimize out the call to strlen.
                         */
                        BEGIN(INITIAL);
                        printf("TEXT: %.*s\n",
                               (int)(yyleng - strlen("{/literal}")),
                               yytext);
                      }

The above assumes that the LITERAL state is literally literal :), that is, that it is only terminated with the {/literal} tag and that the {/literal} tag is always recognized, regardless of context. However, it's not dependent on that; you could do more complex token recognition inside the literal scan, as long as you always use yymore() in every action except the action for the closing tag.
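
The comment in the closing-tag action mentions copying yytext to a temporary, since flex reuses that buffer on the next match. A minimal sketch of such a copy in plain C follows; the helper name copy_without_trailer is hypothetical and not part of flex, and it assumes the caller passes the matched text, its length, and the length of the closing tag.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of flex): return a heap-allocated,
 * NUL-terminated copy of the first (len - trailer_len) bytes of text.
 * Returns NULL if the trailer is longer than the text or if malloc fails.
 * The caller owns the result and must free() it. */
char *copy_without_trailer(const char *text, size_t len, size_t trailer_len)
{
    assert(text != NULL);
    if (trailer_len > len)
        return NULL;

    size_t body_len = len - trailer_len;
    char *copy = malloc(body_len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, text, body_len);   /* the body may contain any bytes */
    copy[body_len] = '\0';
    return copy;
}
```

Inside the <LITERAL>"{/literal}" action this could be called as copy_without_trailer(yytext, (size_t)yyleng, strlen("{/literal}")), with the result handed to the parser, which would then be responsible for freeing it.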

If my assumption is correct, another solution is available: simply match the entire literal with a single regular expression. It would be easier to write that expression with non-greedy matches (or even directly as a finite state machine), but unfortunately flex implements neither, so it has to be done the long way, and it is truly long. Fortunately, the end marker starts with a character ({) that does not appear anywhere else in the end marker, so the regular expression can be generated mechanically fairly easily. Here, I've used flex definitions to avoid a really long line, and to make the pattern a little more apparent:

l1             [{]
l2          "/"[{]
l3         "/l"[{]
l4        "/li"[{]
l5       "/lit"[{]
l6      "/lite"[{]
l7     "/liter"[{]
l8    "/litera"[{]
l9   "/literal"[{]
loop [{]({l1}|{l2}|{l3}|{l4}|{l5}|{l6}|{l7}|{l8}|{l9})*

n1             [^{/]
n2          "/"[^{l]
n3         "/l"[^{i]
n4        "/li"[^{t]
n5       "/lit"[^{e]
n6      "/lite"[^{r]
n7     "/liter"[^{a]
n8    "/litera"[^{l]
n9   "/literal"[^{}]
next {n1}|{n2}|{n3}|{n4}|{n5}|{n6}|{n7}|{n8}|{n9}

prefix  "{literal}"
middle  ([^{]|{loop}{next})*
suffix  {loop}"/literal}"

literal {prefix}{middle}{suffix}

%%

{literal}  {
              /* The token includes both the {literal} opener and
               * the {/literal} closer, so we need to get rid of
               * both of them.
               */ 
              printf("TEXT: %.*s\n",
                     (int)(yyleng - strlen("{literal}") - strlen("{/literal}")),
                     yytext + strlen("{literal}"));
           }
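
Since the lN and nN definitions follow a fixed pattern, they could be produced by a small program rather than typed by hand. The following is only a sketch in plain C: all function names are hypothetical, and it assumes the closing marker's first character appears nowhere else in the marker and that the marker contains no characters needing escaping in a flex pattern (it rejects markers that do not meet those assumptions).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helpers; none of these names come from flex. */

/* Accept only closers this sketch can handle: at least two characters,
 * a first character that is not repeated, and nothing that would need
 * escaping inside a quoted string or bracket expression. */
int closer_is_supported(const char *closer)
{
    if (strlen(closer) < 2)
        return 0;
    if (strchr(closer + 1, closer[0]) != NULL)
        return 0;
    return strpbrk(closer, "\"\\]^-") == NULL;
}

/* l(k+1): the k characters after the opener, then the opener again. */
int format_loop_def(char *buf, size_t size, const char *closer, size_t k)
{
    assert(k + 1 < strlen(closer));
    if (k == 0)
        return snprintf(buf, size, "l1 [%c]", closer[0]);
    return snprintf(buf, size, "l%zu \"%.*s\"[%c]",
                    k + 1, (int)k, closer + 1, closer[0]);
}

/* n(k+1): the same k characters, then any character that neither
 * restarts the closer nor continues it. */
int format_next_def(char *buf, size_t size, const char *closer, size_t k)
{
    assert(k + 1 < strlen(closer));
    if (k == 0)
        return snprintf(buf, size, "n1 [^%c%c]", closer[0], closer[1]);
    return snprintf(buf, size, "n%zu \"%.*s\"[^%c%c]",
                    k + 1, (int)k, closer + 1, closer[0], closer[k + 1]);
}

/* Print every l and n definition for the given closer, one per line. */
void print_definitions(FILE *out, const char *closer)
{
    char line[256];
    assert(closer_is_supported(closer));
    size_t count = strlen(closer) - 1;

    for (size_t k = 0; k < count; ++k) {
        format_loop_def(line, sizeof line, closer, k);
        fprintf(out, "%s\n", line);
    }
    for (size_t k = 0; k < count; ++k) {
        format_next_def(line, sizeof line, closer, k);
        fprintf(out, "%s\n", line);
    }
}
```

Calling print_definitions(stdout, "{/literal}") should emit the l1 to l9 and n1 to n9 lines (without the column alignment used above); the loop, next, prefix, middle and suffix definitions would still be written by hand.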
Licensed under: CC-BY-SA with attribution