Flex: Handle a multi-char comment-like delimiter

Question

The text inside {literal} is split at { because your rule for { produces a separate match. If you don't want the text to be split, the rules inside the LITERAL start condition need to extend the current match rather than each starting a new one. This is a fairly common (f)lex idiom, and there is a feature designed specifically for this purpose: yymore:

yymore() tells the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of yytext rather than replacing it.

(from the flex manual.)

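Using that handy feature, we can write the following (assuming LITERAL has been declared as an exclusive start condition, for example with %x LITERAL in the definitions section):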

"{literal}"           {BEGIN(LITERAL);}
<LITERAL>[^{]+        {yymore();}
<LITERAL>"{"          {yymore();}
<LITERAL>"{/literal}" {
                        /* Now we have to provide the token, but we've matched
                         * 10 extra characters, the close marker, and so the
                         * token is the text from yytext with length yyleng-10.
                         * Here we just print it out, but normally we'd copy
                         * yytext to a temporary for future processing.
                         * Most compilers will optimize out the call to strlen.
                         */
                        BEGIN(INITIAL);
                        printf("TEXT: %.*s\n",
                               (int)(yyleng - strlen("{/literal}")),
                               yytext);
                      }

The above assumes that the LITERAL state is literally literal :), that is, that it is only terminated by the {/literal} tag and that the {/literal} tag is always recognized, regardless of context. However, the technique does not depend on that assumption; you could do more complex token recognition inside the literal scan, as long as you call yymore() in every action except the action for the closing tag.
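The comment in the closing-tag action mentions copying yytext to a temporary instead of printing it. That matters because yytext is only valid until the next rule is matched. A minimal C sketch of such a copy follows; copy_without_closer is a hypothetical helper name, not part of flex, and it assumes the closing marker is the last thing in the matched text:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a flex API): returns a malloc'd, NUL-terminated
 * copy of text[0 .. len - strlen(closer)), i.e. the matched text without
 * the trailing close marker. Returns NULL if allocation fails.
 * The caller owns the result and must free() it. */
char *copy_without_closer(const char *text, size_t len, const char *closer)
{
    size_t closer_len = strlen(closer);
    size_t body_len;
    char *copy;

    /* The match must be at least as long as the marker it ends with. */
    assert(text != NULL && len >= closer_len);

    body_len = len - closer_len;
    copy = malloc(body_len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, text, body_len);
    copy[body_len] = '\0';
    return copy;
}
```

Inside the closing-tag action this could be called as copy_without_closer(yytext, (size_t)yyleng, "{/literal}"), with the result handed to whatever consumes the token.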

If my assumption is correct, another solution is available: simply match the entire literal with a single regular expression. The expression would be easier to write with non-greedy matches (or even directly as a finite state machine), but unfortunately flex doesn't implement non-greedy matches, so it has to be done the long way, and it is truly long. Fortunately, the end marker starts with a character that appears nowhere else in the end marker, so the regular expression can be generated mechanically fairly easily. Here, I've used flex definitions to avoid a really long line and to make the pattern a little more apparent:

l1             [{]
l2          "/"[{]
l3         "/l"[{]
l4        "/li"[{]
l5       "/lit"[{]
l6      "/lite"[{]
l7     "/liter"[{]
l8    "/litera"[{]
l9   "/literal"[{]
loop [{]({l1}|{l2}|{l3}|{l4}|{l5}|{l6}|{l7}|{l8}|{l9})*

n1             [^{/]
n2          "/"[^{l]
n3         "/l"[^{i]
n4        "/li"[^{t]
n5       "/lit"[^{e]
n6      "/lite"[^{r]
n7     "/liter"[^{a]
n8    "/litera"[^{l]
n9   "/literal"[^{}]
next {n1}|{n2}|{n3}|{n4}|{n5}|{n6}|{n7}|{n8}|{n9}

prefix  "{literal}"
middle  ([^{]|{loop}{next})*
suffix  {loop}"/literal}"

literal {prefix}{middle}{suffix}

%%

{literal}  {
              /* The token includes both the {literal} opener and
               * the {/literal} closer, so we need to get rid of
               * both of them.
               */ 
              printf("TEXT: %.*s\n",
                     (int)(yyleng - strlen("{literal}") - strlen("{/literal}")),
                     yytext + strlen("{literal}"));
           }
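
The remark about writing the match "directly as a finite state machine" can be made concrete outside flex. The following is a minimal C sketch, not part of the scanner above: find_close_marker is a hypothetical helper name, and it assumes the text to search is already in memory. It relies on the same property the definitions exploit: { occurs only at the start of {/literal}, so on a mismatch the machine falls back to state 1 if the offending character was { and to state 0 otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a flex API): returns the offset of the first
 * "{/literal}" in text[0 .. len), or -1 if the marker does not occur.
 * The state is the number of marker characters matched so far. */
long find_close_marker(const char *text, size_t len)
{
    static const char marker[] = "{/literal}";
    const size_t marker_len = strlen(marker);
    size_t state = 0;
    size_t i;

    assert(text != NULL);

    for (i = 0; i < len; ++i) {
        if (text[i] == marker[state]) {
            /* One more marker character matched. */
            if (++state == marker_len)
                return (long)(i + 1 - marker_len);
        } else if (text[i] == '{') {
            /* A '{' always restarts a possible marker. */
            state = 1;
        } else {
            state = 0;
        }
    }
    return -1;
}
```

The text before the returned offset is the body of the literal, which is what the two flex solutions print as TEXT.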