Unless you explicitly set the %array option in your flex input file (which is not a good idea[1]), yytext does not have a fixed size. Normally, it is just the part of the input buffer which contains the token, which is why you must copy yytext before the lexer is called again if you need to preserve the token's string representation. (I suspect that your problems are the result of not doing this.)
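For what it's worth, here is a minimal sketch of the kind of copy I mean. The helper name dup_token is my own invention, not part of flex. In a flex action you would call it roughly as `yylval.str = dup_token(yytext, (size_t)yyleng);`, where yylval.str is a hypothetical member of your semantic value type.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of flex): return a heap-allocated,
 * NUL-terminated copy of the first len bytes of text.
 * Returns NULL if allocation fails. The caller owns the result
 * and must free() it.
 */
char *dup_token(const char *text, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

Passing the length explicitly rather than calling strlen means the copy still has the right size if the token contains an embedded NUL byte.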
Since flex generally reads the input in fixed-length pieces, it is possible for a token to span two or more buffers. In that case, flex needs to copy the first part of the token to the beginning of a buffer[2], possibly make the buffer bigger with realloc, and then read from the input to fill the rest of the buffer. This part of the flex logic is not optimized, on the basis that it is relatively infrequent; in particular, the entire current token is rescanned before proceeding with the next input character, which can produce a massive slowdown if you have large tokens and small input buffers.
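To make the refill step concrete, here is a simplified sketch of the idea. It is not flex's actual code: struct scan_buf and make_room are hypothetical names, and doubling the capacity is just an assumption of the sketch. It shows the partial token sliding to the front of the single buffer, and the buffer growing when the token alone fills it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a scanner's single input buffer. */
struct scan_buf {
    char  *data;  /* heap storage */
    size_t cap;   /* allocated size in bytes */
    size_t len;   /* bytes currently holding input */
};

/*
 * Make room for more input when the token starting at offset
 * tok_start is still unfinished at the end of the buffer.
 * Returns 0 on success, -1 if realloc fails.
 */
int make_room(struct scan_buf *b, size_t tok_start)
{
    size_t partial = b->len - tok_start;

    /* Slide the partial token to the front of the buffer.
     * Any older tokens stored there are overwritten. */
    memmove(b->data, b->data + tok_start, partial);
    b->len = partial;

    /* If the token alone fills the buffer, grow the buffer. */
    if (partial == b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;      /* pointers into the old storage may now dangle */
        b->cap = new_cap;
    }
    return 0;
}
```

After make_room returns, the caller would read new input into b->data + b->len. Either branch is enough to invalidate a pointer saved from an earlier token, which is the same hazard as keeping yytext around without copying it.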
As I said, the most common cause of apparent buffer corruption is a failure to copy yytext. You must do this if you need to keep the value of yytext:
before returning from yylex, and
before calling unput (if you use this feature).
Notes
[1] If you do specify %array then flex cannot expand the buffer and the maximum token size is slightly less than YYLMAX. By default, YYLMAX is about 8k but it's a macro and you can redefine it in your flex prologue. There is, however, no good reason to specify this option; all it does is slow down your scanner and limit the size of tokens. The option exists for compatibility with old versions of lex; some old software took some liberties with yytext which are not possible with flex.
[2] Actually, to the beginning of the buffer, since there is only one buffer. Any old tokens which were present in the buffer will be overwritten, which is another reason why yytext needs to be copied.