Question

I have a program that processes some sentences with LEX and YACC. I originally developed it on a Debian box with lex 2.5.35 and bison (GNU Bison) 2.5. I have migrated the code to a CentOS server where I have lex 2.5.35 and bison (GNU Bison) 2.4.1. On the development server everything was working perfectly.

I am seeing some strange behaviour when I receive long tokens (more than 1000 chars). Every char array has been defined long enough to support this, but when yytext gets this long string, other arrays are altered (I suspect a buffer overflow).

Does this make sense, or am I misunderstanding something?

What is the length of yytext? Can it be redefined?


Solution

Unless you explicitly set the %array option in your flex input file (which is not a good idea[1]), yytext does not have a fixed size. Normally, it is just the part of the input buffer which contains the token, which is why you must copy yytext before the lexer is called again if you need to preserve the token's string representation. (I suspect that your problems are the result of not doing this.)

Since flex generally reads the input in fixed-length pieces, it is possible for a token to span two or more buffers. In that case, flex needs to copy the first part of the token to the beginning of a buffer[2], possibly make the buffer bigger with realloc, and then read from the input to fill the rest of the buffer. This part of the flex logic is not optimized, on the basis that it is relatively infrequent; in particular, the entire current token is rescanned before proceeding with the next input character, which can produce a massive slowdown if you have large tokens and small input buffers.

As I said, the most common cause of apparent buffer corruption is a failure to copy yytext. If you need to keep the value of yytext, you must copy it:

  1. before returning from yylex, and

  2. before calling unput (if you use this feature).
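As a minimal sketch of such a copy (the helper name `save_token` is hypothetical and not part of flex), something like the following could be called from an action with `yytext` and `yyleng`. It copies exactly the token's bytes to the heap, so the result no longer depends on the scanner's buffer:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of flex: return a heap-allocated,
 * NUL-terminated copy of the first `len` bytes of `text`.
 * Intended to be called as save_token(yytext, yyleng) inside an action.
 * Returns NULL if allocation fails; the caller must free() the result. */
char *save_token(const char *text, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, len);   /* copy only the token's bytes */
    copy[len] = '\0';          /* terminate the private copy */
    return copy;
}
```

In a flex action this might be used as `yylval.str = save_token(yytext, yyleng); return WORD;`, assuming your `%union` has a `char *str` member and your grammar declares a `WORD` token (both names are assumptions for illustration). Because the size comes from `yyleng`, no fixed-length array is involved, however long the token is.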


Notes

[1] If you do specify %array then flex cannot expand the buffer and the maximum token size is slightly less than YYLMAX. By default, YYLMAX is about 8k but it's a macro and you can redefine it in your flex prologue. There is, however, no good reason to specify this option; all it does is slow down your scanner and limit the size of tokens. The option exists for compatibility with old versions of lex; some old software took some liberties with yytext which are not possible with flex.

[2] Actually, to the beginning of the buffer, since there is only one buffer. Any old tokens which were present in the buffer will be overwritten, which is another reason why yytext needs to be copied.
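To illustrate why a saved pointer goes stale, here is a small simulation in plain C. It does not use flex at all: `scan_buf` and `fake_next_token` are made-up stand-ins for the scanner's single buffer and for a call to yylex, under the assumption that each call overwrites the buffer as described above.

```c
#include <string.h>

/* Made-up stand-in for the scanner's single input buffer. */
static char scan_buf[32];

/* Made-up stand-in for one yylex() call: load the next token into the
 * shared buffer and return a pointer into it, the way yytext points
 * into flex's buffer. */
static const char *fake_next_token(const char *input)
{
    strncpy(scan_buf, input, sizeof scan_buf - 1);
    scan_buf[sizeof scan_buf - 1] = '\0';
    return scan_buf;
}

/* Returns 1 if the kept pointer changed underneath us while the
 * private copy survived; 0 otherwise. */
int alias_goes_stale(void)
{
    char copy[sizeof scan_buf];
    const char *alias = fake_next_token("first");

    strcpy(copy, alias);            /* private copy of the token text */
    fake_next_token("second");      /* the "scanner" reuses its buffer */

    return strcmp(alias, "second") == 0   /* alias now shows the new token */
        && strcmp(copy, "first") == 0;    /* the copy is intact */
}
```

The pointer that was kept ends up showing the later token, which looks exactly like another array being silently altered.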

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow