سؤال

I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.

In bison, one uses the %token declaration to define token names, for example

%token INTEGER
%token VARIABLE

and so forth. When bison is used on the grammar specification file, a header file y.tab.h is generated which has some define directives for each token:

#define INTEGER 258
#define VARIABLE 259

Finally, the lexer includes the header file y.tab.h by returning the right code for each token:

%{
#include "y.tab.h"
%}

%%

[a-z]   { 
            // some code
            return VARIABLE;
        }

[1-9][0-9]* {
            // some code
            return INTEGER;
        }

So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.

Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?

هل كانت مفيدة؟

المحلول

As with many things, it's just a historical accident. It certainly would have been possible for the token declarations to have been produced by lex (but see below). Or it would have been possible to force the user to write their own declarations.

It is more convenient for yacc/bison to produce the token numberings, though, because:

  • The terminals need to be parsed by yacc because they are explicit elements in the grammar productions. In lex, on the other hand, they are part of the unparsed actions and lex can generate code without any explicit knowledge about token values; and

  • yacc (and bison) produce parse tables which are indexed by terminal and non-terminal numbers; the logic of the tables require that terminals and non-terminals have distinct codes. lex has no way of knowing what the non-terminals are, so it can't generate appropriate codes.

The second argument is a bit weak, because in practice bison-generated parsers renumber token ids to fit them into the id-numbering scheme. Even so, this is only possible if bison is in charge of the actual numbers. (The reason for the renumbering is to make the id value contiguous; by another historical accident, it's normal to reserve codes 0 through 255 for single-character tokens, and 0 for EOF; however, not all the 8-bit codes are actually used by most scanners.)

نصائح أخرى

In the lexer, the tokens are only present in the return value: they are part of the target language (ie. C++), and lex itself knows nothing about them.

In the parser, on the other hand, tokens are part of the definition language: you write them in the actual parser definition, and not just in the target language. So yacc has to know about these tokens.

The ordering of the phases is not necessarily reflected in the architecture of the compiler. The scanner is the first phase and the parser the second, so in a sense data flows from the scanner to the parser, but in a typical Bison/Flex-generated compiler it is the parser that controls everything, and it is the parser that calls the lexer as a helper subroutine when it needs a new token as input in the parsing process.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top