Question

I want to pass the actual string of a token. If I have a token called ID, then I want my yacc file to actually know what ID is called. I think I have to pass a string using yylval to the yacc file from the flex file. How do I do that?

Solution

See the Flex manual section on Interfacing with YACC.

15 Interfacing with Yacc

One of the main uses of flex is as a companion to the yacc parser-generator. yacc parsers expect to call a routine named yylex() to find the next input token. The routine is supposed to return the type of the next token as well as putting any associated value in the global yylval. To use flex with yacc, one specifies the `-d' option to yacc to instruct it to generate the file y.tab.h containing definitions of all the %tokens appearing in the yacc input. This file is then included in the flex scanner. For example, if one of the tokens is TOK_NUMBER, part of the scanner might look like:

     %{
     #include "y.tab.h"
     %}

     %%

     [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;

OTHER TIPS

The key to returning a string or any complex type via yylval is the YYSTYPE union created by yacc in the y.tab.h file. The YYSTYPE is a union with a member for each type of token defined within the yacc source file. For example to return the string associated with a SYMBOL token in the yacc source file you declare this YYSTYPE union using %union in the yacc source file:

/*** Yacc's YYSTYPE Union ***/

/* The yacc parser maintains a stack (array) of token values while
   it is parsing.  This union defines all the possible values tokens
   may have.  Yacc creates a typedef of YYSTYPE for this union. All
   token types (see %type declarations below) are taken from
   the field names of this union.  The global variable yylval which lex
   uses to return token values is declared as a YYSTYPE union.
 */

    %union {
        long int4;              /* Constant integer value */
        float fp;               /* Constant floating point value */
        char *str;              /* Ptr to constant string (strings are malloc'd) */
        exprT expr;             /* Expression -  constant or address */
        operatorT *operatorP;   /* Pointer to run-time expression operator */
    };

%type <str> SYMBOL

Then in the LEX source file there is a pattern that matches the SYMBOL token. The code associated with that rule is responsible for returning the actual string that represents the SYMBOL. You can't just pass a pointer to the yytext buffer, because it is a static buffer that is reused for each token that is matched. To return the matched text, the contents of yytext must be copied onto the heap with _strdup() (the Microsoft C runtime's name for strdup()) and a pointer to the copy passed via yylval.str. The yacc rule that matches the SYMBOL token is then responsible for freeing the heap-allocated string when it is done with it.

[A-Za-z_][A-Za-z0-9_]*  {{
    int i;

    /*
    * condition letter followed by zero or more letters
    * digits or underscores
    *      Convert matched text to uppercase
    *      Search keyword table
    *      if found
    *          return <keyword>
    *      endif
    * 
    *      set lexical value string to matched text
    *      return <SYMBOL>
    */

    /*** KEYWORDS and SYMBOLS ***/
    /* Here we match a keyword or SYMBOL as a letter
    * followed by zero or more letters, digits or 
    * underscores.
    */

    /* Convert the matched input text to uppercase */
    _strupr(yytext);         /* Convert to uppercase */

    /* First we search the keyword table */
    for (i = 0; i<NITEMS(keytable); i++) {
        if (strcmp(keytable[i].name, yytext)==0)
            return (keytable[i].token);
    }

    /* Return a SYMBOL since we did not match a keyword */
    yylval.str=_strdup(yytext);
    return (SYMBOL);
}}
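
The same logic can be tried outside of lex as plain C. The following is only a minimal sketch under stated assumptions: text_buf stands in for yytext, lval_str stands in for yylval.str, the token codes and the two keywords are made up for illustration, and copy_string() and toupper() are used as portable substitutes for the Microsoft-specific _strdup() and _strupr().

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; in a real build these come from y.tab.h. */
enum { KW_IF = 258, KW_WHILE = 259, SYMBOL = 260 };

/* Hypothetical keyword table, mirroring keytable in the rule above. */
static const struct { const char *name; int token; } keytable[] = {
    { "IF", KW_IF },
    { "WHILE", KW_WHILE },
};
#define NITEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Stand-in for the scanner's reusable yytext buffer (an assumption). */
static char text_buf[64];

/* Stand-in for yylval.str (an assumption). */
static char *lval_str;

/* Portable substitute for _strdup(): copy s onto the heap. */
static char *copy_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* Mimics the rule's action: load the buffer, uppercase it,
   search the keyword table, otherwise hand back a heap copy. */
static int classify(const char *matched)
{
    size_t i;

    strncpy(text_buf, matched, sizeof text_buf - 1);
    text_buf[sizeof text_buf - 1] = '\0';

    for (i = 0; text_buf[i] != '\0'; i++)
        text_buf[i] = (char)toupper((unsigned char)text_buf[i]);

    for (i = 0; i < NITEMS(keytable); i++) {
        if (strcmp(keytable[i].name, text_buf) == 0)
            return keytable[i].token;
    }

    lval_str = copy_string(text_buf);   /* the consumer must free() this */
    return SYMBOL;
}
```

Because each SYMBOL gets its own heap copy, a pointer saved from one call stays valid after the buffer is overwritten by the next call, which is exactly what a saved pointer into text_buf would not do.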

Setting up the context

Syntax analysis (checking whether an input text follows a specified grammar) consists of two phases:

  1. tokenizing, which is done by tools like lex or flex through the interface yylex(), and
  2. parsing the stream of tokens generated in step 1 (as per a user-specified grammar), which is done by tools like bison/yacc through the interface yyparse().

During phase 1, each call to yylex() identifies one token (a character string) in the input stream, and yytext points to the first character of that string. For example, with the input "int x = 10;" and lex rules that tokenize according to the C language, the first 5 calls to yylex() identify the tokens "int", "x", "=", "10" and ";", and each time yytext points to the first character of the token just returned.

In phase 2, the parser (yacc, in your case) calls yylex() each time it needs a token and checks whether the resulting token sequence matches the rules of the grammar. Each call to yylex() returns the token as an integer code. In the previous example, the first 5 calls to yylex() might return TYPE, ID, EQ_OPERATOR, INTEGER and a code for the semicolon (the actual integer values are defined in a header file).

Now all the parser can see is those integer codes, which is not always enough. In the running example you may want to associate TYPE with int, ID with a symbol table pointer, and INTEGER with decimal 10. To make that possible, each token returned by yylex() is associated with another VALUE whose default type is int, but you can define custom types for it. In the lex environment this VALUE is accessed as yylval.

For example, again as per the running example, yylex may have the following rule to identify 10

[0-9]+   {  yylval.intval = atoi(yytext); return INTEGER; }

and following to identify x

[a-zA-Z][a-zA-Z0-9]*   {yylval.sym_tab_ptr = SYM_TABLE(yytext); return ID;}

Note that here I have defined the type of VALUE (that is, of yylval) as a union containing an int (intval) and an int* pointer (sym_tab_ptr).
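
To make the pairing of token code and VALUE concrete, here is a hand-written stand-in for yylex() in plain C. It is a sketch, not what lex generates: next_token(), union value, the SEMICOLON code and the numeric token values are all assumptions made for illustration, and the union carries a heap-copied name instead of a symbol table pointer to keep the example short.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; yacc -d would normally put these in y.tab.h. */
enum { END = 0, TYPE = 258, ID, EQ_OPERATOR, INTEGER, SEMICOLON };

/* Hypothetical semantic-value union, standing in for the type of yylval. */
union value {
    int   intval;   /* set for INTEGER */
    char *name;     /* set for TYPE and ID; heap copy, caller frees */
};

/* Copy the n characters starting at s into a fresh NUL-terminated string. */
static char *copy_n(const char *s, size_t n)
{
    char *p = malloc(n + 1);
    if (p != NULL) {
        memcpy(p, s, n);
        p[n] = '\0';
    }
    return p;
}

/* Stand-in for yylex(): returns the token code, stores the associated
   VALUE in *val, and advances *cursor past the token. */
static int next_token(const char **cursor, union value *val)
{
    const char *p = *cursor;
    const char *start;
    int code;

    while (isspace((unsigned char)*p))
        p++;
    start = p;

    if (*p == '\0') {
        code = END;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        val->intval = atoi(start);
        code = INTEGER;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p))
            p++;
        val->name = copy_n(start, (size_t)(p - start));
        code = (val->name != NULL && strcmp(val->name, "int") == 0) ? TYPE : ID;
    } else if (*p == '=') {
        p++;
        code = EQ_OPERATOR;
    } else if (*p == ';') {
        p++;
        code = SEMICOLON;
    } else {
        p++;
        code = -1;   /* unrecognized character */
    }

    *cursor = p;
    return code;
}
```

Calling next_token() repeatedly on "int x = 10;" yields one code per token, and only the codes that need extra information (TYPE, ID, INTEGER) fill in a union member.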

But in the yacc world, this VALUE is identified / accessed as $n. For example, consider the following yacc rule to identify a specific assignment statement

assignment : TYPE ID '=' INTEGER  { /* In this action part of the yacc rule, use $2 to get the symbol table pointer associated with ID, and use $4 to get decimal 10. */ } ;

Answering your question

If you want to access the yytext value of a certain token (which belongs to the lex world) in the yacc world, use that old friend VALUE as follows:

  1. Augment the union type of VALUE to add another field say char* lex_token_str
  2. In the lex rule, do yylval.lex_token_str = strdup(yytext)
  3. Then in yacc world access it using the appropriate $n.
  4. In case you want to access more than a single value of a token (for example, for the lex-identified token ID, the parser may want both the name and the symbol table pointer), augment the union type of VALUE with a structure member containing a char* (for the name) and an int* (for the symbol table pointer).
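
The steps above can be sketched in plain C as follows. This is an illustration under assumptions rather than generated lex/yacc code: union value plays the role of the %union, lval plays the role of yylval, and struct id_info, lexer_store_id(), parser_use_id() and symbol_slot are hypothetical names invented for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical pair for tokens that carry both a name and a table slot. */
struct id_info {
    char *name;          /* heap copy of the matched text */
    int  *sym_tab_ptr;   /* pointer into a symbol table */
};

/* Hypothetical semantic-value union, i.e. what %union would declare. */
union value {
    int            intval;
    int           *sym_tab_ptr;
    char          *lex_token_str;   /* step 1: the added string field */
    struct id_info id;              /* step 4: several values per token */
};

/* Stand-in for yylval (an assumption). */
static union value lval;

/* One-slot stand-in for a symbol table (an assumption). */
static int symbol_slot;

/* Step 2: what the lex action would do with the matched text. */
static void lexer_store_id(const char *matched_text)
{
    size_t n = strlen(matched_text) + 1;

    lval.id.name = malloc(n);
    if (lval.id.name != NULL)
        memcpy(lval.id.name, matched_text, n);
    lval.id.sym_tab_ptr = &symbol_slot;
}

/* Step 3: what the yacc action would do with its $n value. It uses the
   value, then frees the string it now owns. Returns the name's length. */
static size_t parser_use_id(union value v, int new_value)
{
    size_t len = (v.id.name != NULL) ? strlen(v.id.name) : 0;

    *v.id.sym_tab_ptr = new_value;
    free(v.id.name);
    return len;
}
```

The point of the design is ownership: the scanner side allocates the copy, and whichever parser action consumes the value is the one that frees it.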
Licensed under: CC-BY-SA with attribution