As the title says, which data type should a lexer return/give to the parser? When reading Wikipedia's article on lexical analysis, I came across this statement:

In computer science, lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an identified "meaning").

However, in complete contradiction to the above statement, when another question I asked on a different site (Code Review, if you are curious) was answered, the person answering stated that:

The lexer usually reads the string and converts this into a stream ... of lexemes. The lexemes only need to be a stream of numbers.

and he gave this visual:

nl_output => 256
output    => 257
<string>  => 258

Later on in the answer, he mentioned Flex, an already existing lexer generator, and said that writing 'rules' with it would be simpler than writing a lexer by hand. He proceeded to give me this example:

Space              [ \r\n\t]
QuotedString       \"[^\"]*\"
%%
nl_output          {return 256;}
output             {return 257;}
{QuotedString}     {return 258;}
{Space}            {/* Ignore */}
.                  {error("Unmatched character");}
%%

To further my insight and get more information, I read the Wikipedia article about Flex. The article showed that you could define a set of syntax rules, with tokens, in the following way:

digit         [0-9]
letter        [a-zA-Z]

%%
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"("                  { return LPAREN;     }
")"                  { return RPAREN;     }
";"                  { return SEMICOLON;  }
","                  { return COMMA;      }
"."                  { return PERIOD;     }
":="                 { return BECOMES;    }
"="                  { return EQL;        }
"<>"                 { return NEQ;        }
"<"                  { return LSS;        }
">"                  { return GTR;        }
"<="                 { return LEQ;        }
">="                 { return GEQ;        }
"begin"              { return BEGINSYM;   }
"call"               { return CALLSYM;    }
"const"              { return CONSTSYM;   }
"do"                 { return DOSYM;      }
"end"                { return ENDSYM;     }
"if"                 { return IFSYM;      }
"odd"                { return ODDSYM;     }
"procedure"          { return PROCSYM;    }
"then"               { return THENSYM;    }
"var"                { return VARSYM;     }
"while"              { return WHILESYM;   }

It seems to me that the Flex lexer is returning strings of keywords/tokens, but it could also be returning constants that are equal to certain numbers.

If the lexer were going to return numbers, how would it read string literals? Returning a number is fine for single keywords, but how would you deal with a string? Wouldn't the lexer have to convert the string to binary numbers, and then the parser would have to convert the numbers back into a string? It seems much more logical (and easier) for the lexer to return strings, and then let the parser convert any numeric string literals into actual numbers.

Or could the lexer possibly return both? I've been trying to write a simple lexer in C++, where a function can have only one return type, which is what led me to ask this question.

To condense my question into a paragraph: when writing a lexer, and assuming that it could only return one data type (strings or numbers), which would be the more logical choice?


Solution

Generally, if you're processing a language through lexing and parsing, you've got a definition of your lexical tokens, e.g.:

NUMBER ::= [0-9]+
ID     ::= [a-zA-Z]+, except for keywords
IF     ::= 'if'
LPAREN ::= '('
RPAREN ::= ')'
COMMA  ::= ','
LBRACE ::= '{'
RBRACE ::= '}'
SEMICOLON ::= ';'
...

and you have a grammar for the parser:

STATEMENT ::= IF LPAREN EXPR RPAREN STATEMENT
            | LBRACE STATEMENT RBRACE
            | EXPR SEMICOLON
EXPR      ::= ID
            | NUMBER
            | ID LPAREN EXPRS RPAREN
...

Your lexer takes the input stream and produces a stream of tokens. The stream of tokens is consumed by the parser to produce a parse tree. In some cases, just knowing the type of the token is enough (e.g., LPAREN, RBRACE, FOR), but in some cases, you'll need the actual value that is associated with the token. For instance, when you encounter an ID token, you'll want the actual characters that make up the ID later on, when you're trying to figure out which identifier you're referencing.

So, you typically have something more or less like this:

enum TokenType {
  NUMBER, ID, IF, LPAREN, RPAREN, ...;
}

class Token {
  TokenType type;
  String value;
}

So when the lexer returns a token, you know what type it is (which you need for the parsing), and the sequence of characters that it was generated from (which you'll need later on to interpret string and numeric literals, identifiers, etc.). It might feel like you're returning two values, since you're returning a very simple aggregate type, but you really do need both parts. After all, you'd want to treat the following programs differently:

if (2 > 0) {
  print("2 > 0");
}
if (0 > 2) {
  print("0 > 2");
}

These produce the same sequence of token types: IF, LPAREN, NUMBER, GREATER_THAN, NUMBER, RPAREN, LBRACE, ID, LPAREN, STRING, RPAREN, SEMICOLON, RBRACE. That means they parse the same, too. But when you're actually doing something with the parse tree, you'll care that the value of the first number is '2' (or '0') and that the value of the second number is '0' (or '2'), and that the value of the string is '2 > 0' (or '0 > 2').
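
Since the question asks about C++, where a function has exactly one return type, it is worth noting that the one type can simply be a small aggregate. A minimal C++ sketch of the Token idea above (the names are illustrative, not a fixed API):

#include <string>

// Illustrative set of token kinds; a real lexer would list every kind it produces.
enum class TokenType { Number, Id, If, Greater, LParen, RParen, LBrace, RBrace, String, Semicolon };

// The single return type: the kind plus the characters the token was lexed from.
struct Token {
    TokenType type;
    std::string value;
};

The parser branches on type, and later phases read value when they need the literal text of an identifier, number, or string.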

Other tips

As the title says, which data type should a lexer return/give to the parser?

"Token", obviously. A lexer produces a stream of tokens, so it should return a stream of tokens.

He mentioned Flex, an already existing lexer generator, and said that writing 'rules' with it would be simpler than writing a lexer by hand.

Machine-generated lexers have the advantage that you can generate them quickly, which is particularly handy if you think your lexical grammar is going to change a lot. They have the disadvantage that you often don't get a lot of flexibility in your implementation choices.

That said, who cares if it is "simpler"? Writing the lexer usually isn't the hard part!

When writing a lexer, and assuming that it could only return one data type (strings or numbers), which would be the more logical choice?

Neither. A lexer typically has a "next" operation that returns a token, so it should return a token. A token is not a string or a number. It's a token.

The last lexer I wrote was a "full fidelity" lexer, meaning that it returned a token that tracked the location of all whitespace and comments -- which we call "trivia" -- in the program, as well as the token. In my lexer a token was defined as:

  • An array of leading trivia
  • A token kind
  • A token width in characters
  • An array of trailing trivia

Trivia was defined as:

  • A trivia kind -- whitespace, newline, comment, and so on
  • A trivia width in characters

So if we had something like

    foo + /* comment */
/* another comment */ bar;

that would lex as four tokens with token kinds Identifier, Plus, Identifier, Semicolon, and widths 3, 1, 3, 1. The first identifier has leading trivia consisting of Whitespace with a width of 4 and trailing trivia Whitespace with width of 1. The Plus has no leading trivia and trailing trivia consisting of one whitespace, a comment and a newline. The final identifier has a leading trivia of a comment and a space, and so on.

With this scheme every character in the file gets accounted for in the output of the lexer, which is a handy property to have for things like syntax coloring.

Of course, if you don't need the trivia then you can simply make a token two things: the kind and the width.
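
A rough C++ sketch of that shape (the type and field names here are my assumptions, not the author's actual implementation):

#include <vector>

enum class TriviaKind { Whitespace, Newline, Comment };
enum class TokenKind  { Identifier, Plus, Semicolon };

// Trivia records what kind of non-token text appeared and how wide it was.
struct Trivia {
    TriviaKind kind;
    int width;               // width in characters
};

// A full-fidelity token: surrounding trivia plus the token's own kind and width.
// Note that neither structure stores an absolute position, only widths.
struct Token {
    std::vector<Trivia> leadingTrivia;
    TokenKind kind;
    int width;               // width in characters
    std::vector<Trivia> trailingTrivia;
};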

You might notice that the token and trivia contain only their widths, not their absolute position in the source code. That's deliberate. Such a scheme has advantages:

  • It is compact in memory and wire format
  • It enables re-lexing upon edits; this is useful if the lexer is running inside an IDE. That is, if you detect an edit in a token, you just back up your lexer to a couple tokens before the edit and start lexing again until you're sync'ed with the previous token stream. When you type a character the position of every token after that character changes, but usually only one or two tokens change in width, so you can re-use all that state.
  • The exact character offsets of every token can easily be derived by iterating over the token stream and keeping track of the current offset. Once you have the exact character offsets then it is easy to extract the text when necessary.

If you don't care about any of those scenarios, then a token could be represented as a kind and an offset, rather than a kind and a width.
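
To make the offset-derivation point from the third bullet concrete, here is a small sketch, using stand-in Token/Trivia shapes that are again my assumptions:

#include <cstddef>
#include <vector>

// Minimal stand-ins for the width-based token and trivia described above.
struct Trivia { int width; };
struct Token {
    std::vector<Trivia> leadingTrivia;
    int width;
    std::vector<Trivia> trailingTrivia;
};

// Walk the token stream once, accumulating widths to recover each token's
// absolute start offset in the source text.
std::vector<std::size_t> tokenOffsets(const std::vector<Token>& tokens) {
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    for (const Token& tok : tokens) {
        for (const Trivia& t : tok.leadingTrivia) pos += t.width;
        offsets.push_back(pos);      // offset of the token's first character
        pos += tok.width;
        for (const Trivia& t : tok.trailingTrivia) pos += t.width;
    }
    return offsets;
}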

But the key takeaway here is: programming is the art of making useful abstractions. You are manipulating tokens, so make a useful abstraction over tokens, and then you get to choose for yourself what implementation details underlie it.

Generally, you return a small structure which has a number signifying the token (or enum value for ease of use) and an optional value (string, or possibly generic/templated value). Another approach would be to return a derived type for elements that need to carry extra data. Both are mildly distasteful, but fine enough solutions to a practical problem.
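
As a brief sketch of the second option, the derived-type approach might look roughly like this in C++ (the class names are illustrative only):

#include <string>

enum class TokenKind { Plus, LParen, RParen, Identifier, StringLiteral };

// Base token: for punctuation and keywords, the kind alone is enough.
struct Token {
    TokenKind kind;
    virtual ~Token() = default;
};

// Derived token for elements that need to carry extra data, such as the
// characters of an identifier or the contents of a string literal.
struct ValueToken : Token {
    std::string value;
};

The parser checks kind and downcasts when it knows extra data is present, which is part of why both approaches are called mildly distasteful.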

Licensed under: CC-BY-SA with attribution