Question

I wish to understand how a parser works. I have learnt about LL, LR(0), and LR(1) parsing, and how to build NFAs, DFAs, parse tables, etc.

Now the problem is, I know that in some situations a lexer should extract tokens only on the parser's demand, when it's not possible to extract all the tokens in one separate pass. I don't exactly understand this kind of situation, so I'm open to any explanation about it.

The question now is: how should a lexer do its job? Should it base its recognition on the current "context", the current non-terminals supposed to be parsed? Is it something totally different? What about GLR parsing: is it another case where a lexer could try different terminals, or is it purely a syntactic business? I would also like to understand what this depends on; for example, is it related to the kind of parsing technique (LL, LR, etc.) or only to the grammar?

Thanks a lot

Solution

The simple answer is that lexeme extraction has to be done in context. What one might consider to be lexemes in the language may vary considerably in different parts of the language. For example, in COBOL, the data declaration section has 'PIC' strings and location-sensitive level numbers 01-99 that do not appear in the procedure section.

The lexer thus has to somehow know what part of the language is being processed, to know what lexemes to collect. This is often handled by having lexing states, each of which processes some subset of the entire language's set of lexemes (often with considerable overlap between the subsets; e.g., identifiers tend to be pretty similar in my experience). These states form a high-level finite state machine, with transitions between them when phase-changing lexemes are encountered, e.g., the keywords that indicate entry into the data declaration or procedure section of a COBOL program. Modern languages like Java and C# minimize the need for this, but most other languages I've encountered really need this kind of help in the lexer.
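
As an illustration (not our actual implementation), here is a minimal sketch of such a state-based lexer in Python for a toy COBOL-like language; the token patterns and the `DATA DIVISION.` / `PROCEDURE DIVISION.` trigger strings are simplified assumptions:

    import re

    # Each lexing state has its own token patterns; identifiers overlap,
    # but level numbers and 'PIC' strings exist only in the data state.
    RULES = {
        "data": [
            ("LEVEL", re.compile(r"\d\d")),
            ("PIC",   re.compile(r"PIC\s+[X9()V]+")),
            ("IDENT", re.compile(r"[A-Z][A-Z0-9-]*")),
        ],
        "procedure": [
            ("VERB",  re.compile(r"MOVE|ADD|DISPLAY")),
            ("IDENT", re.compile(r"[A-Z][A-Z0-9-]*")),
        ],
    }

    # Phase-changing lexemes drive the high-level state machine.
    TRANSITIONS = {"DATA DIVISION.": "data", "PROCEDURE DIVISION.": "procedure"}

    def tokenize(text):
        state, pos = "data", 0
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            # Check for a state-changing keyword first.
            for keyword, next_state in TRANSITIONS.items():
                if text.startswith(keyword, pos):
                    state = next_state
                    pos += len(keyword)
                    break
            else:
                # Otherwise try only the rules of the current state.
                for name, pattern in RULES[state]:
                    m = pattern.match(text, pos)
                    if m:
                        yield (name, m.group())
                        pos = m.end()
                        break
                else:
                    raise SyntaxError(f"bad input at {pos} in state {state}")

    for tok in tokenize("DATA DIVISION. 01 TOTAL PIC 9(5) PROCEDURE DIVISION. MOVE TOTAL"):
        print(tok)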

So-called "scannerless" parsers (you are thinking of "GLR") work by getting rid of the lexer entirely; now there's no need for the lexer to produce lexemes, and no need to track lexical states :-} Such parsers work by simply writing the grammar down to the level of individual characters; typically you find grammar rules that are the exact equivalent of what you'd write as a lexeme description. The question is then, why doesn't such a parser get confused as to which "lexeme" to produce? This is where the GLR part is useful. GLR parsers are happy to process many possible interpretations of the input ("locally ambiguous parses") as long as the choice eventually gets resolved. So what really happens in the case of "ambiguous tokens" is that the grammar rules for both "tokens" produce nonterminals for their respective "lexemes", and the GLR parser continues to parse until one of the parsing paths dies out or the parser terminates with an ambiguous parse.
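
Here is a deliberately tiny toy in Python (nothing like a real GLR implementation) just to show the pruning idea: both interpretations of the word `if` stay alive until the next piece of input contradicts one of them. The expectations are made-up assumptions:

    def resolve(word, next_tok):
        # Both interpretations of 'word' start out alive, each expecting
        # a different continuation (hypothetical toy expectations).
        hypotheses = [("conditional keyword", "("), ("identifier", "=")]
        # Prune the paths whose expectation is contradicted by the input;
        # a GLR parser does this over whole parse stacks, not single tokens.
        survivors = [label for label, expected in hypotheses if expected == next_tok]
        if len(survivors) == 1:
            return survivors[0]
        raise SyntaxError("no parse" if not survivors else "ambiguous parse")

    print(resolve("if", "("))  # -> 'conditional keyword'
    print(resolve("if", "="))  # -> 'identifier'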

My company builds lots of parsers for languages. We use GLR parsers because they are very nice for handling complex languages; write the context-free grammar and you have a parser. We use lexical-state based lexeme extractors with the usual regular-expression specification of lexemes and lexical-state-transitions triggered by certain lexemes. We could arguably build scannerless GLR parsers (by making our lexers produce single characters as tokens :) but we find the efficiency of the state-based lexers to be worth the extra trouble.

As practical extensions, our lexers actually use push-down-stack automata for the high-level state machine rather than mere finite state machines. This helps when one has a high-level FSA whose substates are identical, and where it is helpful for the lexer to track nested structures (e.g., matching parentheses) in order to manage a mode switch (e.g., when the parentheses have all been matched).
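
A minimal sketch of the idea in Python, assuming a hypothetical `inner` mode entered at `(` and left when all parentheses have been matched:

    class ModeStack:
        def __init__(self):
            self.stack = ["normal"]

        @property
        def mode(self):
            return self.stack[-1]

        def on_lexeme(self, lexeme):
            if lexeme == "(":
                # Entering a nested region: push the inner mode, so
                # identical substates need not be duplicated in a flat FSA.
                self.stack.append("inner")
            elif lexeme == ")":
                self.stack.pop()  # leaving one nesting level

    modes = ModeStack()
    for ch in "( ( ) )".split():
        modes.on_lexeme(ch)
        print(ch, "->", modes.mode)
    # After the final ')', the mode is back to 'normal': the mode switch
    # happens exactly when the parentheses have all been matched.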

A unique feature of our lexers: we also do a tiny bit of what scannerless parsers do: sometimes when a keyword is recognized, our lexers will inject both a keyword and an identifier into the parser (this simulates a scannerless parser with a grammar rule for each). The parser will of course only accept what it wants "in context" and simply throw away the wrong alternative. This gives us an easy way to handle keywords-in-context that are otherwise interpreted as identifiers, which occur in many, many languages.
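
A sketch of the dual-injection trick in Python, assuming a hypothetical parser interface that can be offered several candidate tokens and keeps whichever fits its current state (as a GLR parser can); the keyword set is made up:

    KEYWORDS = {"select", "from"}  # e.g., SQL-like keywords-in-context

    def candidates(lexeme):
        # For a word that spells a keyword, inject BOTH interpretations;
        # the parser keeps the one that works in context and simply
        # throws away the wrong alternative.
        if lexeme in KEYWORDS:
            return [("KEYWORD", lexeme), ("IDENT", lexeme)]
        return [("IDENT", lexeme)]

    print(candidates("select"))  # [('KEYWORD', 'select'), ('IDENT', 'select')]
    print(candidates("price"))   # [('IDENT', 'price')]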

OTHER TIPS

Ideally, the tokens themselves should be unambiguous; you should always be able to tokenise an input stream without the parser doing any additional work.

This isn't always so simple, so you have some tools to help you out:

  1. Start conditions

    A lexer action can change the scanner's start condition, meaning it can activate different sets of rules.

    A typical example of this is string literal lexing: when you lex a string literal, the tokenising rules usually become completely different from those of the language containing it. This is an example of an exclusive start condition.

    You can separate ambiguous lexings if you can identify two separate start conditions for them and ensure the lexer enters them appropriately, given some preceding context (a sketch follows after this list).

  2. Lexical tie-ins

    This is a fancy name for carrying state in the lexer, and modifying it in the parser. If a certain action in your parser gets executed, it modifies some state in the lexer, which results in lexer actions returning different tokens. This should be avoided where possible, because it makes your lexer and parser both more difficult to reason about, and makes some things (like GLR parsers) impossible.

    The upside is that you can do things that would require significant grammar changes with relatively minor impact on the code; you can use information from the parse to influence the behaviour of the lexer, which in turn can go some way toward solving what you see as an "ambiguous" grammar (see the second sketch after this list).

  3. Logic, reasoning

    It's probably possible to lex your input in a single pass, and the above tools should come second to thinking about how you should be tokenising the input and trying to convert that into the language of lexical analysis. :)

    The fact is, your input is composed of tokens—whether you like it or not!—and all you need to do is find a way to make a program understand the rules you already know.
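
To make start conditions (tool 1 above) concrete, here is a minimal sketch in Python rather than in a lexer generator: the opening double quote switches the lexer into an exclusive STRING condition with a completely different rule set, and the closing quote switches it back. The rules and escape handling are toy assumptions, with no error handling:

    import re

    def tokenize(text):
        pos, condition, buf = 0, "INITIAL", []
        while pos < len(text):
            ch = text[pos]
            if condition == "INITIAL":
                if ch == '"':
                    condition = "STRING"   # enter the exclusive condition
                    buf = []
                    pos += 1
                elif ch.isspace():
                    pos += 1
                else:
                    m = re.match(r"[A-Za-z_]\w*|\d+|\S", text[pos:])
                    yield ("TOKEN", m.group())
                    pos += m.end()
            else:  # STRING: a completely different rule set applies
                if ch == "\\":             # toy escape: take next char literally
                    buf.append(text[pos + 1])
                    pos += 2
                elif ch == '"':
                    condition = "INITIAL"  # leave the exclusive condition
                    yield ("STRING", "".join(buf))
                    pos += 1
                else:
                    buf.append(ch)
                    pos += 1

    print(list(tokenize('say "hi \\" there" now')))
    # -> [('TOKEN', 'say'), ('STRING', 'hi " there'), ('TOKEN', 'now')]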
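
And for lexical tie-ins (tool 2 above), a sketch of the classic C-style typedef problem, assuming a hypothetical shared `type_names` set that a parser action mutates:

    class Lexer:
        def __init__(self):
            self.type_names = set()   # shared state the parser mutates

        def token_for(self, word):
            # The same spelling yields a different token once the
            # parser has tied back in.
            if word in self.type_names:
                return ("TYPE_NAME", word)
            return ("IDENT", word)

    lexer = Lexer()
    print(lexer.token_for("foo"))    # ('IDENT', 'foo')
    # ... the parser recognizes 'typedef int foo;' and ties back in:
    lexer.type_names.add("foo")
    print(lexer.token_for("foo"))    # ('TYPE_NAME', 'foo')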

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow