Recursive Descent Parser with Indentation and Backtracking

https://stackoverflow.com//questions/20006049

20-12-2019
|

Question

I have been trying to find an Recursive Descent Parser Algorithm that is also suited for indentation with backtracking. But I keep having myself finding troublesome solutions for this.

Are there any resources out there that also deal with indentation?

Thanks

Solution

Based on your question I'm assuming that you're writing your own recursive descent parser for an indentation-sensitive language.

I've experimented with indentation-based languages before, and I solved the problem by having a state that keeps track of the current indentation level and two different terminals that match indentation. Both of them match indentation units (say two spaces or a tab) and count them. Let's call the matched indentation level matched_indentation and the current indentation level expected_indentation.

For the first one, let's call it indent:

if matched_indentation < expected_indentation, this is a dedent, and the match is a failure.
if matched_indentation == expected_indentation, the match is a success. The matcher consumes the indentation.
if matched_indentation > expected_indentation, you have a syntax error (indentation out of nowhere) and should handle it as such (throw an exception or something).

For the second one, let's call it dedent:

if matched_indentation < expected_indentation, the match is successful. You reduce expected_indentation by one, but you don't consume the input. This is so that you can chain multiple dedent terminals to close multiple scopes.
if matched_indentation == expected_indentation, the match is successful, and this time you do consume the input (this is the last dedent terminal, all scopes are closed).

if matched_indentation > expected_indentation, the match simply fails, you don't have a dedent here.

Those terminals and non-terminals after which you expect an increase in indentation should increase expected_indentation by one.

Let's say that you want to implement a python-like if statement (I'll use EBNF-like notation), it would look something like this:

indented_statement : indent statement newline;    

if_statement : 'if' condition ':' newline indented_statement+ dedent ;

Now let's look at the following piece of code, and also assume that an if_statement is a part of your statement rule:

1|if cond1:                 <- expected_indentation = 0, matched_indentation = 0
2|  if cond2:               <- expected_indentation = 1, matched_indentation = 1
3|    statement1            <- expected_indentation = 2, matched_indentation = 2
4|    statement2            <- expected_indentation = 2, matched_indentation = 2
5|                          <- expected_indentation = 2, matched_indentation = 0

On the first four lines you'll successfully match an indent terminal
On the last line, you'll match two dedent terminals, closing both the scopes, and resulting with expected_indentation = 0

One thing you should be careful of is where you put your indent and dedent terminals. In this case, we don't need one in the if_statement rule because it is a statement, and indented_statement already expects an indent.

Also mind how you treat newlines. One choice is to consume them as a sort of statement terminator, another is to have them precede the indentation, so choose whichever suits you best.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow