Recursive Descent Parser with Indentation and Backtracking
-
20-12-2019 - |
Question
I have been trying to find an Recursive Descent Parser Algorithm that is also suited for indentation with backtracking. But I keep having myself finding troublesome solutions for this.
Are there any resources out there that also deal with indentation?
Thanks
Solution
Based on your question I'm assuming that you're writing your own recursive descent parser for an indentation-sensitive language.
I've experimented with indentation-based languages before, and I solved the problem by having a state that keeps track of the current indentation level and two different terminals that match indentation. Both of them match indentation units (say two spaces or a tab) and count them. Let's call the matched indentation level matched_indentation
and the current indentation level expected_indentation
.
For the first one, let's call it indent
:
- if
matched_indentation < expected_indentation
, this is adedent
, and the match is a failure. - if
matched_indentation == expected_indentation
, the match is a success. The matcher consumes the indentation. - if
matched_indentation > expected_indentation
, you have a syntax error (indentation out of nowhere) and should handle it as such (throw an exception or something).
For the second one, let's call it dedent
:
if
matched_indentation < expected_indentation
, the match is successful. You reduceexpected_indentation
by one, but you don't consume the input. This is so that you can chain multiplededent
terminals to close multiple scopes.if
matched_indentation == expected_indentation
, the match is successful, and this time you do consume the input (this is the lastdedent
terminal, all scopes are closed).if
matched_indentation > expected_indentation
, the match simply fails, you don't have adedent
here.
Those terminals and non-terminals after which you expect an increase in indentation should increase expected_indentation
by one.
Let's say that you want to implement a python-like if statement (I'll use EBNF-like notation), it would look something like this:
indented_statement : indent statement newline;
if_statement : 'if' condition ':' newline indented_statement+ dedent ;
Now let's look at the following piece of code, and also assume that an if_statement
is a part of your statement
rule:
1|if cond1: <- expected_indentation = 0, matched_indentation = 0
2| if cond2: <- expected_indentation = 1, matched_indentation = 1
3| statement1 <- expected_indentation = 2, matched_indentation = 2
4| statement2 <- expected_indentation = 2, matched_indentation = 2
5| <- expected_indentation = 2, matched_indentation = 0
- On the first four lines you'll successfully match an
indent
terminal - On the last line, you'll match two
dedent
terminals, closing both the scopes, and resulting withexpected_indentation = 0
One thing you should be careful of is where you put your indent
and dedent
terminals. In this case, we don't need one in the if_statement
rule because it is a statement
, and indented_statement
already expects an indent.
Also mind how you treat newlines. One choice is to consume them as a sort of statement terminator, another is to have them precede the indentation, so choose whichever suits you best.