Object oriented parsing: Is there a pattern or is my approach wrong?

https://softwareengineering.stackexchange.com/questions/396891

01-03-2021
|

Question

The problem that I am chewing on comes from parsing, i.e. constructing objects in a sequential manner. The grammar is not prefix free, that is, there are more than one syntactical elements sharing the same sequence of terminals at the beginning. To condense it to a simple example, lets say we have

 foo = "bar" ";"
 foo_special = foo "baz" ";"

The implementation language is object oriented in the C++ fashion, that is, a compile-time defined class definition with data elements.

Moreover the problem domain at first sight (that is, how I view it currently) makes it very natural and concise to handle non-terminals as objects of their respective classes and foo_special a specialisation of foo. My naive mind therefore yelled inheritance! and I implemented it accordingly. You may already guess where all this leads to:

When I see "bar" in the parsing stream, I have no idea if I should construct a foo or foo_special object.

Put differently, what is the generally preferred solution in a situation where you can't immediately decide on the object to construct but the possible alternatives share enough similarities to make code sharing via hierarchy attractive?

Solution

The alternatives depend on the general structure of your parser (and lexer) and what the consequences and costs are if you guess wrong about dealing with a foo or a foo_special object.

As a first alternative, you can drop the inheritance of foo_special and foo nd instead model their relation more along the structure of your grammar: foo_special contains a foo. This means you can unambiguously create a foo when you see "bar" in the token stream, and a foo_special when you see "baz" after just having created a foo.

If you want to keep the inheritance for some reason, you could also start out with creating a foo and replacing the non-terminal object with a foo_special object when "baz" turns up in the token stream.

Those techniques would not work if your grammar was slightly different, for example

foo = "bar" ";"
foo_special = "bar" "baz" ";"

Then you would turn to techniques like an N-token lookahead (with N being at least as large as the largest common prefix) or error recovery.

In the N-token lookahead you read N tokens from the lexer without consuming them in order to choose between foo and foo_special.

If you use error recovery, you choose one option and try to parse that. If you run into a parsing error, you rollback the parse tree until the decision point and try again with the next option. If you ran out of options, you report an error to the user.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange