Add error checking via production rules to LALR(1) grammar to handle all inputs

https://stackoverflow.com/questions/11940733

26-06-2021
|

Pergunta

I have a grammar that represents expressions. Let's say for simplicity it's:

S -> E
E -> T + E | T
T -> P * T | P
P -> a | (E)

With a, +, *, ( and ) being the letters in my alphabet.

The above rules can generate valid arithmetic expressions containing parenthesis, multiplication and addition using proper order of operations and associativity.

My goal is to accept every string, containing 0 or more of the letters of my alphabet. Here are my constraints:

The grammar must "accept" all strings contained 0 or more letters of my alphabet.
New terminals may be introduced and inserted into the input algorithmically. (I found that explicitly providing an EOF terminal helped when detecting extra tokens beyond an otherwise valid expression, for some reason.)
New production rules may be introduced. The new rules will be error flags (i.e. if the string is parsed using one, then although the string is accepted, it is considered to semantically be an error).
The production rules may be modified so that newly-introduced terminals are inserted.
The grammar should be LALR(1) at least handle-able by Lex/Yacc-like tools (If I recall the dangling else problem requires non-LALR(1), in spite of being compatible with Lex/Yacc).

Additionally I would like the extra rules to correspond to different kinds of errors (missing arguments to a binary operation, missing parenthesis - left or right - extra token(s) beyond an otherwise accept-able expression, etc.). I say that because there may be some trivial way to answer my question to "accept" all inputs that otherwise wouldn't be beneficial for error reporting.

I've found these rules to be useful, although I do not know if they violate my constraints, or not:

P -> @      [error]
P -> (E     [error]
S -> E $    [instead of S -> E]
S -> E X $  [error]
X -> X Y    [error]
X -> Y      [error]
Y -> a      [error]
Y -> (      [error]
Y -> )      [error]
Y -> *      [error]
Y -> +      [error]

where $ is the explicit EOF token and @ is the empty string.

In case my question wasn't clear: How can I modify my grammar within my constraints to achieve my goal of accepting all inputs, preferably with a nice correspondence of rules to types of errors? Do my rules meet my goal?

Solução

This question has been around for a while and covers a topic that is is often visited by beginners in the subject. One often find that those who have done a compilers course in their undergraduate degree know that this is one of those questions that has no easy or single answer. You might have noticed that you have two questions on the same topic, neither of which has been answered. Another question someone else posted was answered with pointers to the literature that explains why this is hard.

It is a question that has remained extant for over 50 years. If one examines the literature over time, from early conference papers, course textbooks, doctoral thesis and (today) SO, we can see regular references to the fact that this is the wrong question! (Or rather the wrong approach to the problem).

Just taking a sample from course texts over the years (random selections from my bookshelf):

Gries, D. (1970) Error Recovery and Correction - An introduction to the Literature, In Compiler Construction, An advanced Course edited by Bauer, F.L. & Eickel, J., Springer Verlag, pp.627-638.
Gries, D. (1971) Compiler Construction for Digital Computers, Wiley, pp.320-326.
Aho, A.V., Ullman, J.D. (1977) Principles of Compiler Design, Addison Wesley, pp.397-405.
Bornat, R. (1979) Understanding and Writing Compilers, Macmillan, pp.251-252.
Hanson, D. (1995) A retargetable C compiler: Design and Implementation,Addison-Wesley, pp.140-146.
Grune, D., Bal, H.E., Jacobs, C.J.H. & Langendoen, K.G. (2000) Modern Compiler Design, Wiley, pp.175-184.
Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D. (2007) Compilers: Principles, Techniques, & Tools, Pearson, Addison-Wesley, pp.283-296.

All of these (over a 40 year span) concur that your question is about using the wrong tools wrongly or going in the wrong direction. I think I'm trying to say "You can't there from here". You should start from somewhere else.

If you want something deeper, there is a whole Phd thesis:

Charles, P. (1991) A Practical method for Constructing Efficient LALR(k) Parsers with Automatic Error Recovery, New York University

Hopefully, for those who visit this question again in the future, there is a placeholder for the answers.

I note from comments on your other question that you are using MPPG, derived from CPPG. Not everyone will have used these, so I'm including a couple of links to these tools:

Managed Babel Systems Essentials
Garden Points Parser Generator
Irony .NET compiler Construction Kit
Writing your first Visual Studio Language Service

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow