Grammar: start: (a b)? a c; Input: a d. Which error correct at position 2? 1. expected “b”, “c”. OR expected “c”

https://softwareengineering.stackexchange.com/questions/223010

01-10-2020
|

Question

I am writing a parser (a parser generator for PEGs) and am trying to figure out how to report errors.

Grammar:

rule: (a b)? a c ;

Input:

a d

Question: Which error message correct at position 2 for given input?

expected "b", "c".
expected "c".

So should it be taken into account that "b" was also expected at position?

The 1st error (expected "b", "c") wants to say that the input a b was expected but because it optional it is only a possibility, not a requirement.

I don't know possible is the same as expected or not?

Which error message better and correct: The 1st or the 2nd version?

Thanks for answers.

Second Part

In first case I define marker of testing as limit of position.

if(_inputPos > testing) {
  _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}

Limit moved in optional expressions:

OPTIONAL_EXPRESSION:
testing = _inputPos;

The "b" expression move _inputPos above the testing pos and add failure at _inputPos.

In second case I can define marker of testing as boolean flag.

if(!testing) {
  _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}

The "b" expression in this case not add failure because it tested (inner for optional expression).

What you think what is better and correct?

Testing defined as specific position and if expression above this position (_inputPos > testing) it add failure (even it inside optional expression).
Testing defined as flag and if this flag set that the failures not takes into account. After executing optional expression it restore (not reset!) previous value of testing (true or false).

Also failures not takes into account if rule not fails. They only reported if parsing fails.

Third Part

This question raised because it related to two different problems.

First problem:

A Parsing Expression Grammar (PEG) describes input in only three atomic items:

terminal symbol
nonterminal symbol
empty string

The grammar doesn't use a lexer, hence the concept of a “token” does not apply here.

Second problem:

What is a grammar? Can two grammars be considered equal if they accept the same input but produce different result?

Assume we have two grammars:

Grammar 1

rule <- type? identifier

Grammar 2

rule <- type identifier / identifier

They both accept the same input but produce (in PEG) different result.

Grammar 1 results:

{type: type, identifier: identifier}
{type: null, identifier: identifier}

Grammar 2 results:

{type: type, identifier: identifier}
{identifier: identifier}

Questions:

Are both grammars equal?
It is painless to do optimization of grammars?

My answer on both questions is negative. No equal, Not painless.

But you may ask. "But why this happens?".

I can answer to you. "Because this is not a problem. This is a feature".

In a PEG grammar each rule always consists from these parts:

ordered choice
- sequence
  - expression

And this explanation is the my answer on question "But why this happens?".

Another problem.

As a PEG does not have tokens, whitespace cannot be silently ignored in the lexer between tokens. Therefore whitespace has to be considered inside the grammar. Now look at this grammar (in short):

program    <- WHITESPACE expr EOF
expr       <- ruleX
ruleX      <- 'X' WHITESPACE
WHITESPACE <- ' '?
EOF        <- ! .

The first WHITESPACE is at the beginning, other `WHITESPACE (often) occurs at the end of a rule.

In this case in PEG optional WHITESPACE must be assumed as expected.

But WHITESPACE does not only mean space (\x20) but also other characters like [\t\n\r] and even comments.

But the main rule of error messages is the following.

If it is not possible to display all expected elements (or not possible to display at least one from all set of expected elements) then is more correct to not display anything.

More precisely required to display "unexpected" error mesage.

So how would I display in a PEG's error message that WHITESPACE was expected?

Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n' , 'r'

What about start characters of comments? They also may be part of WHITESPACE in some grammars.

In this case optional WHITESPACE will be reject all other potential expected elements because not possible correctly to display WHITESPACE in error message because WHITESPACE is too complex to display.

Is this good or bad?

I think this is not bad and required some tricks to hide this nature of PEG parsers.

And in my PEG parser I not assume that the inner expression at first position of optional (optional & zero_or_more) expression must be treated as expected. But all other inner (except at the first position) must treated as expected.

Example 1:

List<int list; // type? ident

Here "List" is not at the first position in optional "type?".

This failure is taken into account and reported as "expected '>'". This is because we don't skip type but enter into type and after really optional "List" we move position from first to next real "expected" (that already outside of testing position) element.

"List" was in "testing" position.

If inner expression (inside optional expression) "fits in the limitation" not continue at next position then it not assumed as the expected input.

From this assumption has been asked main question.

You must just take into account that we are talking about PEG parsers and their error messages.

Solution

In order to more easily work with the grammar, we will rewrite your rule from rule: (a b)? a c to

rule := "a" "b" "a" "c"
      | "a" "c"

The common prefix can be factored out, so we get:

rule  := "a" rule2
rule2 := "b" "a" "c"
       | "c"

(Grammar A)

However, we can also interpret the original rule as the equivalent of

 rule  := rule2 "a" "c"
 rule2 := "a" "b"
        | SUCCEED

(Grammar B)

where SUCCEED is a pseudo-rule that always succeeds and does not advance the position. Note that these three grammars are absolutely equivalent in terms of the languages which they can recognize.

I will assume that these transformations would be applied automatically. We can now watch how each desugared grammar matches the input a d.

Grammar A (recursive descent with backtracking):
1. Position: |a d, Stack: rule, Rule: "a" rule2.
  Token "a" is recognized and the parser advances.
2. Position: a|d, Stack: rule, Rule: rule2.
  Recurse into rule2.
3. Position: a|d, Stack: rule, rule2, Rule: "b" "a" "c".
  This alternative fails because token "b" doesn't occur in the input.
4. Position: a|d, Stack: rule, rule2, Rule: "c". This alternative fails because token "c" does not occur in the input.
5. Position: a|d, Stack: rule, rule2.
  rule2 fails because neither "b" nor "c" occurs in the input
6. Position: a|d, Stack: rule. rule fails because rule2 fails because neither "b" nor "c" occurs in the input.
Grammar B (recursive descent with backtracking):
1. Position: |a d, Stack: rule, Rule: rule2 "a" "c".
  Recurse into rule2.
2. Position: |a d, Stack: rule, rule2, Rule: "a" "b".
  Token "a" is recognized and the parser advances.
3. Position: a|d, Stack; rule, rule2, Rule: "b".
  This alternative fails because token "b" doesn't occur in the input. Backtrack.
4. Position: |a d, Stack; rule, rule2, Rule: SUCCEED.
  This alternative suceeds. Forget previous failure.
5. Position: |a d, Stack; rule, rule2.
  rule2 suceeds
6. Position: |a d, Stack: rule, Rule: "a" "c".
  Token "a" is recognized and the parser advances.
7. Position: a|d, Stack: rule, Rule: "c". rule fails because token "c" does not occur in the input.

So depending on how you implement and simplify the grammar, and how your error reporting works, you end up with different reasons for failure, but both are equivalent (at the same input position) and correct.

Now if both are equivalent, we might want to pick the “better” one that provides more information. In this case, this would be a rule rewriting as in Grammar A, or a parsing algorithm other than recursive descent:

Earley parsing offers a lot more “situational awareness”.
An LR parser (bottom-up) would also work here and provide more useful error messages.

The drawback is that those are more difficult to do yourself, and rewriting to Grammar B (which is how a hand-written RecDesc usually works implicitly) is simpler to implement.

A note on rule rewriting: This assumes that the grammar is specified as some data structure, and not as hard-coded control flow. Rule rewriting is useful because it can eliminate certain undesirable features of a grammar (e.g. left recursion in top-down parsers).

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange