Question

I'm working on a small compiler in order to get a greater appreciation of the difficulties of creating one's own language. Right now I'm at the stage of adding pointer functionality to my grammar but I got a reduce/reduce conflict by doing it.

Here is a simplified version of my grammar that is compilable by bnfc. I use the happy parser generator and that's the program telling me there is a reduce/reduce conflict.

entrypoints Stmt ;

-- Statements
-------------
SDecl. Stmt ::= Type Ident; -- ex: "int my_var;"
SExpr. Stmt ::= Expr;       -- ex: "printInt(123); "

-- Types
-------------
TInt.      Type ::= "int" ;
TPointer.  Type ::= Type "*" ;
TAlias.    Type ::= Ident ; -- This is how I implement typedefs

-- Expressions
--------------
EMult.     Expr1 ::= Expr1 "*" Expr2 ;
ELitInt.   Expr2 ::= Integer ;
EVariable. Expr2 ::= Ident ;

-- and the standard corecions
_.         Expr  ::= Expr1 ;
_.         Expr1 ::= Expr2 ;

I'm in a learning stage of how grammars work. But I think I know what happens. Consider these two programs

main(){
  int a;
  int b;
  a * b;
}

and

typedef int my_type;
main(){
  my_type * my_type_pointer_variable;
}

(The typedef and main(){} part isn't relevant and in my grammar. But they give some context)

In the first program I wish it would parse a "*" b as Stmt ==(SExpr)==> Expr ==(EMult)==> Expr * Expr ==(..)==> Ident "*" Ident, that is to essentially start stepping using the SExpr rule.

At the same time I would like my_type * my_type_pointer_variable to be expanded using the rules. Stmt ==(SDecl)==> Type Ident ==(TPointer)==> Type "*" Ident ==(TAlias)==> Ident "*" Ident.

But the grammar stage have no idea if an identifier originally is a type alias or a variable.


(1) How can I get rid of the reduce/reduce conflict and (2) am I the only one having this issue? Is there an obvious solution and how does the c grammar resolve this issue?

So far I have successfully just been able to change the syntax of my language by using "&" or some other symbol instead of "*", but that's very undesirable. Also I cannot make sense from various public c grammars and tried to see why they don't have this issue but I have had no luck in this.

And last, how do I resolve issues like these on my own? All I understood from happys more verbose output is how the conflict happens, is cleverness the only way to work around these conflicts? I'm afraid I'll stumble on even more issues for example when introducing EIndir. Expr = '*' Expr;

Was it helpful?

Solution

The usual way this problem is dealt with in C parsers is something generally called "the lexer feedback hack". Its a 'hack' in the sense that it doesn't deal with it in the grammar at all; instead, when the lexer recognizes an identifier, it classifies that identifier as either a typename or a non-typename, and returns a different token for each case (usually designated 'TypeIdent' for an identifier that is a typename and simply 'Ident' for any other). The lexer makes this selection by looking at the current state of the symbol table, so it sees all the typedefs that have occurred prior to the current point in the parse, but not typedefs that are after the current point. This is why C requires that you declare typedefs before their first use in each compilation unit.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top