Question

My day job includes working to develop a Pascal-like compiler. I've been working all along on optimizations and code generation.

I would also like to start learning to build a simple parser for the same language. I'm however, not really sure how to go about this. Flex and Bison seem to be the choice. But, isn't it possible to write a parser using C++ or C#? I'm a bit creepy with C.

Yacc++ supports C#, but it's a licensed one. I'm looking for all the help that I can find in this regard. Suggestions would be highly appreciated.

Was it helpful?

Solution

I believe you can use ANTLR with C#. I've never tried it myself (yet), however there is a tutorial here that might point you in the right direction.

OTHER TIPS

Personally, I roll my own lexer and parser (LL). Here's a very-abbreviated example. It is in C++, but hopefully you can adapt it. It makes use of a macro PARSE_HIGHER to make it easy to insert operators at different precedence levels without much code changing.

 // routine to scan over whitespace/comments
void ScanWhite(const char* &pc){
  while(true){
    if(0);
    else if (WHITESPACE(*pc)) pc++;
    else if (pc[0]=='/' && pc[1]=='/'){
      while(*pc && *pc++ != '\n');
    }
    else break;
  }
}
// routine to lex an identifier
bool SeeId(const char* &pc, string sId){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (alpha(*pc)){
    sId = "";
    while(alphanum(*pc)) sId += (*pc++);
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex a number
bool SeeNum(const char* &pc, double &dNum){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (digit(*pc)){
    dNum = 0;
    while(digit(*pc)) dNum = dNum * 10 + (*pc++ - '0');
    if (*pc == '.'){
      double divisor = 1, frac = 0;
      while(digit(*pc)){
        divisor *= 0.1;
        frac += (*pc++ - '0') * divisor;
      }
      dNum += frac;
    }
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex some constant word
bool SeeWord(const char* &pc, const char* sWord){
  ScanWhite(pc);
  const char* pc0 = pc;
  int len = strlen(sWord);
  if (strncmp(pc, sWord, len)==0 && !alphanum(pc[len])){
    pc += len;
    return true;
  }
  pc = pc0;
  return false;
}
// routine to lex a single character like an operator
bool SeeChar(const char* &pc, const char c){
  ScanWhite(pc);
  const char* pc0 = pc;
  if (*pc == c){
    pc++;
    return true;
  }
  pc = pc0;
  return false;
}
// primitive expression parser
void ParsePrimitiveExpr(const char* &pc, CNode* &p){
  double dNum;
  char sId[100];
  if (0);
  else if (SeeNum(pc, dNum)){
    p = new CNode(dNum);
  }
  else if (SeeId(pc, sId)){
    // see if its a function call
    if (SeeChar(pc, '(')){
      p = MakeNewFunctionCallNode(sId);
      while(!SeeChar(pc, ')')){
        CNode* p1 = null;
        ParseExpression(pc, p1);
        AddArgumentExprToFunctionCallNode(p, p1);
        SeeChar(pc, ','); /* optional comma separator */
      }
    }
    // otherwise its just a variable reference
    else {
      p = new CNode(sId);
    }
  }
  // handle embedded expressions
  else if (SeeChar(pc, '(')){
    ParseExpression(pc, p);
    if (!SeeChar(pc, ')')) /* deal with syntax error */
  }
}
#define PARSE_HIGHER ParsePrimitiveExpr
// product parser
void ParseProduct(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
  while(true){
    if (0);
    else if (SeeChar(pc, '*')){
      CNode p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('*', p, p1);
    }
    else if (SeeChar(pc, '/')){
     CNode p1 = null;
     PARSE_HIGHER(pc, p1);
     p = new CNode('/', p, p1);
   }
   else break;
  }
}
#undef  PARSE_HIGHER
#define PARSE_HIGHER ParseProduct
// sum parser
void ParseSum(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
  while(true){
    if (0);
    else if (SeeChar(pc, '+')){
      CNode p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('+', p, p1);
    }
    else if (SeeChar(pc, '-')){
      CNode p1 = null;
      PARSE_HIGHER(pc, p1);
      p = new CNode('-', p, p1);
    }
   else break;
  }
}
#undef  PARSE_HIGHER
// can insert more routines like the above
// to handle move operators
#define PARSE_HIGHER ParseSum
// overall expression parser
void ParseExpression(const char* &pc, CNode* &p){
  PARSE_HIGHER(pc, p);
}

Added some Pascal-style statement syntax:

void ParseStatement(const char* &pc){
  char sId[100];
  if(0);
  else if (SeeWord(pc, "begin")){
    while(!SeeWord(pc, "end")){
      ParseStatement(pc);
      SeeChar(pc, ';');
    }
  }
  else if (SeeWord(pc, "while")){
    CNode* p1 = null;
    ParseExpression(pc, p1);
    ParseStatement(pc);
    /* semantics for while statement */
  }
  else if (SeeWord(pc, "if")){
    CNode* p1 = null;
    ParseExpression(pc, p1);
    ParseStatement(pc);
    if (SeeWord(pc, "else")){
      ParseStatement(pc);
    }
    /* semantics for if statement */
  }
  else if (SeeWord(pc, "for")){
    /* you do it */
  }
  // handle assignments and subroutine calls
  else if (SeeId(pc, sId)){
    if(0);
    else if (SeeChar(pc, '=')){
      CNode* p1 = null;
      ParseExpression(pc, p1);
      /* semantics for assignment statement */
    }
    else if (SeeChar(pc, '(')){
      CNode* p = MakeNewFunctionCallNode(sId);
      while(!SeeChar(pc, ')')){
        CNode* p1 = null;
        ParseExpression(pc, p1);
        AddArgumentExprToFunctionCallNode(p, p1);
        SeeChar(pc, ','); /* optional comma separator */
      }
    }
    else {
      /* we have a 1-word statement, which is OK in pascal */
    }
  }
  else {
    /* syntax error */
  }
}

It still needs syntax for array indexing, variable declaration, and function definition, but I hope it is clear how to do that.

In his classic programming text, Algorithms + Data Structures = Programs, Niklaus Wirth develops an entire recursive descent parser (in Pascal) for a simple Pascal0-like language.

If you were writing it in Java I'd recommend ANTLR. It's a nice LL(*) parser-generator written in Java. There's a terrific book for it on Amazon, too.

bison & flex are the canonical parser generators. If you're interested in C++, I've found boost spirit useful. I've never used it for anything as complex as a compiler though. I'm sure others will have interesting suggestions for other languages such as C#...

You can actually use flex & bison with C++. In this tutorial, for example, you can see that section 5 is dedicated to that matter. Just google for it, and I'm sure you will find lots of examples.

I've written an XSLT parser with flex and bison. More lately I'm doing a project using ANTLR, though:

is JFig language syntax efficient and clear (and better than Spring-Framework’s XML DSL)?

I've liked working in ANTLR much more so than Flex and Bison. ANTLR puts you up at a higher level of abstraction in some respects. The lexical definitions and parser grammar can all go in one file. (ANTLR will generate the token file.)

One of the key items is the ability to define tree grammars. Basically you do a grammar parse over the input language and have actions that rewrite to a highly optimal AST tree output (which remain as linked data structures in memory). You then can pass this tree to another parser defined in a separate tree parser file. The tree parser is where you do the real work of the action items you want.

This is a nice approach as you can keep the AST form and repeatedly reprocess it as needed - peeling off specific subtree nodes to process against based on latter actions, etc. Think of something like a language interpretor. Instead of going into a for loop and processing the language from the ground up over and over again, can just process through it's AST representation.

In my case I've devised a bean factory for doing IoC dependency injection. My bean factory keeps the AST of a bean descriptor around at runtime. When it needs to make (or retrieve) a new bean instance, I just pass the bean descriptor AST subtree to my tree parser - the outcome is the desired bean instance (there are a lot of factors that go in to determining how to instantiate the requested bean, including making any other beans that are referenced and/or applying other special behaviors via meta attributes).

Finally, my current bean factory is targeting Java, but I want to target ActionScript3 and C# .NET. ANTLR has support for doing that.

As mentioned, Terrence Parr has written a book on how to use ANTLR. It is aimed at working programmers that need to do something practical with ANTLR (as opposed to an academic treatment of the subject).

If your wanting C# as per this Question try Gardens Point GPPG and GPLEX.

When you use Lex and Yacc you don't actually write much of anything in C. Lex is its own language, as is Yacc. So you write the lexical analyzer in Lex and the parser in Yacc. However, for Pascal, Lex and Yacc inputs are already available.

The resulting parser and lexer have C interfaces, that is true. However most languages, including C++, have simple ways to call (or wrap) C interfaces.

I'm not an expert in it, but I'm sure all of the above goes for ANTLR as well.

If you are asking to do it in "pure C++" (whatever that means), look into using boost spirit. I don't really see the point in theoretical purity if it causes a ton more work though.

Writing your own lexers and parsers by hand is actually kinda fun. A lexer is one of the very few situations where you can justify using both gotos and the preprocessor. However, I wouldn't suggest it for a full-blown language like Pascal if you can avoid it. That would be a lot of work. I'm talking man-years.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top