Question

Pardon me if this question is obvious to some, but I'm trying to teach myself how to write an interpreter. I'm doing this in python, I already have a Lexer programmed.

I've got my list of tokens created, where I'm stuck is constructing the parse tree. I have sort of an idea of where to go from here, but I'm not sure if I am thinking correctly.

This is the syntax I have defined in my grammar for a simple arithmetic expression using a regex.

<a_expression> = <identifier | number> <operator> <identifier | number>

BUT, if my parser receives a stream of tokens matching this pattern from my lexer:

<identifier | number> <operator> <identifier | number> <operator> <identifier | number>

How do I go about parsing this, since it has two operators and three operands instead of just two operands?

Moreover, how do I handle n operands and n-1 operators? I feel like this should be done recursively, but I'm not sure if I need to define more Parsers for different types of expressions or where to go from here. Can I match a pattern of n operands and n-1 operators with a regex?

Was it helpful?

Solution 2

Whether a regular expression is apt to parse your syntax, depends on whether your syntax (i.e. your grammar) is regular, too, or of another Chomsky-class.

For type-0 (unrestricted) grammars you will need a Turing machine.

For type-1 (context-sensitive) you will need a linear bounded automaton (or any of the above).

For type-2 (context-free) you will need a pushdown automaton (or any of the above).

And only type-3 (regular) can be read by regular expressions (or any of the above).

You can find further readings e.g. at wikipedia.

OTHER TIPS

While today's 'regular' expressions aren't strictly relegated to the land of Regular Languages, you'll find that you need a more powerful tool to do what you're trying to do.

Context-Free Grammars are what you want, and there are a few tools for writing CFGs in Python. Most notable is pyparsing, but there's a port of Haskell's Parsec library called Pysec that you could look into, too.

Infix arithmetic with precedence is not a regular language. Regular expressions are only good for parsing regular languages. (Modern regex implementations aren't really just regular expressions, and they can in fact parse most context-free languages… but they will take exponential time for some of them, and it's non-trivial to predict which ones.)

But it is a context-free language. See the Wikipedia article on Context-free grammar for a brief explanation. Context-free grammars are good for parsing both regular languages and context-free languages.

However, many languages that are non-regular don't need the full power of CFG.

Two important classes are those that are LL- or LR-parseable (in linear time). (Variants on these, especially LALR and SLR, are also important.) For example, Python can be (and is, at least in the reference implementation) parsed by an LL(1) parser.

Your language fits into an even more restrictive subset of LR(1), OP. In fact, as the name implies ("OP" is short for "Operator Precedence"), it's the paradigm case. And OP parsers are much easier to write by hand than more general parsers. So, if you're going to build a custom parser from scratch, that's what you'd probably want to use here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top