hand coding a parser

https://stackoverflow.com/questions/2632641

26-09-2019
|

Question

For all you compiler gurus, I wanna write a recursive descent parser and I wanna do it with just code. No generating lexers and parsers from some other grammar and don't tell me to read the dragon book, i'll come around to that eventually.

I wanna get into the gritty details about implementing a lexer and parser for a reasonable simple language, say CSS. And I wanna do this right.

This will probably end up being a series of questions but right now I'm starting with a lexer. Tokenization rules for CSS can be found here.

I find my self writing code like this (hopefully you can infer the rest from this snippet):

public CssToken ReadNext()
{
    int val;
    while ((val = _reader.Read()) != -1)
    {
        var c = (char)val;
        switch (_stack.Top)
        {
            case ParserState.Init:
                if (c == ' ')
                {
                    continue; // ignore
                }
                else if (c == '.')
                {
                    _stack.Transition(ParserState.SubIdent, ParserState.Init);
                }
                break;

            case ParserState.SubIdent:
                if (c == '-')
                {
                    _token.Append(c);
                }
                _stack.Transition(ParserState.SubNMBegin);
                break;

What is this called? and how far off am I from something reasonable well understood? I'm trying to balance something which is fair in terms of efficiency and easy to work with, using a stack to implement some kind of state machine is working quite well, but I'm unsure how to continue like this.

What I have is an input stream, from which I can read 1 character at a time. I don't do any look a head right now, I just read the character then depending on the current state try to do something with that.

I'd really like to get into the mind set of writing reusable snippets of code. This Transition method is currently means to do that, it will pop the current state of the stack and then push the arguments in reverse order. That way, when I write Transition(ParserState.SubIdent, ParserState.Init) it will "call" a sub routine SubIdent which will, when complete, return to the Init state.

The parser will be implemented in much the same way, currently, having everything in a single big method like this allows me to easily return a token when I found one, but it also forces me to keep everything in one single big method. Is there a nice way to split these tokenization rules into separate methods?

Solution

What you're writing is called a pushdown automaton. This is usually more power than you need to write a lexer, it's certainly excessive if you're writing a lexer for a modern language like CSS. A recursive descent parser is close in power to a pushdown automaton, but recursive descent parsers are much easier to write and to understand. Most parser generators generate pushdown automatons.

Lexers are almost always written as finite state machines, i.e., like your code except get rid of the "stack" object. Finite state machines are closely related to regular expressions (actually, they're provably equivalent to one another). When designing such a parser, one usually starts with the regular expressions and uses them to create a deterministic finite automaton, with some extra code in the transitions to record the beginning and end of each token.

There are tools to do this. The lex tool and its descendants are well known and have been translated into many languages. The ANTLR toolchain also has a lexer component. My preferred tool is ragel on platforms that support it. There is little benefit to writing a lexer by hand most of the time, and the code generated by these tools will probably be faster and more reliable.

If you do want to write your own lexer by hand, good ones often look something like this:

function readToken() // note: returns only one token each time
    while !eof
        c = peekChar()
        if c in A-Za-z
            return readIdentifier()
        else if c in 0-9
            return readInteger()
        else if c in ' \n\r\t\v\f'
            nextChar()
        ...
    return EOF

function readIdentifier()
    ident = ""
    while !eof
        c = nextChar()
        if c in A-Za-z0-9
            ident.append(c)
        else
            return Token(Identifier, ident)
            // or maybe...
            return Identifier(ident)

Then you can write your parser as a recursive descent parser. Don't try to combine lexer / parser stages into one, it leads to a total mess of code. (According to the Parsec author, it's slower, too).

OTHER TIPS

If you are going to hand code everything from scratch I would definately consider going with a recursive decent parser. In your post you are not really saying what you will be doing with the token stream once you have parsed the source.

Some things I would recommend getting a handle on
1. Good design for your scanner/lexer, this is what will be tokenizing your source code for your parser.
2. The next thing is the parser, if you have a good ebnf for the source language the parser can usually translate quite nicely into a recursive decent parser.
3. Another data structure you will really need to get your head around is the symbol table. This can be as simple as a hashtable or as complex as a tree structure that can represent complex record structures etc. I think for CSS you might be somewhere between the two.
4. And finally you want to deal with code generation. You have many options here. For an interpreter, you might simply interpret on the fly as you parse the code. A better approach might be to generate a for of i-code that you can then write an iterpreter for, and later even a compiler. Of course for the .NET platform you could directly generate IL (probably not applicable for CSS :))

For references, I gather you are not heavy into the deep theory and I do not blame you. A really good starting point for getting the basics without complex, code if you do not mind the Pascal that is, is Jack Crenshaw's 'Let's build a compiler'
http://compilers.iecc.com/crenshaw/
Good luck I am sure you are going to enjoy this project.

You need to write your own Recursive Descent Parser from your BNF/EBNF. I had to write my own recently and this page was a lot of help. I'm not sure what you mean by "with just code". Do you mean you want to know how to write your own recursive parser?

If you want to do that, you need to have your grammar in place first. Once you have your EBNF/BNF in place, the parser can be written quite easily from it.

The first thing I did when I wrote my parser, was to read everything in and then tokenize the text. So I essentially ended up with an array of tokens that I treated as a stack. To reduce the verbosity/overhead of pulling a value off a stack and then pushing it back on if you don't require it, you can have a peek method that simply returns the top value on the stack without popping it.

UPDATE

Based on your comment, I had to write a recursive-descent parser in Javascript from scratch. You can take a look at the parser here. Just search for the constraints function. I wrote my own tokenize function to tokenize the input as well. I also wrote another convenience function (peek, that I mentioned before). The parser parses according to the EBNF here.

This took me a little while to figure out because it's been years since I wrote a parser (last time I wrote it was in school!), but trust me, once you get it, you get it. I hope my example gets your further along on your way.

ANOTHER UPDATE

I also realized that my example may not be what you want because you might be going towards using a shift-reduce parser. You mentioned that right now you are trying to write a tokenizer. In my case, I did write my own tokenizer in Javascript. It's probably not robust, but it was sufficient for my needs.

 function tokenize(options) {
            var str = options.str;
            var delimiters = options.delimiters.split("");
            var returnDelimiters = options.returnDelimiters || false;
            var returnEmptyTokens = options.returnEmptyTokens || false;
            var tokens = new Array();
            var lastTokenIndex = 0;

            for(var i = 0; i < str.length; i++) {
                if(exists(delimiters, str[i])) {
                    var token = str.substring(lastTokenIndex, i);

                    if(token.length == 0) {
                        if(returnEmptyTokens) {
                            tokens.push(token);
                        }
                    }

                    else {
                        tokens.push(token);
                    }

                    if(returnDelimiters) {
                        tokens.push(str[i]);
                    }

                    lastTokenIndex = i + 1;
                }
            }

            if(lastTokenIndex < str.length) {
                var token = str.substring(lastTokenIndex, str.length);
                token = token.replace(/^\s+/, "").replace(/\s+$/, "");

                if(token.length == 0) {
                    if(returnEmptyTokens) {
                        tokens.push(token);
                    }
                }

                else {
                    tokens.push(token);
                }
            }

            return tokens;
        }

Based on your code, it looks like you are reading, tokenizing, and parsing at the same time - I'm assuming that's what a shift-reduce parser does? The flow for what I have is tokenize first to build the stack of tokens, and then send the tokens through the recursive-descent parser.

It looks like you want to implement a "shift-reduce" parser, where you explicitly build a token stack. The usual alternative is a "recursive descent" parser, in which depth of procedure calls build the same token stack with their own local variables, on the actual hardware stack.

In shift-reduce, the term "reduce" refers to the operation performed on the explicitly-maintained token stack. For example, if the top of the stack has become Term, Operator, Term then a reduction rule can be applied resulting in Expression as a replacement for the pattern. The reduction rules are explicitly encoded in a data structure used by the shift-reduce parser; as a result, all reduction rules can be found in the same spot of the source code.

The shift-reduce approach brings a few benefits compared to recursive-descent. On a subjective level, my opinion is that shift-reduce is easier to read and maintain than recursive-descent. More objectively, shift-reduce allows for more informative error messages from the parser when an unexpected token occurs.

Specifically, because the shift-reduce parser has an explicit encoding of rules for making "reductions," the parser is easily extended to articulate what sorts of tokens could legally have followed. (e.g., "; expected"). A recursive descent implementation cannot easily be extended to do the same thing.

A great book on both kinds of parser, and the trade-offs in implementing different kinds of shift-reduce is "Introduction to Compiler Construction", by Thomas W. Parsons.

Shift-reduce is sometimes called "bottom-up" parsing and recursive-descent is sometimes called "top-down" parsing. In the analogy used, nodes composed with highest precedence (e.g., "factors" in multiplication expression) are considered to be "at the bottom" of the parsing. This is in accord with the same analogy used in "descent" of "recursive descent".

If you want to use the parser to also handle not-well-formed expressions, you really want a recursive descent parser. Much easier to get the error handling and reporting usable.

For literature, I'd recommend some of the old work of Niklaus Wirth. He knows how to write. Algorithms + Data Structures = Programs is what I used, but you can find his Compiler Construction online.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow