Correct way of parsing S-expressions in OOP

https://stackoverflow.com/questions/9435400

12-11-2019
|

Question

I am looking for a way to implement an S-expression reader (to be used later on with both an Scheme interpreter and a compiler), but I've been asking myself how (if at all) I should write an AST for it.

I've been reading SICP, and this is quite straightforward from within Scheme, but I'm looking to implement the interpreter and compiler in C++, in OO fashion.

Please, keep in mind that I'm doing this only for learning purposes, so I'm not really looking for the easiest or quickest way of doing this, but rather the correct and reusable way of doing it.

I've seen in some Scheme implementations that people parse s-expressions and readily output cons cells, something like this:

  struct Sexpr
  {
  };

  struct Cons : public Sexpr
  {
    Sexpr* left;
    Sexpr* right;
  };

  struct IntAtom : Sexpr
  {
    int value;
  };

And one subclass of Sexpr for each kind of Scheme Atom, or something along those lines.

I'm not sure, but this seems like a hack to me... Shouldn't this work be done by an interpreter rather than the reader?

What I want to know is if this is considered the best (or correct) way of reading S-expressions, or is this more a job of the interpreter than the parser? Should the parser have its own AST instead of relying on cons cells?

Solution

To follow up from the Scheme/Racket side of the fence:

Racket (and some other Scheme implementations) use a richer representation for syntax objects, so that they can have properties attached to them indicating (in Racket, at least) what context they're bound in, what source location they come from, what pass of the compiler inserted them, and any other information you might want to store (cf. "syntax properties" in Racket).

This additional information enables things like error messages with pointers to source, and hygienic macros.

Please note that I mean "richer" here simply in the "contains more values" sense, not in any non-value-neutral way.

I should also add---before falling into the Turing Tar Pit---that you can also represent this exact same information using a table on the side; assuming you have pointer comparisons, there's no expressiveness difference between putting a value inside a structure and using a table to associate the structure with the value.

OTHER TIPS

If you want to be somewhat complete in your syntax, you will need to support

sexpr ::= atom | sexpr sexpr
atom ::= nil | intatom | etc.

But that is more general than most sexpr you will encounter. The easiest and most common form of S-expr which in LISP/Scheme is like (a b c d) where each of a,b,c,d are atoms or lists. In pair form this is [a [b [c [d nil] ] ] ], which means all right sides of your conses are lists.

So if you are going for clean, you might just do

class sexpr {};
class atom : sexpr {};
class s_list : forward_list<smart_ptr<sexpr>> {};

While one can probably argue back and forth over what the “correct” approach is, in my opinion, the approach you suggest—using the same data structures for reading, compilation, evaluation, and processing—is the one that will teach you the most about what Lisp and the “code is data” mantra are about, and in particular, what the quote operator actually means (which is quite a profound thing).

It is also, incidentally, the way most Lisps (interestingly, not including Scheme) traditionally work.

So yes, have the reader generate Lisp data: conses, symbols, Lisp numbers, strings, et cetera, the exact same stuff the user-level Lisp code will deal with. It will make the rest of the implementation both simpler and more instructive.

You might want to take a look at this c/c++ s-expr parser library for an example of how it has been done.

It looks like the base representation is:

struct elt {
  int type;
  char *val;
  struct elt *list; 
  struct elt *next;
};

And I quote from their docs:

Since an element can be either a list or atom, the element structure has a type indicator that can be either LIST or VALUE. If the type indicator is LIST, then the structure member "list" will be a pointer to the head of the list represented by this element. If the type indicator is VALUE, then the structure member "val" will contain the atom represented by the element as a string. In both cases, the "next" pointer will point at the next element of the current s-expression.

Additionally here is a whole list of other implementations of s-expr readers in lots of languages that may be of interest.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow