Question

As a pet-project, I'd like to attempt to implement a basic language of my own design that can be used as a web-scripting language. It's trivial to run a C++ program as an Apache CGI, so the real work lies in how to parse an input file containing non-code (HTML/CSS markup) and server-side code.

In my undergrad compiler course, we used Flex and Bison to generate a scanner and a parser for a simple language. We were given a copy of the grammar and wrote a parser that translated the simple language to a simple assembly for a virtual machine. The flex scanner tokenizes the input, and passes the tokens to the Bison parser.

The difference between that and what I'd like to do is that like PHP, this language could have plain HTML markup and the scripting language interspersed like the following:

<p>Hello,
<? echo "World ?>
</p>

Am I incorrect in assuming that it would be efficient to parse the input file as follows:

  1. Scan input until a script start tag is found ('
  2. Second scanner tokenizes the server-side script section of the input file (from the open tag: '') and passes the token to the parser, which has no need to know about the markup in the file.
  3. Control is returned to the first scanner that continues this general pattern.

Basically, the first scanner only differentiates between Markup (which is returned directly to the browser unmodified) and code, which is passed to the second scanner, which in turn tokenizes the code and passes the tokens to the parser.

If this is not a solid design pattern, how do languages such as PHP handle scanning input and parsing code efficiently?

Was it helpful?

Solution

You want to look at start conditions. For example:

"<?"            { BEGIN (PHP); }
<PHP>[a-zA-Z]*  { return PHP_TOKEN; }
<PHP>">?"       { BEGIN (0); }
[a-zA-Z]*       { return HTML_TOKEN; }

You start off in state 0, use the BEGIN macro to change states. To match a RE only while in a particular state, prefix the RE with the state name surrounded by angle-brackets.

In the example above, "PHP" is state. "PHP_TOKEN" and "HTML_TOKEN" are _%token_s defined by your yacc file.

OTHER TIPS

PHP doesn't differentiate between the scanning and the Markup. It simply outputs to buffer when in Markup mode, and then switches to parsing when in code mode. You don't need a two pass scanner, and you can do this with just a single flex lexer.

If you are interested in how PHP itself works, download the source (try the PHP4 source it is a lot easier to understand). What you want to look at is in the Zend Directory, zend_language_scanner.l.

Having written something similar myself, I would really recommend rethinking going the Flex and Bison route, and go with something modern like Antlr. It is a lot easier, easier to understand (the macros employed in a lex grammar get very confusing and hard to read) and it has a built in debugger (AntlrWorks) so you don't have to spend hours looking at 3 Meg debug files. It also supports many languages (Java, c#, C, Python, Actionscript) and has an excellent book and a very good website that should be able to get you up and running in no time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top