Question

Preferred languages: C/C++, Java, and Ruby.

I am looking for some helpful books/tutorials on how to write your own compiler simply for educational purposes. I am most familiar with C/C++, Java, and Ruby, so I prefer resources that involve one of those three, but any good resource is acceptable.

Was it helpful?

Solution

Big List of Resources:

Legend:

  • ¶ Link to a PDF file
  • $ Link to a printed book

OTHER TIPS

This is a pretty vague question, I think; just because of the depth of the topic involved. A compiler can be decomposed into two separate parts, however; a top-half and a bottom-one. The top-half generally takes the source language and converts it into an intermediate representation, and the bottom half takes care of the platform specific code generation.

Nonetheless, one idea for an easy way to approach this topic (the one we used in my compilers class, at least) is to build the compiler in the two pieces described above. Specifically, you'll get a good idea of the entire process by just building the top-half.

Just doing the top half lets you get the experience of writing the lexical analyzer and the parser and go to generating some "code" (that intermediate representation I mentioned). So it will take your source program and convert it to another representation and do some optimization (if you want), which is the heart of a compiler. The bottom half will then take that intermediate representation and generate the bytes needed to run the program on a specific architecture. For example, the the bottom half will take your intermediate representation and generate a PE executable.

Some books on this topic that I found particularly helpful was Compilers Principles and Techniques (or the Dragon Book, due to the cute dragon on the cover). It's got some great theory and definitely covers Context-Free Grammars in a really accessible manner. Also, for building the lexical analyzer and parser, you'll probably use the *nix tools lex and yacc. And uninterestingly enough, the book called "lex and yacc" picked up where the Dragon Book left off for this part.

I think Modern Compiler Implementation in ML is the best introductory compiler writing text. There's a Java version and a C version too, either of which might be more accessible given your languages background. The book packs a lot of useful basic material (scanning and parsing, semantic analysis, activation records, instruction selection, RISC and x86 native code generation) and various "advanced" topics (compiling OO and functional languages, polymorphism, garbage collection, optimization and single static assignment form) into relatively little space (~500 pages).

I prefer Modern Compiler Implementation to the Dragon book because Modern Compiler implementation surveys less of the field--instead it has really solid coverage of all the topics you would need to write a serious, decent compiler. After you work through this book you'll be ready to tackle research papers directly for more depth if you need it.

I must confess I have a serious soft spot for Niklaus Wirth's Compiler Construction. It is available online as a PDF. I find Wirth's programming aesthetic simply beautiful, however some people find his style too minimal (for example Wirth favors recursive descent parsers, but most CS courses focus on parser generator tools; Wirth's language designs are fairly conservative.) Compiler Construction is a very succinct distillation of Wirth's basic ideas, so whether you like his style or not or not, I highly recommend reading this book.

I concur with the Dragon Book reference; IMO, it is the definitive guide to compiler construction. Get ready for some hardcore theory, though.

If you want a book that is lighter on theory, Game Scripting Mastery might be a better book for you. If you are a total newbie at compiler theory, it provides a gentler introduction. It doesn't cover more practical parsing methods (opting for non-predictive recursive descent without discussing LL or LR parsing), and as I recall, it doesn't even discuss any sort of optimization theory. Plus, instead of compiling to machine code, it compiles to a bytecode that is supposed to run on a VM that you also write.

It's still a decent read, particularly if you can pick it up for cheap on Amazon. If you only want an easy introduction into compilers, Game Scripting Mastery is not a bad way to go. If you want to go hardcore up front, then you should settle for nothing less than the Dragon Book.

"Let's Build a Compiler" is awesome, but it's a bit outdated. (I'm not saying it makes it even a little bit less valid.)

Or check out SLANG. This is similar to "Let's Build a Compiler" but is a much better resource especially for beginners. This comes with a pdf tutorial which takes a 7 step approach at teaching you a compiler. Adding the quora link as it have the links to all the various ports of SLANG, in C++, Java and JS, also interpreters in python and java, originally written using C# and the .NET platform.

If you're looking to use powerful, higher level tools rather than building everything yourself, going through the projects and readings for this course is a pretty good option. It's a languages course by the author of the Java parser engine ANTLR. You can get the book for the course as a PDF from the Pragmatic Programmers.

The course goes over the standard compiler compiler stuff that you'd see elsewhere: parsing, types and type checking, polymorphism, symbol tables, and code generation. Pretty much the only thing that isn't covered is optimizations. The final project is a program that compiles a subset of C. Because you use tools like ANTLR and LLVM, it's feasible to write the entire compiler in a single day (I have an existence proof of this, though I do mean ~24 hours). It's heavy on practical engineering using modern tools, a bit lighter on theory.

LLVM, by the way, is simply fantastic. Many situations where you might normally compile down to assembly, you'd be much better off compiling to LLVM's Intermediate Representation instead. It's higher level, cross platform, and LLVM is quite good at generating optimized assembly from it.

If you have little time, I recommend Niklaus Wirth's "Compiler Construction" (Addison-Wesley. 1996), a tiny little booklet that you can read in a day, but it explains the basics (including how to implement lexers, recursive descent parsers, and your own stack-based virtual machines). After that, if you want a deep dive, there's no way around the Dragon book as other commenters suggest.

You might want to look into Lex/Yacc (or Flex/Bison, whatever you want to call them). Flex is a lexical analyzer, which will parse and identify the semantic components ("tokens") of your language, and Bison will be used to define what happens when each token is parsed. This could be, but is definitely not limited to, printing out C code, for a compiler that would compile to C, or dynamically running the instructions.

This FAQ should help you, and this tutorial looks quite useful.

Generally speaking, there's no five minutes tutorial for compilers, because it's a complicated topic and writing a compiler can take months. You will have to do your own search.

Python and Ruby are usually interpreted. Perhaps you want to start with an interpreter as well. It's generally easier.

The first step is to write a formal language description, the grammar of your programming language. Then you have to transform the source code that you want to compile or interpret according to the grammar into an abstract syntax tree, an internal form of the source code that the computer understands and can operate on. This step is usually called parsing and the software that parses the source code is called a parser. Often the parser is generated by a parser generator which transform a formal grammar into source oder machine code. For a good, non-mathematical explanation of parsing I recommend Parsing Techniques - A Practical Guide. Wikipedia has a comparison of parser generators from which you can choose that one that is suitable for you. Depending on the parser generator you chose, you will find tutorials on the Internet and for really popular parser generators (like GNU bison) there are also books.

Writing a parser for your language can be really hard, but this depends on your grammar. So I suggest to keep your grammar simple (unlike C++); a good example for this is LISP.

In the second step the abstract syntax tree is transformed from a tree structure into a linear intermediate representation. As a good example for this Lua's bytecode is often cited. But the intermediate representation really depends on your language.

If you are building an interpreter, you will simply have to interpret the intermediate representation. You could also just-in-time-compile it. I recommend LLVM and libjit for just-in-time-compilation. To make the language usable you will also have to include some input and output functions and perhaps a small standard library.

If you are going to compile the language, it will be more complicated. You will have to write backends for different computer architectures and generate machine code from the intermediate representation in those backends. I recommend LLVM for this task.

There are a few books on this topic, but I can recommend none of them for general use. Most of them are too academic or too practical. There's no "Teach yourself compiler writing in 21 days" and thus, you will have to buy several books to get a good understanding of this entire topic. If you search the Internet, you will come across some some online books and lecture notes. Maybe there's a university library nearby you where you can borrow books on compilers.

I also recommend a good background knowledge in theoretical computer science and graph theory, if you are going to make your project serious. A degree in computer science will also be helpful.

One book not yet suggested but very important is "Linkers and Loaders" by John Levine. If you're not using an external assembler, you'll need a way to output a object file that can be linked into your final program. Even if you're using an external assembler, you'll probably need to understand relocations and how the whole program loading process works to make a working tool. This book collects a lot of the random lore around this process for various systems, including Win32 and Linux.

The Dragon Book is definitely the "building compilers" book, but if your language isn't quite as complicated as the current generation of languages, you may want to look at the Interpreter pattern from Design Patterns.

The example in the book designs a regular expression-like language and is well thought through, but as they say in the book, it's good for thinking through the process but is effective really only on small languages. However, it is much faster to write an Interpreter for a small language with this pattern than having to learn about all the different types of parsers, yacc and lex, et cetera...

If you're willing to use LLVM, check this out: http://llvm.org/docs/tutorial/. It teaches you how to write a compiler from scratch using LLVM's framework, and doesn't assume you have any knowledge about the subject.

The tutorial suggest you write your own parser and lexer etc, but I advise you to look into bison and flex once you get the idea. They make life so much easier.

I found the Dragon book much too hard to read with too much focus on language theory that is not really required to write a compiler in practice.

I would add the Oberon book which contains the full source of an amazingly fast and simple Oberon compiler Project Oberon.

Alt text

I remember asking this question about seven years ago when I was rather new to programming.

I was very careful when I asked and surprisingly I didn't get as much criticism as you are getting here. They did however point me in the direction of the "Dragon Book" which is in my opinion, a really great book that explains everything you need to know to write a compiler (you will of course have to master a language or two. The more languages you know, the merrier.).

And yes, many people say reading that book is crazy and you won't learn anything from it, but I disagree completely with that.

Many people also say that writing compilers is stupid and pointless. Well, there are a number of reasons why compiler development are useful:

  • Because it's fun.
  • It's educational, when learning how to write compilers you will learn a lot about computer science and other techniques that are useful when writing other applications.
  • If nobody wrote compilers the existing languages wouldn't get any better.

I didn't write my own compiler right away, but after asking I knew where to start. And now, after learning many different languages and reading the Dragon Book, writing isn't that much of a problem. (I'm also studying computer engineering atm, but most of what I know about programming is self taught.)

In conclusion, The Dragon Book is a great "tutorial". But spend some time mastering a language or two before attempting to write a compiler. Don't expect to be a compiler guru within the next decade or so though.

The book is also good if you want to learn how to write parsers/interpreters.

"... Let's Build a Compiler ..."

I'd second http://compilers.iecc.com/crenshaw/ by @sasb. Forget buying more books for the moment.

Why? Tools & language.

The language required is Pascal and if I remember correctly is based on Turbo-Pascal. It just so happens if you go to http://www.freepascal.org/ and download the Pascal compiler all the examples work straight from the page ~ http://www.freepascal.org/download.var The beaut thing about Free Pascal is you can use it almost whatever processor or OS you can care for.

Once you have mastered the lessons then try the more advanced "Dragon Book" ~ http://en.wikipedia.org/wiki/Dragon_book

I am looking into the same concept, and found this promising article by Joel Pobar,

Create a Language Compiler for the .NET Framework - not sure where this has gone

Create a Language Compiler for the .NET Framework - pdf copy of the original doc

he discusses a high level concept of a compiler and proceeds to invent his own langauge for the .Net framework. Although its aimed at the .Net Framework, many of the concepts should be able to be reproduced. The Article covers:

  1. Langauge definition
  2. Scanner
  3. Parser (the bit im mainly interested in)
  4. Targeting the .Net Framework The
  5. Code Generator

there are other topics, but you get the just.

Its aimed to people starting out, written in C# (not quite Java)

HTH

bones

An easy way to create a compiler is to use bison and flex (or similar), build a tree (AST) and generate code in C. With generating C code being the most important step. By generating C code, your language will automatically work on all platforms that have a C compiler.

Generating C code is as easy as generating HTML (just use print, or equivalent), which in turn is much easier than writing a C parser or HTML parser.

From the comp.compilers FAQ:

"Programming a Personal Computer" by Per Brinch Hansen Prentice-Hall 1982 ISBN 0-13-730283-5

This unfortunately-titled book explains the design and creation of a single-user programming environment for micros, using a Pascal-like language called Edison. The author presents all source code and explanations for the step-by-step implementation of an Edison compiler and simple supporting operating system, all written in Edison itself (except for a small supporting kernel written in a symbolic assembler for PDP 11/23; the complete source can also be ordered for the IBM PC).

The most interesting things about this book are: 1) its ability to demonstrate how to create a complete, self-contained, self-maintaining, useful compiler and operating system, and 2) the interesting discussion of language design and specification problems and trade-offs in Chapter 2.

"Brinch Hansen on Pascal Compilers" by Per Brinch Hansen Prentice-Hall 1985 ISBN 0-13-083098-4

Another light-on-theory heavy-on-pragmatics here's-how-to-code-it book. The author presents the design, implementation, and complete source code for a compiler and p-code interpreter for Pascal- (Pascal "minus"), a Pascal subset with boolean and integer types (but no characters, reals, subranged or enumerated types), constant and variable definitions and array and record types (but no packed, variant, set, pointer, nameless, renamed, or file types), expressions, assignment statements, nested procedure definitions with value and variable parameters, if statements, while statements, and begin-end blocks (but no function definitions, procedural parameters, goto statements and labels, case statements, repeat statements, for statements, and with statements).

The compiler and interpreter are written in Pascal* (Pascal "star"), a Pascal subset extended with some Edison-style features for creating software development systems. A Pascal* compiler for the IBM PC is sold by the author, but it's easy to port the book's Pascal- compiler to any convenient Pascal platform.

This book makes the design and implementation of a compiler look easy. I particularly like the way the author is concerned with quality, reliability, and testing. The compiler and interpreter can easily be used as the basis for a more involved language or compiler project, especially if you're pressed to quickly get something up and running.

You should check out Darius Bacon's "ichbins", which is a compiler for a small Lisp dialect, targeting C, in just over 6 pages of code. The advantage it has over most toy compilers is that the language is complete enough that the compiler is written in it. (The tarball also includes an interpreter to bootstrap the thing.)

There's more stuff about what I found useful in learning to write a compiler on my Ur-Scheme web page.

Python comes bundled with a python compiler written in Python. You can see the source code, and it includes all phases, from parsing, abstract syntax tree, emitting code, etc. Hack it.

The LCC compiler (wikipedia) (project homepage) of Fraser and Hanson is described in their book "A Retargetable C Compiler: Design and Implementation". It is quite readable and explains the whole compiler, down to code generation.

Sorry, it is in Spanish, but this is the bibliography of a course called "Compiladores e Intérpretes" (Compilers and Interpreters) in Argentina.

The course was from formal language theory to compiler construction, and these are the topics you need to build, at least, a simple compiler:

  • Compilers Design in C.
    Allen I. Holub

    Prentice-Hall. 1990.

  • Compiladores. Teoría y Construcción.
    Sanchís Llorca, F.J. , Galán Pascual, C. Editorial Paraninfo. 1988.

  • Compiler Construction.
    Niklaus Wirth

    Addison-Wesley. 1996.

  • Lenguajes, Gramáticas y Autómatas. Un enfoque práctico.
    Pedro Isasi Viñuela, Paloma Martínez Fernández, Daniel Borrajo Millán. Addison-Wesley Iberoamericana (España). 1997.

  • The art of compiler design. Theory and practice.
    Thomas Pittman, James Peters.

    Prentice-Hall. 1992.

  • Object-Oriented Compiler Construction.
    Jim Holmes.
    Prentice Hall, Englewood Cliffs, N.J. 1995

  • Compiladores. Conceptos Fundamentales.
    B. Teufel, S. Schmidt, T. Teufel.

    Addison-Wesley Iberoamericana. 1995.

  • Introduction to Automata Theory, Languages, and Computation.

    John E. Hopcroft. Jeffref D. Ullman.
    Addison-Wesley. 1979.

  • Introduction to formal languages.
    György E. Révész.

    Mc Graw Hill. 1983.

  • Parsing Techniques. A Practical Guide.
    Dick Grune, Ceriel Jacobs.
    Impreso por los autores. 1995
    http://www.cs.vu.nl/~dick/PTAPG.html

  • Yacc: Yet Another Compiler-Compiler.
    Stephen C. Johnson
    Computing Science Technical Report Nº 32, 1975. Bell Laboratories. Murray Hill, New
    Jersey.

  • Lex: A Lexical Analyzer Generator.
    M. E. Lesk, E. Schmidt. Computing Science Technical Report Nº 39, 1975. Bell Laboratories. Murray Hill, New Jersey.

  • lex & yacc.
    John R. Levine, Tony Mason, Doug Brown.
    O’Reilly & Associates. 1995.

  • Elements of the theory of computation.
    Harry R. Lewis, Christos H. Papadimitriou. Segunda Edición. Prentice Hall. 1998.

  • Un Algoritmo Eficiente para la Construcción del Grafo de Dependencia de Control.
    Salvador V. Cavadini.
    Trabajo Final de Grado para obtener el Título de Ingeniero en Computación.
    Facultad de Matemática Aplicada. U.C.S.E. 2001.

  1. This is a vast subject. Do not underestimate this point. And do not underestimate my point to not underestimate it.
  2. I hear the Dragon Book is a (the?) place to start, along with searching. :) Get better at searching, eventually it will be your life.
  3. Building your own programming language is absolutely a good exercise! But know that it will never be used for any practical purpose in the end. Exceptions to this are few and very far between.

Not a book, but a technical paper and an enormously fun learning experience if you want to know more about compilers (and metacompilers)... This website walks you through building a completely self-contained compiler system that can compile itself and other languages:

Tutorial: Metacompilers Part 1

This is all based on an amazing little 10-page technical paper:

Val Schorre META II: A Syntax-Oriented Compiler Writing Language

from honest-to-god 1964. I learned how to build compilers from this back in 1970. There's a mind-blowing moment when you finally grok how the compiler can regenerate itself....

I know the website author from my college days, but I have nothing to do with the website.

There's a lot of good answers here, so i thought I'd just add one more to the list:

I got a book called Project Oberon more than a decade ago, which has some very well written text on the compiler. The book really stands out in the sense that the source and explanations is very hands on and readable. The complete text (the 2005 edition) has been made available in pdf, so you can download right now. The compiler is discussed in chapter 12:

http://www-old.oberon.ethz.ch/WirthPubl/ProjectOberon.pdf

Niklaus Wirth, Jürg Gutknecht

(The treatment is not as extensive as his book on compilers)

I've read several books on compilers, and i can second the dragon book, time spent on this book is very worthwhile.

If you are interested in writing a compiler for a functional language (rather than a procedural one) Simon Peyton-Jones and David Lester's "Implementing functional languages: a tutorial" is an excellent guide.

The conceptual basics of how functional evaluation works is guided by examples in a simple but powerful functional language called "Core". Additionally, each part of the Core language compiler is explained with code examples in Miranda (a pure functional language very similar to Haskell).

Several different types of compilers are described but even if you only follow the so-called template compiler for Core you will have an excellent understanding of what makes functional programming tick.

I liked the Crenshaw tutorial too, because it makes it absolutely clear that a compiler is just another program that reads some input and writes some out put.

Read it.

Work it if you want, but then look at another reference on how bigger and more complete compilers are really written.

And read On Trusting Trust, to get a clue about the unobvious things that can be done in this domain.

You can use BCEL by the Apache Software Foundation. With this tool you can generate assembler-like code, but it's Java with the BCEL API. You can learn how you can generate intermediate language code (in this case byte code).

Simple example

  1. Create a Java class with this function:

    public String maxAsString(int a, int b) {
        if (a > b) {
            return Integer.valueOf(a).toString();
        } else if (a < b) {
            return Integer.valueOf(b).toString();
        } else {
            return "equals";
        }
    }
    

Now run BCELifier with this class

BCELifier bcelifier = new BCELifier("MyClass", System.out);
bcelifier.start();

You can see the result on the console for the whole class (how to build byte code MyClass.java). The code for the function is this:

private void createMethod_1() {
  InstructionList il = new InstructionList();
  MethodGen method = new MethodGen(ACC_PUBLIC, Type.STRING, new Type[] { Type.INT, Type.INT }, new String[] { "arg0", "arg1" }, "maxAsString", "MyClass", il, _cp);

  il.append(InstructionFactory.createLoad(Type.INT, 1)); // Load first parameter to address 1
  il.append(InstructionFactory.createLoad(Type.INT, 2)); // Load second parameter to adress 2
    BranchInstruction if_icmple_2 = InstructionFactory.createBranchInstruction(Constants.IF_ICMPLE, null); // Do if condition (compare a > b)
  il.append(if_icmple_2);
  il.append(InstructionFactory.createLoad(Type.INT, 1)); // Load value from address 1 into the stack
  il.append(_factory.createInvoke("java.lang.Integer", "valueOf", new ObjectType("java.lang.Integer"), new Type[] { Type.INT }, Constants.INVOKESTATIC));
  il.append(_factory.createInvoke("java.lang.Integer", "toString", Type.STRING, Type.NO_ARGS, Constants.INVOKEVIRTUAL));
  il.append(InstructionFactory.createReturn(Type.OBJECT));
  InstructionHandle ih_13 = il.append(InstructionFactory.createLoad(Type.INT, 1));
  il.append(InstructionFactory.createLoad(Type.INT, 2));
    BranchInstruction if_icmpge_15 = InstructionFactory.createBranchInstruction(Constants.IF_ICMPGE, null); // Do if condition (compare a < b)
  il.append(if_icmpge_15);
  il.append(InstructionFactory.createLoad(Type.INT, 2));
  il.append(_factory.createInvoke("java.lang.Integer", "valueOf", new ObjectType("java.lang.Integer"), new Type[] { Type.INT }, Constants.INVOKESTATIC));
  il.append(_factory.createInvoke("java.lang.Integer", "toString", Type.STRING, Type.NO_ARGS, Constants.INVOKEVIRTUAL));
  il.append(InstructionFactory.createReturn(Type.OBJECT));
  InstructionHandle ih_26 = il.append(new PUSH(_cp, "equals")); // Return "equals" string
  il.append(InstructionFactory.createReturn(Type.OBJECT));
  if_icmple_2.setTarget(ih_13);
  if_icmpge_15.setTarget(ih_26);
  method.setMaxStack();
  method.setMaxLocals();
  _cg.addMethod(method.getMethod());
  il.dispose();
}

Not included in the list so far is this book:

Basics of Compiler Design (Torben Mogensen) (from the dept. of Computer Science, University of Copenhagen)

I'm also interested in learning about compilers and plan to enter that industry in the next couple of years. This book is the ideal theory book to begin learning compilers as far as I can see. It's FREE to copy and reproduce, cleanly and carefully written and gives it to you in plain English without any code but still presents the mechanics by way of instructions and diagrams etc. Worth a look imo.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top