What exactly is the purpose of syntax objects in Scheme?

Question

Syntax objects are the repository for lexical context for the underlying Racket compiler. Concretely, when we enter a program like:

#lang racket/base
(* 3 4)

The compiler receives a syntax object representing the entire content of that program. Here's an example to let us see what that syntax object looks like:

#lang racket/base

(define example-program 
  (open-input-string
   "
    #lang racket/base
    (* 3 4)
   "))

(read-accept-reader #t)
(define thingy (read-syntax 'the-test-program example-program))
(print thingy) (newline)
(syntax? thingy)

Note that the * in the program has a compile-time representation as a syntax object within thingy. At the moment, the * in thingy has no idea where it comes from: it has no binding information yet. It's during expansion, as part of compilation, that the compiler resolves * as a reference to the * of #lang racket/base.
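To make the "no binding information yet" point concrete, here is a minimal sketch. It reads a single form rather than a whole #lang module (an assumption made only so the result is easy to take apart), and the names in, stx, and star are purely illustrative:

```racket
#lang racket/base

;; Read just one form so we can pull the pieces out directly.
(define in (open-input-string "(* 3 4)"))
(port-count-lines! in)   ; needed for line/column tracking
(define stx (read-syntax 'the-test-program in))

;; The first element of the form is the syntax object for *.
(define star (car (syntax->list stx)))

(syntax->datum stx)        ; the plain s-expression: '(* 3 4)
(syntax-source star)       ; 'the-test-program
(syntax-line star)         ; 1
(syntax-column star)       ; 1 (columns count from 0)
(identifier-binding star)  ; #f: freshly read syntax has no binding yet
```

Besides the datum, the syntax object already carries source-location information straight out of the reader; the binding information only shows up once the expander has processed it.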

We can see this more easily if we interact with things at compile time. (Note: I am deliberately avoiding talking about eval because I want to avoid mixing up discussion of what happens during compile-time vs. run-time.)

Here is an example to let us inspect more of what these syntax objects do:

#lang racket/base
(require (for-syntax racket/base))

;; This macro is only meant to let us see what the compiler is dealing with
;; at compile time.

(define-syntax (at-compile-time stx)
  (syntax-case stx ()
    [(_ expr)
     (let ()
       (define the-expr #'expr)
       (printf "I see the expression is: ~s\n" the-expr)

       ;; Ultimately, as a macro, we must return back a rewrite of
       ;; the input.  Let's just return the expr:
       the-expr)]))


(at-compile-time (* 3 4))

We'll use a macro here, at-compile-time, to let us inspect the state of things during compilation. If you run this program in DrRacket, you will see that DrRacket first compiles the program, and then runs it. As it compiles the program, when it sees uses of at-compile-time, the compiler will invoke our macro.

So at compile-time, we'll see something like:

I see the expression is: #<syntax:20:17 (* 3 4)>

Let's revise the program a little bit, and see if we can inspect the identifier-binding of identifiers:

#lang racket/base
(require (for-syntax racket/base))

(define-syntax (at-compile-time stx)
  (syntax-case stx ()
    [(_ expr)
     (let ()
       (define the-expr #'expr)
       (printf "I see the expression is: ~s\n" the-expr)
       (when (identifier? the-expr)
         (printf "The identifier binding is: ~s\n" (identifier-binding the-expr)))

       the-expr)]))


((at-compile-time *) 3 4)

(let ([* +])
  ((at-compile-time *) 3 4))

If we run this program in DrRacket, we'll see the following output:

I see the expression is: #<syntax:21:18 *>
The identifier binding is: (#<module-path-index> * #<module-path-index> * 0 0 0)
I see the expression is: #<syntax:24:20 *>
The identifier binding is: lexical
12
7

(By the way: why do we see the output from at-compile-time up front? Because compilation is done entirely before runtime! If we pre-compile the program and save the bytecode by using raco make, we would not see the compiler being invoked when we run the program.)

By the time the compiler reaches the uses of at-compile-time, it knows to associate the appropriate lexical binding information with identifiers. When we inspect the identifier-binding in the first case, the compiler knows that the identifier is bound by a particular module (in this case, #lang racket/base, which is what that module-path-index business is about). But in the second case, it knows that it's a lexical binding: the compiler already walked through the (let ([* +]) ...), and so it knows that uses of * refer back to the binding set up by the let.

The Racket compiler uses syntax objects to communicate that kind of binding information to clients, such as our macros.
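As a rough sketch of how a macro might act on that binding information, the following compares an identifier against the * that is visible where the macro is defined, using free-identifier=?. The macro name same-as-base-star? is made up for this illustration:

```racket
#lang racket/base
(require (for-syntax racket/base))

;; Expands to #t when the given identifier has the same binding as
;; the * visible at the macro definition (racket/base's *), else #f.
(define-syntax (same-as-base-star? stx)
  (syntax-case stx ()
    [(_ id)
     (identifier? #'id)
     (if (free-identifier=? #'id #'*)
         #'#t
         #'#f)]))

(same-as-base-star? *)      ; #t

(let ([* +])
  (same-as-base-star? *))   ; #f: here * is the let-bound one
```

Both uses pass an identifier that is spelled *, so a plain symbol comparison could not tell them apart; the decision relies entirely on the binding information carried by the syntax objects.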

Trying to use eval to inspect this sort of stuff is fraught with issues: the binding information in the syntax objects might not be relevant, because by the time we evaluate the syntax objects, their bindings might refer to things that don't exist! That's fundamentally the reason you were seeing errors in your experiments.

Still, here is one example that shows the difference between s-expressions and syntax objects:

#lang racket/base

(module mod1 racket/base
  (provide x)
  (define x #'(* 3 4)))

(module mod2 racket/base
  (define * +) ;; Override!
  (provide x)
  (define x  #'(* 3 4)))

;;;;;;;;;;;;;;;;;;;;;;;;;;;

(require (prefix-in m1: (submod "." mod1))
         (prefix-in m2: (submod "." mod2)))

(displayln m1:x)
(displayln (syntax->datum m1:x))
(eval m1:x)

(displayln m2:x)
(displayln (syntax->datum m2:x))
(eval m2:x)

This example is carefully constructed so that the contents of the syntax objects refer only to module-bound things, which will exist at the time we use eval. If we were to change the example slightly,

(module broken-mod2 racket/base
  (provide x)
  (define x  
    (let ([* +])
      #'(* 3 4))))

then things break horribly when we try to eval the x that comes out of broken-mod2, since the syntax object is referring to a lexical binding that doesn't exist by the time we eval. eval is a difficult beast.
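For contrast, here is a small sketch of what happens if we deliberately throw the binding information away. It assumes we are happy to have * resolved in a fresh namespace created with make-base-namespace, instead of honoring the original lexical binding; the names ns, stx, and datum are illustrative only:

```racket
#lang racket/base

;; A fresh namespace with racket/base's bindings available.
(define ns (make-base-namespace))

;; As in broken-mod2: the * inside the syntax object refers to
;; the let-bound *.
(define stx
  (let ([* +])
    #'(* 3 4)))

;; syntax->datum strips the lexical context, leaving a plain
;; s-expression.
(define datum (syntax->datum stx))   ; '(* 3 4)

;; eval then resolves * in ns, i.e. as racket/base's *.
(eval datum ns)   ; 12
```

This runs, but note what was lost: the result is 12, not the 7 that the let-bound * would have produced. The s-expression no longer knows which * it was written against, and that is exactly the information syntax objects exist to carry.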