Forth: How do CREATE and DOES> work exactly?

https://softwareengineering.stackexchange.com/questions/339283

04-01-2021
Question

I am in the process of creating my own concatenative language, heavily based on Forth.

I am having a little trouble understanding how the compiling words CREATE and DOES> work and how they are implemented (that is, how exactly the state of Forth's run-time environment changes when they are executed).

I have read the following resources that give a general view, but only of how to use them, and not of how a system implements them:

Forth Primer - CREATE...DOES>
Arrays in Forth
Forth's CREATE...DOES> Maybe I'm amazed?
The sections in the ANS Forth standard (that I had trouble understanding because of the very general terms that are used)

The following things about the behaviour of these two words are unclear to me:

CREATE takes the next (space-delimited) word from the input stream, and creates a new dictionary item for it.
- What happens then?
- Does CREATE fill in anything in the new dictionary item, or not?
- What does CREATE return (on the stack?)?
- Is there anything special that happens to the words between CREATE and DOES>?
DOES> 'fills in' the run time behaviour of the created word.
- What does DOES> consume as input?
- How does it alter the dictionary entry of the CREATE'd word?
- In code snippets like "17 CREATE SEVENTEEN ," no DOES> is used. Is there some kind of 'default behaviour' that DOES> overrides?

These uncertainties all arise from one core problem: I have trouble understanding what is going on, and how these seemingly complex concepts can be (and are) implemented simply in a low-level language like assembly.

How do CREATE and DOES> work exactly?

Solution

So, I'm a little late to the game, but these questions (particularly about DOES>) were mystifying me as well, being new to Forth. Here is what I've learned and how I've implemented it:

[TL;DR: "CREATE" makes a word with a simple, default behavior. "DOES>" does not return to its caller. Instead, it uses the return address to put a "goto" in the most recent definition.]

CREATE does not take anything from the stack or return anything. It parses a word from the input and makes a dictionary entry for it. It fills in the code for the newly-created word with standard boilerplate that pushes an aligned address on the stack (the same aligned address that subsequent "," (comma) calls will fill with data) and then simply returns. In my system, the generated code for something like CREATE NewVar would look like this:

NewVar:    push_data  next_addr
           return
next_addr:

Therefore, we could define (initialized) VARIABLE as:

: VARIABLE CREATE 0 , ;

or, in pseudo-machine code:

VARIABLE:  call       CREATE
           push_data  0
           call       comma
           return

Saying something like VARIABLE NewVar would then make NewVar as a word that does the "push_data/return". The 0 , then stores a zero at the address that NewVar puts on the stack -- the "next_addr" shown in the code snippet. Doing things like NewVar @ or 42 NewVar ! then reads and writes that location.
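To make the "push_data/return" boilerplate concrete, here is a minimal sketch in Python of a toy memory model. It is only an illustration under assumptions: code and data share one list, and the names mem, create, comma, variable and execute are hypothetical stand-ins, not part of any real Forth system.

```python
# Toy model: code and data share one memory list, as in the system described above.
mem = []          # each cell holds an instruction tuple or a plain data value
dictionary = {}   # word name -> address of the word's first instruction
stack = []        # data stack

def create(name):
    """Model of CREATE: lay down 'push_data next_addr' followed by 'return'."""
    addr = len(mem)
    dictionary[name] = addr
    mem.append(("push_data", addr + 2))  # next_addr is the cell right after 'return'
    mem.append(("return",))

def comma(value):
    """Model of ',': append one data cell at the end of memory (HERE)."""
    mem.append(value)

def variable(name):
    """Model of  : VARIABLE CREATE 0 , ;"""
    create(name)
    comma(0)

def execute(name):
    """Run a created word until it hits its 'return'."""
    pc = dictionary[name]
    while True:
        op = mem[pc]
        if op[0] == "push_data":
            stack.append(op[1])
            pc += 1
        elif op[0] == "return":
            return

variable("NewVar")          # VARIABLE NewVar
execute("NewVar")           # NewVar pushes next_addr
addr = stack.pop()
mem[addr] = 42              # 42 NewVar !
execute("NewVar")
value = mem[stack.pop()]    # NewVar @
```

In this model the cell that "0 ," fills is exactly the address the created word pushes, which is why NewVar @ and NewVar ! see the same location.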

There is nothing special (at least in my system) about the words between CREATE and DOES>, or even the words after DOES>, in terms of compilation. A word whose definition uses CREATE and DOES> is compiled normally, with DOES> compiled as an ordinary "call". The special thing that DOES> does at run time is this: it finds the code of the most recently created word and overwrites its "return" instruction with a "jump" instruction whose destination is DOES>'s own return address, i.e. the address of the code that follows DOES> in the defining word. DOES> pops that address off the return stack every time it is called and uses it to build the "jump". When DOES> then returns, it is actually returning to whoever called the word that had DOES> in it, not to the remainder of that word's code. My implementation of DOES> looks sort of like this:

DOES>:  [find second opcode of latest definition]
        popr      ; like "R>"
        [overwrite opcode with a "jump" to TOS value]
        return

So, when we define VALUE like this:

: VALUE VARIABLE DOES> @ ;

What we get is something like this:

VALUE: call   VARIABLE
       call   DOES>
       call   fetch  <-- return address of call to DOES>
       return

The code will call our definition of VARIABLE, given above, which in turn calls CREATE to create the new entry. But, when it calls DOES>, DOES> will pop the return address pointed to above, and adjust the definition of NewVar to jump to that location, thus making NewVar so that it pushes "next_addr" on the data stack as before, but now jumps and calls fetch. This also makes execution of VALUE such that it ends at the call to DOES>. When DOES> returns, it returns to the caller of VALUE, not to the remainder of VALUE's definition (@ ;).
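The return-address trick can be simulated in a few lines of Python. This is a sketch under assumptions, not the author's actual implementation: the instruction names mirror the pseudo-machine code above, while run, compile_word, input_words and the prim_* helpers are hypothetical names invented for the illustration.

```python
mem, dictionary, stack, rstack = [], {}, [], []
input_words = []   # pending input stream that CREATE parses from
latest = None      # address of the most recent definition

def prim_create():
    global latest
    name = input_words.pop(0)
    latest = len(mem)
    dictionary[name] = latest
    mem.extend([("push_data", latest + 2), ("return",)])

def prim_comma():
    mem.append(stack.pop())

def prim_fetch():
    stack.append(mem[stack.pop()])

def prim_does():
    target = rstack.pop()               # popr: DOES>'s own return address
    mem[latest + 1] = ("jump", target)  # overwrite 'return' at a fixed offset
    # Falling out of here "returns" to the caller of the defining word.

def compile_word(name, body):
    dictionary[name] = len(mem)
    mem.extend(body)

def run(addr):
    rstack.append(None)                 # sentinel: returning to None stops the loop
    pc = addr
    while pc is not None:
        op = mem[pc]
        if op[0] == "push_data":
            stack.append(op[1]); pc += 1
        elif op[0] == "jump":
            pc = op[1]
        elif op[0] == "return":
            pc = rstack.pop()
        elif op[0] == "call":
            rstack.append(pc + 1)
            if callable(op[1]):         # primitive: run it, then return
                op[1]()
                pc = rstack.pop()
            else:                       # compiled word: jump into its code
                pc = op[1]

# : VARIABLE CREATE 0 , ;
compile_word("VARIABLE", [("call", prim_create), ("push_data", 0),
                          ("call", prim_comma), ("return",)])
# : VALUE VARIABLE DOES> @ ;
compile_word("VALUE", [("call", dictionary["VARIABLE"]), ("call", prim_does),
                       ("call", prim_fetch), ("return",)])

input_words.append("NewVar")
run(dictionary["VALUE"])                # VALUE NewVar
new_var = dictionary["NewVar"]
mem[new_var + 2] = 99                   # store something in NewVar's data cell
run(new_var)                            # NewVar now pushes its address, then fetches
```

After the run, the second instruction of NewVar is a jump into the middle of VALUE (the cell holding "call fetch"), and executing NewVar leaves the stored number on the data stack.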

Notice that CREATE is not an immediate word. Our definition of VARIABLE was "CREATE 0 ,", but that does not create a word named "0", since CREATE is encountered during the definition of VARIABLE... it just gets baked into the definition. Instead, it is when VARIABLE actually executes that CREATE will attempt to retrieve the next word from input and make a new definition for it.

Also notice that DOES> assumes a lot about the most recently defined word. I could have made it search diligently for the "return" opcode, but instead, knowing how CREATE lays out a new word, it simply uses a fixed offset into that definition. I'm leaning on the spec, which says "An ambiguous condition exists if [ the most recent definition ] was not defined with CREATE ...". In my system, that "ambiguous condition" is that some word gets a "jump" opcode as its second instruction.

What does DOES> consume as input?

It consumes its own return address and also uses a global variable that points to the most recent definition.

Is there some kind of 'default behaviour' that DOES> overrides?

Yes. It is the default behavior ("push_data/return") that CREATE makes.

OTHER TIPS

Answering your questions in order.

CREATE may allocate a new, empty data space. It then sets the "data field" of the new dictionary entry to HERE and the execution semantics to push the value of that "data field". ' foo >BODY will return the "data field" of foo assuming foo was made with CREATE. CREATE doesn't return (or consume) anything on the stack. Nothing special happens with the words between CREATE and DOES>. While ANS Forth only defines what DOES> does in a compilation context and only allows it to operate on a definition created with CREATE, it does not require them to occur within the same definition.

One consequence of the above is HERE CREATE foo foo = may or may not be true, but CREATE foo foo HERE = should always be true.
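One reason the first comparison can fail is alignment: the standard has CREATE align the data-space pointer before that pointer becomes the data field. A small Python sketch of just that rule follows; the 4-byte cell size and the DataSpace class are assumptions made for the illustration.

```python
CELL = 4  # assumed cell size in bytes

def align(addr):
    """Round addr up to the next multiple of CELL."""
    return (addr + CELL - 1) // CELL * CELL

class DataSpace:
    def __init__(self):
        self.here = 0          # data-space pointer (HERE)
        self.data_field = {}   # name -> data-field address

    def c_comma(self):
        """Model of C, : reserve a single byte, possibly leaving HERE unaligned."""
        self.here += 1

    def create(self, name):
        """Model of CREATE: align first, then record HERE as the data field."""
        self.here = align(self.here)
        self.data_field[name] = self.here

ds = DataSpace()
ds.c_comma()                                     # HERE is now 1 (unaligned)
before = ds.here
ds.create("foo")
same_before = before == ds.data_field["foo"]     # HERE CREATE foo foo =
same_after = ds.data_field["foo"] == ds.here     # CREATE foo foo HERE =
```

Here same_before is False because CREATE moved the pointer from 1 to 4, while same_after is True because nothing was allotted after CREATE.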

Answering your questions for DOES>: DOES>, in ANS Forth, is only given meaning in a compilation context. DOES> behaves kind of like ; :NONAME except it won't return an execution token and will instead update the last word that was defined (assuming it was defined with CREATE). It consumes (at compilation time) a "colon-sys" which is basically an opaque, implementation-defined representation of the code being defined. It terminates that definition and creates a new one. Quoting from the standard, DOES> "[r]eplace[s] the execution semantics of the most recent definition [...] with the name execution semantics given below." The "default behavior" is pushing the value of the "data field". That said, the semantics given begin with pushing the value of the "data field", so, conceptually, it's more like DOES> opens the execution semantics up for extension. I say "opens up for extension" because it's the compilation semantics of the words following DOES> that actually extend the definition.

So there is one main thing to understand about Forth: there are no extended "control" structures; the semantics of a Forth program can be understood by processing it a word at a time. Instead, the words interact with each other by modifying some (effectively global) state. That may be the interpreter mode (e.g. switching to compilation mode), the dictionary, the input buffer, or one of various stacks.

A basic semantics for Forth compilation would have the "colon-sys" just be a pointer to the end of some code. The compilation semantics of a literal is to append a push statement to that block and move the pointer forward. The compilation semantics of a normal word is to append a call statement to the block and move the pointer forward. ; then appends a return statement into the block and pops the "colon-sys" off the stack. A dictionary entry is just a pointer to a block of code. The interpretation semantics of CREATE is to make a new dictionary entry and append the code to push HERE and return. DOES> simply looks up the code block of the most recently defined dictionary entry and points the pointer for the code block, i.e. the "colon-sys" it pushes, to the return statement (so that it will be overwritten). The compilation semantics of the following words will then update that code block as usual.

Many of the restrictions that ANS Forth imposes are to allow a simple bump allocator to be used (assuming separate code and data spaces). For example, nested compilation is not allowed since the code from the nested definition would end up embedded in the code of the outer definition. Similarly, DOES> only operates on the most recently defined definition since its code will be at the end of allocated code space.
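The simplified model in this answer can be sketched in Python as follows. It is an illustration of the description above only, with DOES> applied directly to a word that CREATE has just made; the names code, emit, does and the data_here value of 100 are hypothetical.

```python
code = []         # code space: a flat, bump-allocated list of instructions
dictionary = {}   # name -> index of the word's first instruction
data_here = 100   # assumed current data-space pointer (HERE)
latest = None     # index of the most recent definition's code block

def create(name):
    """CREATE: new entry whose code pushes HERE and returns."""
    global latest
    latest = len(code)
    dictionary[name] = latest
    code.append(("push", data_here))
    code.append(("return",))

def does():
    """DOES>: the colon-sys it yields points at the 'return' to be overwritten."""
    return latest + 1

def emit(colon_sys, instr):
    """Compile one instruction at the pointer and move the pointer forward."""
    if colon_sys < len(code):
        code[colon_sys] = instr   # overwrite (this is how 'return' disappears)
    else:
        code.append(instr)        # normal bump allocation at the end
    return colon_sys + 1

create("foo")                     # CREATE foo
ptr = does()                      # DOES>
ptr = emit(ptr, ("call", "@"))    # compiling  @
ptr = emit(ptr, ("return",))      # compiling  ;
```

The overwrite only works because foo's code block is the last thing in code space, which is the bump-allocator restriction mentioned above.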

Download the book: Thinking Forth here... https://sourceforge.net/projects/thinking-forth/files/reprint/rel-1.0/thinking-forth.pdf/download?use_mirror=gigenet&download=

A cleaner explanation is on the Forth, Inc website: https://www.forth.com/starting-forth/11-forth-compiler-defining-words/

This is Forth's definition of CONSTANT, which is a defining word.

: CONSTANT CREATE , DOES> @ ;

CREATE parses the input stream at run time (that is, when CONSTANT executes), takes the next word after CONSTANT, and creates a dictionary entry for it. Then the code between CREATE and DOES> is executed; in this example "," stores one cell (a 16-bit value on a 16-bit Forth) in memory. Later, when you execute the new word that CONSTANT created, the code following DOES> executes with the address of that cell on the stack, so @ leaves the stored number on the data stack.
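The two phases (define time versus the created word's run time) can be mimicked with a Python closure. This is only an analogy under assumptions, not how a Forth system is built; constant, words and memory are hypothetical names.

```python
stack = []    # data stack
memory = []   # data space, one list slot per cell
words = {}    # name -> Python function standing in for the compiled word

def constant(name):
    """Analogy for  : CONSTANT CREATE , DOES> @ ;"""
    # Define-time part (between CREATE and DOES>): store the value with ','
    addr = len(memory)
    memory.append(stack.pop())

    # Run-time part (after DOES>): runs each time the new word executes
    def child():
        stack.append(addr)                  # data-field address is pushed first
        stack.append(memory[stack.pop()])   # @
    words[name] = child

stack.append(10)
constant("TEN")     # 10 CONSTANT TEN
words["TEN"]()      # TEN
result = stack[-1]
```

The value is consumed once, when TEN is defined, and the fetch happens every time TEN is executed.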

Few languages let you create new compiler functions like this. C doesn't, and no, #define does not count...

It is very easy to create new languages in Forth. Forth is often called a meta-language because of this feature.

Licensed under: CC-BY-SA with attribution
