How are variables stored in a language compiler or interpreter?

https://softwareengineering.stackexchange.com/questions/355059

17-01-2021

Question

Say we set a variable in Python.

five = 5

Boom. What I'm wondering is, how is this stored? Does the compiler or interpreter just put it in a variable like so?

varname = ["five"]
varval  = [5]

If this is how it is done, where is that stored? It seems like this could go on forever.

Solution

Interpreter

An interpreter will work roughly the way you guessed. In a simple model, it maintains one dictionary with the variable names as keys and the variable values as values. If the language has variables that are visible only in specific contexts (scopes), the interpreter maintains multiple dictionaries, one per context. The interpreter itself is typically a compiled program, so for how it stores its own data, see below.
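A minimal sketch of that model, assuming a hypothetical Environment class (the name and its methods are illustrative, not taken from any real interpreter): each scope owns one dictionary, and a lookup falls back to the enclosing scope when the name is not found locally.

```python
class Environment:
    """One dictionary per scope, chained to the enclosing scope."""

    def __init__(self, parent=None):
        self.values = {}      # variable name -> current value
        self.parent = parent  # enclosing scope, or None for the global scope

    def define(self, name, value):
        self.values[name] = value

    def lookup(self, name):
        # Walk outward through the scopes until the name is found.
        env = self
        while env is not None:
            if name in env.values:
                return env.values[name]
            env = env.parent
        raise NameError(name)


global_scope = Environment()
global_scope.define("five", 5)

function_scope = Environment(parent=global_scope)
function_scope.define("six", 6)

print(function_scope.lookup("six"))   # found in the inner dictionary
print(function_scope.lookup("five"))  # falls back to the global dictionary
```

CPython itself exposes something similar: calling globals() at module level returns the dictionary that holds the module's variables.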

Compiler

(This depends very much on the language and compiler and is extremely simplified, so it's just meant to give some idea.)

Let's say we have a global variable int five = 5. A global variable exists only once in the program, so the compiler reserves one memory area of 4 bytes (a typical int size) in a data area. It can use a fixed address, let's say 1234. In the executable file, the compiler records that the four bytes starting at 1234 are needed as static data memory and are to be filled with the number 5 at program start. Optionally (for debugger support) it also records that the location 1234 is called five and contains an integer. Wherever some other line of code refers to the variable named five, the compiler remembers that it is placed at 1234 and inserts a memory read or write instruction for address 1234.
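To make the fixed-address idea concrete, here is a small simulation, assuming a bytearray stands in for the program's memory and using the address 1234 from the text. The helper names read_int and write_int are made up for this sketch; a real compiler would emit machine instructions instead.

```python
import struct

memory = bytearray(4096)  # stand-in for the program's memory
FIVE_ADDR = 1234          # address the compiler chose for `five`

# "Program start": the static data area is filled with the initial value 5.
struct.pack_into("<i", memory, FIVE_ADDR, 5)


def read_int(addr):
    """Read a 4-byte little-endian integer at a fixed address."""
    return struct.unpack_from("<i", memory, addr)[0]


def write_int(addr, value):
    """Write a 4-byte little-endian integer at a fixed address."""
    struct.pack_into("<i", memory, addr, value)


# `five = five + 1` becomes a read and a write of address 1234.
# The name "five" does not appear anywhere at this point.
write_int(FIVE_ADDR, read_int(FIVE_ADDR) + 1)
print(read_int(FIVE_ADDR))
```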

If int six = 6 is a local variable within a function, it should exist once for every currently active call of this function (there can be several because of recursion or multi-threading). So every function call reserves enough space on the stack to hold its variables (including four bytes for our six variable). The compiler decides where to place the six variable within this stack frame, maybe at 8 bytes from the frame start, and remembers that relative position. So the instructions that the compiler produces for the function are:

advance the stack pointer by enough bytes for all the local variables of the function.
store the number 6 (initial value of six) into the memory location 8 bytes above the stack pointer.
wherever the function refers to six, the compiler inserts a read or write instruction for the memory location 8 bytes above the stack pointer.
when finished with the function, rewind the stack pointer to its old value.
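The four steps above can be simulated in a few lines. This is only a sketch under stated assumptions: the Machine class, the 16-byte frame size, and the downward-growing stack are choices made for the illustration; only the 8-byte offset comes from the text. The function calls itself twice so that three copies of six exist at the same time.

```python
import struct

FRAME_SIZE = 16  # assumed space for all locals of the function
SIX_OFFSET = 8   # `six` lives 8 bytes above the stack pointer


class Machine:
    def __init__(self, size=1024):
        self.stack = bytearray(size)
        self.sp = size  # this simulated stack grows toward lower addresses


def call(m, depth, seen):
    m.sp -= FRAME_SIZE                                     # step 1: reserve the frame
    struct.pack_into("<i", m.stack, m.sp + SIX_OFFSET, 6)  # step 2: six = 6
    # Step 3: every use of `six` is an access at sp + SIX_OFFSET.
    six = struct.unpack_from("<i", m.stack, m.sp + SIX_OFFSET)[0]
    struct.pack_into("<i", m.stack, m.sp + SIX_OFFSET, six + depth)
    if depth < 2:
        call(m, depth + 1, seen)  # the nested call gets its own frame
    addr = m.sp + SIX_OFFSET
    seen.append((addr, struct.unpack_from("<i", m.stack, addr)[0]))
    m.sp += FRAME_SIZE                                     # step 4: release the frame


machine = Machine()
seen = []
call(machine, 0, seen)
print(seen)  # (address, value) of `six` for each active call
```

Each call sees its own six at a different absolute address, even though the compiled code always uses the same offset relative to the stack pointer.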

Once again, that's just a very simplified model, not covering all variable types, but maybe it helps to get an understanding...

Other tips

It depends on the implementation.

For example, a C compiler might maintain a symbol table during compilation. This is a rich data structure that allows pushing and popping of scopes, since each compound-statement opening brace { potentially introduces a new scope for new local variables. In addition to handling scopes coming and going, it records the variables declared, and for each one its name and type.

This symbol table data structure also supports looking up a variable's information by name, i.e. by its identifier. The compiler does this when it binds the declared variable information to the raw identifiers it sees in the parse, so this happens fairly early during compilation.
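A rough sketch of such a scoped symbol table, assuming a hypothetical SymbolTable class (real compilers store far more per entry and differ in the details): scopes are pushed and popped as braces open and close, and a lookup searches from the innermost scope outward.

```python
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]  # scopes[0] is file scope; the innermost scope is last

    def push_scope(self):
        self.scopes.append({})

    def pop_scope(self):
        self.scopes.pop()

    def declare(self, name, type_name):
        scope = self.scopes[-1]
        if name in scope:
            raise KeyError(f"redeclaration of {name}")
        # The location is filled in later, when the compiler assigns storage.
        scope[name] = {"type": type_name, "location": None}

    def lookup(self, name):
        # Innermost declaration wins, which models shadowing.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None


table = SymbolTable()
table.declare("five", "int")
table.push_scope()                 # the compiler sees '{'
table.declare("five", "double")    # shadows the outer declaration
inner = table.lookup("five")["type"]
table.pop_scope()                  # the compiler sees '}'
outer = table.lookup("five")["type"]
print(inner, outer)
```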

At some point, the compiler assigns locations to the variables. Perhaps location assignments are recorded in the same symbol table data structure. The compiler could do location assignment directly during parsing, but it is likely to be able to do a better job if it waits not just until after parsing, but after general optimization.

At some point, for local variables, the compiler assigns either a stack location or a CPU register (it can be more complex in that the variable can actually have multiple locations, such as a stack location for some parts of the generated code and a CPU register for other sections).

Finally, the compiler generates actual code: machine instructions that reference variables' values directly by their CPU registers or assigned stack locations, as needed to execute the code being compiled. Each line of source code compiles to its own series of machine code instructions, so the generated instructions encode not only the operations (add, subtract) but also the locations of the variables being referenced.

The final object code that comes out of the compiler no longer has variable names and types; there are only locations: stack locations or CPU registers. Further, there is no table of locations. Instead, each machine instruction has the location of the variable it uses encoded directly in it. The runtime code never looks up identifiers; each bit of generated code simply knows the operation to perform and the location(s) to use.
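The disappearance of names can be shown with a toy compiler. This is a sketch only: the statement format, the instruction names (PUSH_CONST, LOAD, STORE, ADD, SUB), and the use of numbered slots on a small stack machine are all assumptions made for the example, standing in for real machine code with registers and stack offsets.

```python
def compile_program(statements):
    """Compile (target, left, op, right) statements; operands are names or ints."""
    slots = {}  # exists at compile time only: variable name -> slot index
    code = []

    def slot_of(name):
        return slots.setdefault(name, len(slots))

    def load(operand):
        if isinstance(operand, int):
            code.append(("PUSH_CONST", operand))
        else:
            code.append(("LOAD", slot_of(operand)))

    for target, left, op, right in statements:
        load(left)
        load(right)
        code.append((op,))
        code.append(("STORE", slot_of(target)))
    return code, len(slots)


def run(code, n_slots):
    """Execute the compiled code; note that no variable names are available here."""
    storage = [0] * n_slots
    stack = []
    for instr in code:
        if instr[0] == "PUSH_CONST":
            stack.append(instr[1])
        elif instr[0] == "LOAD":
            stack.append(storage[instr[1]])
        elif instr[0] == "STORE":
            storage[instr[1]] = stack.pop()
        elif instr[0] in ("ADD", "SUB"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if instr[0] == "ADD" else a - b)
    return storage


code, n_slots = compile_program([
    ("five", 5, "ADD", 0),           # five = 5 + 0
    ("six", "five", "ADD", 1),       # six = five + 1
    ("diff", "six", "SUB", "five"),  # diff = six - five
])
print(code)
print(run(code, n_slots))
```

The slots dictionary is discarded once compilation finishes, which mirrors the point that the generated code carries locations rather than names.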

When debugging is enabled during compilation, the compiler will output a form of the symbol table so that, for example, debuggers will know the names of the variables at the various stack locations.

Some other languages need to look up identifiers dynamically at runtime, so they may also provide some form of symbol table in support of such needs.

Interpreters have a wide range of options. They might maintain a symbol-table-like data structure for use during execution (in addition to its use during parsing). Instead of assigning and tracking a stack location, they simply store the variable's value in the variable's entry in that data structure.

A symbol table is perhaps stored in the heap rather than on the stack (though using the stack for scopes and variables is certainly possible, and an interpreter may also mimic a stack in the heap to get the cache-friendly advantage of packing the variables' values near each other). So an interpreter is probably using heap memory for storing the variables' values, whereas a compiler uses stack locations. Generally speaking, the interpreter also does not have the freedom to use CPU registers as storage for variables' values, since the CPU registers are otherwise busy running the code of the interpreter itself...

The best way to understand what your code is being compiled into is to compile your code to assembly. The assembly code is closest to the processor instructions that are being executed.
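For Python, the closest equivalent is to look at CPython bytecode with the standard dis module. The function add_one below is just an example; the exact opcode names in the listing vary between CPython versions, so none are assumed here.

```python
import dis


def add_one():
    six = 6
    return six + 1


dis.dis(add_one)                     # prints the bytecode listing for the function
print(add_one.__code__.co_varnames)  # local variable names are kept in a separate tuple
```

In the listing, accesses to the local six are shown with its name for readability, but the instruction argument is an index into the function's local-variable slots, with the names held separately in co_varnames.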

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange