Python interpretation model in comparison to direct and virtual machine compilation

https://stackoverflow.com/questions/11267347

18-06-2021
|

Question

I have been compiling diagrams (pun intended) in hope of understanding the different implementations of common programming languages. I understand whether code is compiled or interpreted depends on the implementation of the code, and is not an aspect of the programming language itself.

I am interested in comparing Python interpretation with direct compilation (ex of C++)

and the virtual machine model (ex Java or C#)

In light of these two diagrams above, could you please help me develop a similar flowchart of how the .py file is converted to .pyc, uses the standard libraries (I gather they are called modules) and then actually run. Many programmers on SO indicate that python as a scripting language is not executed by the CPU but rather the interpreter, but that sounds quite impossible because ultimately hardware must be doing the computation.

Solution

First off, this is an implementation detail. I am limiting my answer to CPython and PyPy because I am familiar with them. Answers for Jython, IronPython, and other implementations will differ - probably radically.

Python is closer to the "virtual machine model". Python code is, contrary to the statements of some too-loud-for-their-level-of-knowledge people and despite everyone (including me) conflating it in casual discussion, never interpreted. It is always compiled to bytecode (again, on CPython and PyPy) when it is loaded. If it was loaded because a module was imported and was loaded from a .py file, a .pyc file may be created to cache the compilation output. This step is not mandatory; you can turn it off via various means, and program execution is not affected the tiniest bit (except that the next process to load the module has to do it again). However, the compilation to bytecode is not avoidable, the bytecode is generated in memory if it is not loaded from disk.

This bytecode (the exact details of which are an implementation detail and differ between versions) is then executed, at module level, which entails building function objects, class objects, and the like. These objects simply reuse (hold a pointer to) the bytecode which is already in memory. This is unlike C++ and Java, where code and classes are set in stone during/after compilation. During execution, import statements may be encountered. I lack the space, time and understanding to describe the import machinery, but the short story is:

If it was already imported once, you get that module object (another runtime construct for a thing static languages only have at compile time). A couple of builtin modules (well, all of them in PyPy, for reasons beyond the scope of this question) are already imported before any Python code runs, simply because they are so tightly integrated with the core of the interpreter and so fundamental. sys is such a module. Some Python code may also run beforehand, especially when you start the interactive interpreter (look up site.py).
Otherwise, the module is located. The rules for this are not our concern. In the end, these rules arrive at either a Python file or a dynamically-linked piece of machine code (.DLL on Windows, though Python modules specifically use the extension .pyd but that's just a name; on unix the equivalent .so is used).
- The module is first loaded into memory (loaded dynamically, or parsed and compiled to bytecode).
- Then, the module is initialized. Extension modules have a special function for that which is called. Python modules are simply run, from top to bottom. In well-behaved modules this just sets up global data, defines functions and classes, and imports dependencies. Of course, anything else can also happen. The resulting module object is cached (remember step one) and returned.

All of this applies to standard library modules as well as third party modules. That's also why you can get a confusing error message if you call a script of yours just like a standard library module which you import in that script (it imports itself, albeit without crashing due to caching - one of many things I glossed over).

How the bytecode is executed (the last part of your question) differs. CPython simply interprets it, but as you correctly note, that doesn't mean it magically doesn't use the CPU. Instead, there is a large ugly loop which detects what bytecode instruction shall be executed next, and then jumps to some native code which carries out the semantics of that instruction. PyPy is more interesting; it starts off interpreting but records some stats along the way. When it decides it's worth doing so, it starts recording what the interpreter does in detail, and generates some highly optimized native code. The interpreter is still used for other parts of the Python code. Note that it's the same with many JVMs and possibly .NET, but the diagram you cite glosses over that.

OTHER TIPS

For the reference implementation of python:

(.py) -> python (checks for .pyc) -> (.pyc) -> python (execution dynamically loads modules)

There are other implementations. Most notable are:

jython which compiles (.py) to (.class) and follows the java pattern from there
pypy which employs a JIT as it compiles (.py). the chain from there could vary (pypy could be run in cpython, jython or .net environments)

Python is technically a scripted language but it is also compiled, python source is taken from its source file and fed into the interpreter which often compiles the source to bytecode either internally and then throws it away or externally and saves it like a .pyc

Yes python is a single virtual machine that then sits ontop of the actual hardware but all python bytecode is, is a series of instructions for the pvm (python virtual machine) much like assembler for the actual CPU.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow