Question

I don't know Python very well. I'm trying to understand more precisely which features of dynamic languages (à la Python, Lua, Scheme, Perl, Ruby, ....) force their implementations to be slow.

As a case in point, Lua 5.3's metatable machinery would intuitively make Lua quite slow, yet in practice Lua is rumored to be quite fast (and faster than Python).

Also, I have the intuition (perhaps a wrong one) that since memory is much slower than raw computation on current processors (a memory access with a cache miss takes as long as hundreds of arithmetic operations), dynamic type checking (à la if (value->type != INTEGER_TAG) return; in C parlance) could run quite fast.
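To make that tag check concrete, here is a toy sketch in Python of the style of representation the C snippet above alludes to. This is purely illustrative (the names Value, INTEGER_TAG, and add are made up, and no real interpreter works exactly like this): each value carries a tag, and every primitive operation checks tags before touching payloads. The check itself is a cheap compare-and-branch; the cost comes from it (and the boxing) happening on every single operation.

```python
# Hypothetical sketch of a tagged-value representation; not how any
# real interpreter (CPython included) actually stores its values.
INTEGER_TAG, STRING_TAG = 0, 1

class Value:
    __slots__ = ("tag", "payload")
    def __init__(self, tag, payload):
        self.tag = tag
        self.payload = payload

def add(a, b):
    # Dynamic type check on every operation, as in the C snippet:
    # a cheap branch, but executed for each and every addition.
    if a.tag != INTEGER_TAG or b.tag != INTEGER_TAG:
        raise TypeError("add expects integers")
    return Value(INTEGER_TAG, a.payload + b.payload)

print(add(Value(INTEGER_TAG, 2), Value(INTEGER_TAG, 3)).payload)  # 5
```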

Of course, whole-program analysis (as done by the Stalin Scheme implementation) can make a dynamic language implementation that works as a translator run fast, but let's pretend I don't have time to design a whole-program analyzer at first.

(I'm sort of designing a dynamic language in my MELT monitor, and some of it would be translated to C)

Solution

What semantic features of Python (and other dynamic languages) contribute to its slowness?

None.

Performance of language implementations is a function of money, resources, and PhD theses, not language features. Self is much more dynamic than Smalltalk and slightly more dynamic than Python, Ruby, ECMAScript, or Lua, yet it had a VM that outperformed all existing Lisp and Smalltalk VMs. In fact, the Self distribution shipped with a small Smalltalk interpreter written in Self, and even that was faster than most existing Smalltalk VMs. The Self VM was competitive with, and sometimes even faster than, the C++ implementations of the time.

Then, Sun stopped funding Self, and IBM, Microsoft, Intel, and Co. started funding C++, and the trend reversed. The Self developers left Sun to start their own company, where they used the technology developed for the Self VM to build one of the fastest Smalltalk VMs ever (the Animorphic VM), and then Sun bought back that company, and a slightly modified version of that Smalltalk VM is now better known under the name of "HotSpot JVM". Ironically, Java programmers look down on dynamic languages for being "slow", when in fact, Java was slow until it adopted dynamic language technology. (Yes, that's right: the HotSpot JVM is essentially a Smalltalk VM. The bytecode verifier does a lot of type checking, but once the bytecode is accepted by the verifier, the VM, and especially the optimizer and the JIT don't actually do much of interest with the static types!)

CPython simply doesn't do a lot of the stuff that makes dynamic languages (or rather dynamic dispatch) fast: dynamic compilation (JIT), dynamic optimization, speculative inlining, adaptive optimization, dynamic de-optimization, dynamic type feedback / inference. There's also the problem that almost the entire core and standard library is written in C, which means that even if you suddenly made Python 100x faster, it wouldn't help you much, because something like 95% of the code executed by a Python program is C, not Python. If everything were written in Python, even moderate speedups would create an avalanche effect: the algorithms get faster, and the core data structures get faster, but of course the core data structures are also used within the algorithms, and the core algorithms and core data structures are used everywhere else, and so on …
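One of the techniques named above, dynamic type feedback, rests on the observation that most call sites only ever see one receiver type. Here is a minimal sketch of a monomorphic inline cache, the basic building block behind Self/HotSpot-style adaptive optimization (the class names InlineCache and Point are illustrative, not any VM's actual API):

```python
# Sketch of a monomorphic inline cache: remember the receiver type
# last seen at a call site and skip the generic method lookup while
# that prediction keeps holding. Real VMs do this in machine code.
class InlineCache:
    def __init__(self, selector):
        self.selector = selector
        self.cached_type = None    # receiver type seen last time
        self.cached_method = None  # method resolved for that type

    def call(self, receiver, *args):
        t = type(receiver)
        if t is not self.cached_type:   # miss: do the full lookup once
            self.cached_type = t
            self.cached_method = getattr(t, self.selector)
        return self.cached_method(receiver, *args)  # fast path

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def norm1(self):
        return abs(self.x) + abs(self.y)

site = InlineCache("norm1")
print(site.call(Point(3, -4)))  # 7 — second call onward hits the cache
```

A JIT takes this one step further: once the cache is stable, it speculatively inlines the cached method into the caller, and de-optimizes if the prediction is ever violated.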

There are a couple of things that are notoriously bad for memory-managed OO languages (dynamic or not) in today's systems. Virtual memory and memory protection can be a killer for garbage collection performance in particular, and system performance in general. And it is completely unnecessary in a memory-safe language: why protect against illegal memory accesses when there aren't any memory accesses in the language to begin with? Azul have figured out how to use modern powerful MMUs (Intel Nehalem and newer, and AMD's equivalent) to help garbage collection instead of hindering it, but even though the CPU supports this, the memory subsystems of current mainstream OSs aren't powerful enough to allow it (which is why Azul's JVM actually runs virtualized on the bare metal beside the OS, not within it).

In the Singularity OS project, Microsoft have measured an impact of ~30% on system performance when using MMU protection instead of the type system for process separation.

Another thing Azul noticed when building their specialized Java CPUs was that modern mainstream CPUs focus on exactly the wrong thing when trying to reduce the cost of cache misses: they try to reduce the number of cache misses through things like branch prediction, memory prefetching, and so on. But in a heavily polymorphic OO program, the access patterns are essentially pseudo-random; there simply is nothing to predict. So all of those transistors are just wasted, and what one should do instead is reduce the cost of every individual cache miss. (The total cost is #misses × cost per miss; mainstream CPUs try to bring the first factor down, Azul the second.) Azul's Java Compute Accelerators could have 20000 concurrent cache misses in flight and still make progress.
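The cost model in that parenthesis can be spelled out with illustrative numbers (these are assumptions for the sake of the arithmetic, not Azul's measurements): halving either factor of the product reduces total stall time by the same amount.

```python
# Illustrative arithmetic only, not measured data: total stall time is
# the product #misses * cost_per_miss, so attacking either factor is
# equally effective — mainstream CPUs shrink the first, Azul the second.
def total_stall_ns(misses, cost_per_miss_ns):
    return misses * cost_per_miss_ns

baseline   = total_stall_ns(1_000_000, 100)  # 100 ms of memory stalls
mainstream = total_stall_ns(  500_000, 100)  # halve the miss count
azul_style = total_stall_ns(1_000_000,  50)  # halve the per-miss cost
assert mainstream == azul_style == baseline // 2
```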

When Azul started, they thought they would take some off-the-shelf I/O components and design their own specialized CPU core, but what they actually ended up needing to do was the exact opposite: they took a rather standard off-the-shelf 3-address RISC core and designed their own memory controller, MMU, and cache subsystem.

tl;dr: The "slowness" of Python is not a property of the language but of a) its naive primary implementation (CPython), and b) the fact that modern CPUs and OSs are specifically designed to make C run fast, so the features they have for C's benefit either don't help (caches) or actively hurt (virtual memory) Python performance.

And you can insert pretty much any memory-managed language with dynamic ad-hoc polymorphism here … when it comes to the challenges of an efficient implementation, even Python and Java are pretty much "the same language".

OTHER TIPS

While Python's current primary implementation (which lacks many of the optimisations performed by other dynamic language implementations, e.g. modern JavaScript engines and, as you point out, Lua) is the source of most of its problems, Python does have some semantic features that would make it difficult for any implementation to compete with other languages, at least in certain fields. Some worth particular consideration:

  • List and dictionary operations are required by the language definition to be atomic. This means that unless a JIT compiler can prove that no reference to a list object has escaped its current thread (an analysis that is difficult in many cases and impossible in the general case), it must ensure that access to the object is serialised (e.g. via locking). The CPython implementation solves this issue with the notorious "global interpreter lock", which prevents Python code from making effective use of multiple cores via multi-threading, and while other solutions are possible, they all have performance problems.

  • Python has no mechanism for specifying the use of value objects; everything is handled by reference, adding extra indirection where it is not necessarily required. While it is possible for a JIT compiler to infer value objects in some cases and automatically optimize this, it is not possible to do so generally and therefore code that is not carefully written to ensure optimization is possible (which is somewhat of a black art) will suffer.

  • Python has an eval function, which means that a JIT compiler cannot make assumptions about actions not occurring, even if it performs whole-program analysis, as long as eval is used even once. For example, a Python compiler cannot assume that a class has no subclasses and devirtualize its method calls, because that assumption could later be invalidated by a call to eval. It must instead perform dynamic checks to ensure that the assumptions baked into native-compiled code have not been invalidated before that code executes.
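The eval point in the last bullet can be demonstrated directly (the names Base, Sub, and call_greet are invented for the example): at one moment the class genuinely has no subclasses, so a compiler might be tempted to hard-wire the method call, and a single eval later that assumption is false.

```python
# Demonstrates the devirtualization hazard from the last bullet:
# Base looks like a leaf class until eval introduces a subclass.
class Base:
    def greet(self):
        return "base"

def call_greet(obj):
    return obj.greet()  # a compiler must NOT hard-wire Base.greet here

assert call_greet(Base()) == "base"
assert Base.__subclasses__() == []   # no subclasses... yet

src = (
    "class Sub(Base):\n"
    "    def greet(self):\n"
    "        return 'sub'\n"
)
eval(compile(src, "<string>", "exec"))  # eval can execute arbitrary code

# The "leaf class" assumption is now invalidated at runtime.
sub_cls = Base.__subclasses__()[0]
assert call_greet(sub_cls()) == "sub"
```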

Licensed under: CC-BY-SA with attribution