Split stacks unneccesary on amd64

Question 1

Yes, that's what they are saying.

Split stacks are (currently) unnecessary on 64bit architectures because the 64bit virtual address space is so large it can contain millions of stack address ranges, each as large as an entire 32bit address space, if needed.

In the Flat memory model in use nowadays, the translation from virtual addresses to phisical memory locations is done with the support of the hardware MMU. On amd64 it turns out it's better (meaning, overall faster) to reserve big chunks of the 64bit virtual address space to each new stack you are creating, while only mapping the first page (4kB) to actual RAM. This way, the stack will be able to grow and shrink as needed, over contiguous virtual addresses (meaning less code in each function prologue, a big optimization) while the OS re-configures the MMU to map each page of virtual addresses to an actual free page of RAM, whenever the stack grows or shrinks above/below some configurable thresholds.

By choosing the thresholds smartly (see for example the theory of dynamic arrays) you can achieve O(1) complexity on the average stack operation, while retaining the benefits of millions of stacks that can grow as much as you need and only consume the memory they use.

PS: the current Go implementation is far behind any of this :-)

Question 2

The Go core team is currently discussing the possibility of using contiguous stacks in a future Go version.

The split stack approach is useful because stacks can grow more flexibly but it also requires that the runtime allocates a relatively big chunk of memory to distribute these stacks across. There has been a lot of confusion about Go's memory usage, in part because of this.

Making contiguous but growable (relocatable) stacks is an option that would provide the same flexibility and maybe reduce the confusion about Go's memory usage. As well as remedying some ill corner-cases on low-memory machines (see linked thread).

As to advantages/disadvantages on 32-bit vs. 64-bit architectures, I don't think there are any directly associated solely with the use of segmented stacks.

Question 3

Update Go 1.4 (Q4 2014)

Change to the runtime:

Up to Go 1.4, the runtime (garbage collector, concurrency support, interface management, maps, slices, strings, ...) was mostly written in C, with some assembler support.
In 1.4, much of the code has been translated to Go so that the garbage collector can scan the stacks of programs in the runtime and get accurate information about what variables are active.

This rewrite allows the garbage collector in 1.4 to be fully precise, meaning that it is aware of the location of all active pointers in the program. This means the heap will be smaller as there will be no false positives keeping non-pointers alive. Other related changes also reduce the heap size, which is smaller by 10%-30% overall relative to the previous release.

A consequence is that stacks are no longer segmented, eliminating the "hot split" problem. When a stack limit is reached, a new, larger stack is allocated, all active frames for the goroutine are copied there, and any pointers into the stack are updated.

Initial answer (March 2014)

The article "Contiguous stacks in Go" by Agis Anastasopoulo also addresses this issue

In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant.
This is called the “hot split” problem inside the Go community.

The “hot split” will be addressed in Go 1.3 by implementing contiguous stacks.

Now when a stack needs to grow, instead of allocating a new segment the following happens:

Create a new, somewhat larger stack

Copy the contents of the old stack to the new stack

Re-adjust every copied pointer to point to the new addresses

Destroy the old stack

The following mention one problem seen mainly in 32-bit arhcitectures:

There is a certain challenge though.
The 1.2 runtime doesn’t know if a pointer-sized word in the stack is an actual pointer or not. There may be floats and most rarely integers that if interpreted as pointers, would actually point to data.

Due to the lack of such knowledge the garbage collector has to conservatively consider all the locations in the stack frames to be roots. This leaves the possibility for memory leaks especially on 32-bit architectures since their address pool is much smaller.

When copying stacks however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.

Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime.
This means not only that the collector in 1.3 can precisely stack data but re-adjusting stack pointers is now possible.