Architecture of Google's distributed supervision model

Question

We know relatively little about the internal infrastructure at Google. What little can be gleaned comes either from being employed at Google or from reading the papers it has published.

Google uses a model in which distribution and supervision happen at the UNIX process level. This makes sense for a number of reasons:

UNIX processes are isolated from one another by the memory management unit (MMU).
A crashing process can be restarted, perhaps on another machine.
UNIX is a well-known target.
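
The second point, restarting a crashed process, can be sketched as a small supervisor loop. This is only an illustration of supervision at the UNIX process level, not a description of Google's actual tooling: the function name `supervise` and the `max_restarts` parameter are hypothetical, and the sketch restarts on the same machine rather than migrating the work elsewhere.

```python
import subprocess


def supervise(cmd, max_restarts=3):
    """Run cmd as a child process and restart it whenever it fails.

    A non-zero exit status (including death by signal, which
    subprocess reports as a negative return code) counts as a crash.
    Returns the number of restarts that were needed before a clean
    exit, or raises RuntimeError once max_restarts is used up.
    """
    restarts = 0
    while True:
        # The child runs in its own address space, so a crash there
        # cannot corrupt the supervisor's memory.
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"{cmd!r} failed {restarts + 1} times, giving up"
            )
        restarts += 1
```

Calling `supervise(["some-worker"])` with a hypothetical worker binary would keep relaunching it until it exits cleanly or the restart budget runs out. A real system would add backoff, logging, and the option of restarting on a different machine.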

On top of this, Google builds infrastructure which allows you to "plug in" sequential systems in order to easily make them distributed. The "Chubby lock manager" comes to mind here.

In contrast, Erlang's model is also about protection, but for lightweight processes that either run in the same memory space or communicate over TCP sockets. It provides its own ecosystem in which to handle supervision and distribution. Thus, while the problems look the same on the surface, the details differ.

The quote also gets a number of things utterly wrong:

Erlang is a safe language in the sense that a program will either progress to compute a value or fault with an error, often resulting in a crash of the process in question. There is no way the program can "go wrong" in the sense of undefined behaviour. Erlang does support a variant of static typing, namely success typing. Type enforcement, however, happens entirely at run-time. Erlang does not have a rich static type system of the kind some people mean by "strongly typed".
Erlang has very fast string processing. I don't know where the myth to the contrary comes from. It takes more knowledge to work with Erlang's string processing, but it has the distinct advantage that it rules out many typical bugs which occur when processing strings in other languages.

The reason nobody answers this question is that it is hard. A Google employee probably can't answer without leaking IP. A non-Google employee can only point to the relevant papers about Google's infrastructure.

Suffice it to say, though, that you will need distribution capabilities in any larger system setup today. But the question is: "Do you get this by copying what Google did 5-10 years ago?"