To start with, a point of terminology: "garbage collection" means different things to different people, and some GC schemes are more sophisticated than others. Some people consider reference counting to be a form of GC, but personally I consider "true GC" to be distinct from reference counting.
With refcounts, there is an integer tracking the number of references to each object, and you can trigger deallocation immediately when the refcount hits zero. This is how the CPython implementation works, and how most varieties of C++ smart pointers work. (CPython also adds a mark/sweep GC as a backup to collect reference cycles, so it's very much like the hybrid design you describe.)
But refcounting is actually a pretty terrible solution, since it incurs a (relatively) expensive memory write (plus a memory barrier and/or lock, to ensure thread safety) every time a reference is passed, which happens quite a lot. In imperative languages like C++ it's possible (just difficult) to manage memory ownership through macros and coding conventions, but in functional languages like Lisp it becomes well-nigh impossible, because memory allocation usually happens implicitly due to local variable capture in a closure.
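To make that cost concrete, here's a minimal sketch in C of intrusive refcounting (the names rc_obj, rc_retain, and rc_release are invented for illustration, not from any particular library). Note the atomic operation on every retain and release -- that's the per-reference cost:

```c
#include <stdatomic.h>

/* Minimal sketch of intrusive reference counting. Names are invented. */
typedef struct rc_obj {
    atomic_size_t refcount;                  /* number of live references */
    void (*destroy)(struct rc_obj *self);    /* type-specific cleanup */
} rc_obj;

/* Every time a reference is copied, we pay for an atomic write. */
static rc_obj *rc_retain(rc_obj *o) {
    atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
    return o;
}

/* Every time a reference is dropped, we pay again -- and the decrement
 * needs stronger ordering, so that all writes to the object happen
 * before another thread can free it. */
static void rc_release(rc_obj *o) {
    if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1) {
        o->destroy(o);    /* count hit zero: deallocate immediately */
    }
}
```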
So it should come as no surprise that the first step toward a modern GC was invented for Lisp. It was called the "twospace allocator" or "twospace collector" and it worked exactly like it sounds: it divided allocatable memory (the "heap") into two spaces. Every new object was allocated out of the first space until it got too full, at which point allocation would stop and the runtime would walk the reference graph and copy only the live (still referenced) objects to the second space. After the live objects were copied, the first space would be marked empty, and allocation would resume, allocating new objects from the second space, until it got too full, at which point the live objects would be copied back to the first space and the process would start all over again.
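Here's a simplified sketch, in C, of what the copy phase can look like. This is in the style of Cheney's algorithm; the object layout and names are invented, and details like alignment and finding the roots are glossed over:

```c
#include <stddef.h>
#include <string.h>

/* Simplified two-space copy, in the style of Cheney's algorithm. */
typedef struct obj {
    struct obj *forward;    /* set once this object has been copied */
    size_t      size;       /* total size in bytes, including header */
    struct obj *fields[];   /* this object's pointers to other objects */
} obj;

static char *from_space, *to_space, *alloc_ptr, *scan_ptr;

/* Copy one object into to-space (if it hasn't been already) and
 * return its new address. */
static obj *evacuate(obj *o) {
    if (o == NULL) return NULL;
    if (o->forward) return o->forward;       /* already copied */
    obj *copy = (obj *)alloc_ptr;
    memcpy(copy, o, o->size);
    alloc_ptr += o->size;
    o->forward = copy;                       /* leave a forwarding pointer */
    return copy;
}

/* Walk the reference graph breadth-first: copy the roots, then scan
 * each copied object and copy whatever it points to. */
static void collect(obj **roots, size_t nroots) {
    alloc_ptr = scan_ptr = to_space;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = evacuate(roots[i]);
    while (scan_ptr < alloc_ptr) {
        obj *o = (obj *)scan_ptr;
        size_t nfields = (o->size - sizeof(obj)) / sizeof(obj *);
        for (size_t i = 0; i < nfields; i++)
            o->fields[i] = evacuate(o->fields[i]);
        scan_ptr += o->size;
    }
    /* Flip: from-space is now all garbage, and allocation resumes
     * in what used to be to-space. */
    char *tmp = from_space; from_space = to_space; to_space = tmp;
}
```

Notice that nothing in this loop ever touches a garbage object -- the dead objects are simply left behind in from-space.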
The advantage of the twospace collector is that, instead of doing O(N) work, where N is the total number of garbage objects, it would only do O(M) work, where M is the number of objects that were not garbage. Since in practice most objects are allocated and then deallocated within a short period of time, this can lead to a substantial performance improvement.
Additionally, the twospace collector made it possible to simplify the allocator side as well. Most malloc() implementations maintain what is called a "free list": a list of which blocks of memory are still available to be allocated. To allocate a new object, malloc() must scan the free list looking for a free block that's big enough. But the twospace allocator didn't bother with any of that: it allocated objects within each space like a stack, by just bumping a pointer forward by the desired number of bytes.
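Here's a sketch in C contrasting the two strategies -- first-fit free-list scanning versus pointer bumping. All names are invented, and bookkeeping like alignment and block splitting is omitted for brevity:

```c
#include <stddef.h>

/* Free-list allocation, malloc()-style: walk a linked list of free
 * blocks looking for one big enough (first-fit). */
typedef struct free_block {
    size_t size;
    struct free_block *next;
} free_block;

static free_block *free_list;

static void *list_alloc(size_t n) {
    for (free_block **p = &free_list; *p; p = &(*p)->next) {
        if ((*p)->size >= n) {       /* found a big-enough hole */
            free_block *b = *p;
            *p = b->next;            /* unlink it from the free list */
            return b;
        }
    }
    return NULL;                     /* no block fits */
}

/* Bump allocation, twospace-style: allocating is just advancing a
 * pointer, like pushing onto a stack. */
static char *space_cur, *space_end;

static void *bump_alloc(size_t n) {
    if ((size_t)(space_end - space_cur) < n)
        return NULL;                 /* space is full: time to collect */
    void *p = space_cur;
    space_cur += n;
    return p;
}
```

The bump allocator is just a comparison and an addition, which is why allocation in a copying collector can be nearly as cheap as pushing onto the call stack.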
So the twospace collector was much faster than malloc(), which was great, because Lisp programs did a lot more allocation than C programs did. Or, to put it another way: Lisp programs needed a way to allocate memory like a stack, but with lifetimes not limited to the execution stack -- in other words, a stack that could grow indefinitely without the program running out of memory. And, in fact, Raymond Chen argues that that's exactly how people should think about GC. I highly recommend his series of blog posts, starting with "Everybody thinks about garbage collection the wrong way".
But the twospace collector had a major flaw: no program could ever use more than half of the available RAM, because the other half was always sitting idle waiting to receive the next copy. So the history of GC techniques is largely the history of attempts to improve on the twospace collector, usually by exploiting heuristics about program behavior. GC algorithms inevitably involve tradeoffs, though, usually preferring to deallocate objects in batches instead of individually, which leads to delays before objects actually get deallocated.
Edit: To answer your follow-up question, modern GCs generally incorporate the idea of generational garbage collection, where objects are grouped into "generations" based on their age, and an object gets "promoted" to an older generation once it has survived long enough. Sometimes a small difference in object lifetime (e.g. in a request-driven server, keeping an object around for longer than one request) can lead to a large difference in how long it takes for the object to be deallocated, because surviving longer causes it to become "tenured", and tenured objects are collected far less often.
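A sketch of just the promotion policy, in C, might look like this. The threshold, names, and stubs are invented; real collectors track object ages in more compact ways:

```c
#include <stdio.h>

/* Invented sketch of a generational promotion policy. */
enum { PROMOTION_THRESHOLD = 4 };

typedef struct gen_obj {
    unsigned age;    /* how many young-generation collections survived */
} gen_obj;

/* Stubs standing in for the real copying machinery. */
static void promote_to_old_generation(gen_obj *o) { (void)o; puts("tenured"); }
static void keep_in_young_generation(gen_obj *o)  { (void)o; }

/* Called for each survivor during a minor (young-generation) collection.
 * Once an object has survived enough minor collections, it gets promoted
 * to the old generation, which is collected much less often. */
static void age_survivor(gen_obj *o) {
    if (++o->age >= PROMOTION_THRESHOLD)
        promote_to_old_generation(o);
    else
        keep_in_young_generation(o);
}
```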
You correctly observe that a true GC has to operate "beneath" the level of malloc() and free(). (As a side note, it's worth learning how malloc() and free() are implemented -- they aren't magic either!) Additionally, an effective GC has two basic options: be conservative (like the Boehm GC), treating anything that might be a pointer as a pointer and never moving objects, or provide some kind of "opaque pointer" type -- what Java and C# call "references". Opaque pointers are great for an allocation system, because the runtime can always move objects by updating the pointers to them. In a language like C, where you interact directly with raw memory addresses, it's never really safe to move objects.
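One classic way to implement opaque references is a handle table: the program holds small indices, and every access goes through one level of indirection (early JVMs worked roughly this way). A minimal sketch in C, with invented names:

```c
#include <stddef.h>

/* Invented sketch of opaque references via a handle table. */
#define MAX_HANDLES 1024

typedef size_t handle;                  /* what the program holds */

static void *handle_table[MAX_HANDLES]; /* handle -> current address */

static void *deref(handle h) {
    return handle_table[h];             /* every access is indirected */
}

/* When the collector relocates an object, it fixes up one table entry;
 * every handle to that object stays valid automatically. */
static void relocate(handle h, void *new_addr) {
    handle_table[h] = new_addr;
}
```

Modern JVMs actually update the references themselves during collection rather than paying the indirection on every access, but the handle-table version is the easiest way to see why opaque references make moving objects safe.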
And there are multiple options for GC algorithms. The standard Java runtime (HotSpot) ships with several collectors -- Serial, Parallel, CMS, and G1, among others -- each selectable and tunable through command-line flags such as -XX:+UseG1GC.
However, GCs aren't magic. Most GC algorithms simply exploit the time-space tradeoff of batching work, which means the gains in speed are usually paid for in increased memory usage (compared to manual memory management or refcounting). But the combination of better program performance and better programmer productivity, weighed against the low cost of RAM these days, usually makes the tradeoff worth it.
Hopefully that helps make things clearer!