To start with, a point of terminology: "garbage collection" means different things to different people, and some GC schemes are more sophisticated than others. Some people consider reference counting to be a form of GC, but personally I consider "true GC" to be distinct from reference counting.
With refcounts, there is an integer tracking the number of references to each object, and you can trigger deallocation immediately when the refcount hits zero. This is how the CPython implementation works, and how most varieties of C++ smart pointers work. (CPython also adds a mark/sweep GC as a backup to collect reference cycles, so it's very much like the hybrid design you describe.)
But refcounting is actually a pretty terrible solution, since it incurs a (relatively) expensive memory write (plus a memory barrier and/or lock, to ensure thread safety) every time a reference is passed, which happens quite a lot. In imperative languages like C++ it's possible (just difficult) to manage memory ownership through macros and coding conventions, but in functional languages like Lisp it becomes well-nigh impossible, because memory allocation usually happens implicitly due to local variable capture in a closure.
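To make that cost concrete, here's a minimal sketch in C of intrusive refcounting (the names rc_obj, rc_retain, and rc_release are invented for illustration, not from any particular library). Note the atomic operation on every retain and release -- that's the per-reference cost:

```c
#include <stdatomic.h>

/* Minimal sketch of intrusive reference counting. Names are invented. */
typedef struct rc_obj {
    atomic_size_t refcount;                  /* number of live references */
    void (*destroy)(struct rc_obj *self);    /* type-specific cleanup */
} rc_obj;

/* Every time a reference is copied, we pay for an atomic write. */
static rc_obj *rc_retain(rc_obj *o) {
    atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
    return o;
}

/* Every time a reference is dropped, we pay again -- and the decrement
 * needs stronger ordering, so that all writes to the object happen
 * before another thread can free it. */
static void rc_release(rc_obj *o) {
    if (atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel) == 1) {
        o->destroy(o);    /* count hit zero: deallocate immediately */
    }
}
```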
So it should come as no surprise that the first step toward a modern GC was invented for Lisp. It was called the "twospace allocator" or "twospace collector" and it worked exactly like it sounds: it divided allocatable memory (the "heap") into two spaces. Every new object was allocated out of the first space until it got too full, at which point allocation would stop and the runtime would walk the reference graph and copy only the live (still referenced) objects to the second space. After the live objects were copied, the first space would be marked empty, and allocation would resume, allocating new objects from the second space, until it got too full, at which point the live objects would be copied back to the first space and the process would start all over again.
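Here's a simplified sketch, in C, of what the copy phase can look like. This is in the style of Cheney's algorithm; the object layout and names are invented, and details like alignment and finding the roots are glossed over:

```c
#include <stddef.h>
#include <string.h>

/* Simplified two-space copy, in the style of Cheney's algorithm. */
typedef struct obj {
    struct obj *forward;    /* set once this object has been copied */
    size_t      size;       /* total size in bytes, including header */
    struct obj *fields[];   /* this object's pointers to other objects */
} obj;

static char *from_space, *to_space, *alloc_ptr, *scan_ptr;

/* Copy one object into to-space (if it hasn't been already) and
 * return its new address. */
static obj *evacuate(obj *o) {
    if (o == NULL) return NULL;
    if (o->forward) return o->forward;       /* already copied */
    obj *copy = (obj *)alloc_ptr;
    memcpy(copy, o, o->size);
    alloc_ptr += o->size;
    o->forward = copy;                       /* leave a forwarding pointer */
    return copy;
}

/* Walk the reference graph breadth-first: copy the roots, then scan
 * each copied object and copy whatever it points to. */
static void collect(obj **roots, size_t nroots) {
    alloc_ptr = scan_ptr = to_space;
    for (size_t i = 0; i < nroots; i++)
        roots[i] = evacuate(roots[i]);
    while (scan_ptr < alloc_ptr) {
        obj *o = (obj *)scan_ptr;
        size_t nfields = (o->size - sizeof(obj)) / sizeof(obj *);
        for (size_t i = 0; i < nfields; i++)
            o->fields[i] = evacuate(o->fields[i]);
        scan_ptr += o->size;
    }
    /* Flip: from-space is now all garbage, and allocation resumes
     * in what used to be to-space. */
    char *tmp = from_space; from_space = to_space; to_space = tmp;
}
```

Notice that nothing in this loop ever touches a garbage object -- the dead objects are simply left behind in from-space.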
The advantage of the twospace collector is that, instead of doing O(N) work, where N is the total number of garbage objects, it would only do O(M) work, where M is the number of objects that were not garbage. Since in practice most objects are allocated and then deallocated within a short period of time, this can lead to a substantial performance improvement.
Additionally, the twospace collector made it possible to simplify the allocator side as well. Most malloc() implementations maintain what is called a "free list": a list of which blocks of memory are still available to be allocated. To allocate a new object, malloc() must scan the free list looking for a free block that's big enough. But the twospace allocator didn't bother with any of that: it allocated objects within each space like a stack, by just bumping a pointer forward by the desired number of bytes.
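Here's a sketch in C contrasting the two strategies -- first-fit free-list scanning versus pointer bumping. All names are invented, and bookkeeping like alignment and block splitting is omitted for brevity:

```c
#include <stddef.h>

/* Free-list allocation, malloc()-style: walk a linked list of free
 * blocks looking for one big enough (first-fit). */
typedef struct free_block {
    size_t size;
    struct free_block *next;
} free_block;

static free_block *free_list;

static void *list_alloc(size_t n) {
    for (free_block **p = &free_list; *p; p = &(*p)->next) {
        if ((*p)->size >= n) {       /* found a big-enough hole */
            free_block *b = *p;
            *p = b->next;            /* unlink it from the free list */
            return b;
        }
    }
    return NULL;                     /* no block fits */
}

/* Bump allocation, twospace-style: allocating is just advancing a
 * pointer, like pushing onto a stack. */
static char *space_cur, *space_end;

static void *bump_alloc(size_t n) {
    if ((size_t)(space_end - space_cur) < n)
        return NULL;                 /* space is full: time to collect */
    void *p = space_cur;
    space_cur += n;
    return p;
}
```

The bump allocator is just a comparison and an addition, which is why allocation in a copying collector can be nearly as cheap as pushing onto the call stack.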
So the twospace collector was much faster than malloc(), which was great, because Lisp programs did a lot more allocation than C programs did. Or, to put it another way: Lisp programs needed a way to allocate memory like a stack, but with lifetimes not limited to the execution stack -- in other words, a stack that could grow indefinitely without the program running out of memory. And, in fact, Raymond Chen argues that that's exactly how people should think about GC. I highly recommend his series of blog posts, starting with "Everybody thinks about garbage collection the wrong way".
But the twospace collector had a major flaw: no program could ever use more than half of the available RAM, because the other half was always sitting idle waiting to receive the next copy. So the history of GC techniques is largely the history of attempts to improve on the twospace collector, usually by exploiting heuristics about program behavior. GC algorithms inevitably involve tradeoffs, though, usually preferring to deallocate objects in batches instead of individually, which leads to delays before objects actually get deallocated.
Edit: To answer your follow-up question, modern GCs generally incorporate the idea of generational garbage collection, where objects are grouped into "generations" based on their age, and an object gets "promoted" to an older generation once it has survived long enough. Sometimes a small difference in object lifetime (e.g. in a request-driven server, keeping an object around for longer than one request) can lead to a large difference in how long it takes for the object to be deallocated, because surviving longer causes it to become "tenured", and tenured objects are collected far less often.
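A sketch of just the promotion policy, in C, might look like this. The threshold, names, and stubs are invented; real collectors track object ages in more compact ways:

```c
#include <stdio.h>

/* Invented sketch of a generational promotion policy. */
enum { PROMOTION_THRESHOLD = 4 };

typedef struct gen_obj {
    unsigned age;    /* how many young-generation collections survived */
} gen_obj;

/* Stubs standing in for the real copying machinery. */
static void promote_to_old_generation(gen_obj *o) { (void)o; puts("tenured"); }
static void keep_in_young_generation(gen_obj *o)  { (void)o; }

/* Called for each survivor during a minor (young-generation) collection.
 * Once an object has survived enough minor collections, it gets promoted
 * to the old generation, which is collected much less often. */
static void age_survivor(gen_obj *o) {
    if (++o->age >= PROMOTION_THRESHOLD)
        promote_to_old_generation(o);
    else
        keep_in_young_generation(o);
}
```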
You correctly observe that a true GC has to operate "beneath" the level of malloc() and free(). (As a side note, it's worth learning how malloc() and free() are implemented -- they aren't magic either!) Additionally, an effective GC has two basic options: be conservative (like the Boehm GC), treating anything that might be a pointer as a pointer and never moving objects, or provide some kind of "opaque pointer" type -- what Java and C# call "references". Opaque pointers are great for an allocation system, because the runtime can always move objects by updating the pointers to them. In a language like C, where you interact directly with raw memory addresses, it's never really safe to move objects.
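One classic way to implement opaque references is a handle table: the program holds small indices, and every access goes through one level of indirection (early JVMs worked roughly this way). A minimal sketch in C, with invented names:

```c
#include <stddef.h>

/* Invented sketch of opaque references via a handle table. */
#define MAX_HANDLES 1024

typedef size_t handle;                  /* what the program holds */

static void *handle_table[MAX_HANDLES]; /* handle -> current address */

static void *deref(handle h) {
    return handle_table[h];             /* every access is indirected */
}

/* When the collector relocates an object, it fixes up one table entry;
 * every handle to that object stays valid automatically. */
static void relocate(handle h, void *new_addr) {
    handle_table[h] = new_addr;
}
```

Modern JVMs actually update the references themselves during collection rather than paying the indirection on every access, but the handle-table version is the easiest way to see why opaque references make moving objects safe.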
And there are multiple options for GC algorithms. The standard Java runtime (HotSpot) ships with several collectors -- Serial, Parallel, CMS, and G1, among others -- each selectable and tunable through command-line flags such as -XX:+UseG1GC.
However, GCs aren't magic. Most GC algorithms simply exploit the time-space tradeoff of batching work, which means the gains in speed are usually paid for in increased memory usage (compared to manual memory management or refcounting). But the combination of better program performance and better programmer productivity, weighed against the low cost of RAM these days, usually makes the tradeoff worth it.
Hopefully that helps make things clearer!