Question

I've always been concerned about the space usage of a radix tree, but I haven't found any helpful discussion of it.

Now let's say we have a radix tree implementation like the one in Linux's radix-tree.c, which takes an integer key and uses each group of 6 bits to index the next position in the tree. I can easily think of cases where the radix tree's space usage is far greater than a binary search tree's. Please correct me if I'm wrong:

Use case: the 64 keys (0,1,1,1,1), (1,1,1,1,1), (2,1,1,1,1), ..., (63,1,1,1,1).

Here just for convenience, I use (a,b,c,d,e) to represent a 30-bit integer key, with each element standing for a 6-bit value. a is MSB and e is LSB.

Radix Tree:

For this use case, the radix tree will have a height of 5. The root is shared, but each key needs 4 separate nodes of its own below it, because the keys fall into different subtrees of the root. So there will be (5 - 1) * 64 + 1 = 257 nodes.

Each node contains 2^6 = 64 pointers, so assuming 4-byte pointers the tree uses 257 * 64 * 4 bytes = 65,792 bytes ≈ 65 KB.
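To make the indexing concrete, here is a rough sketch (plain C, not the actual kernel code; the names BITS_PER_LEVEL, LEVELS and SLOT_MASK are just mine) of how one of these keys breaks into five 6-bit indices, most significant chunk first:

    #include <stdio.h>

    #define BITS_PER_LEVEL 6
    #define LEVELS         5
    #define SLOT_MASK      ((1u << BITS_PER_LEVEL) - 1)   /* 0x3f */

    int main(void)
    {
        /* The key (7,1,1,1,1) in the (a,b,c,d,e) notation above. */
        unsigned int key = (7u << 24) | (1u << 18) | (1u << 12) | (1u << 6) | 1u;

        for (int level = 0; level < LEVELS; level++) {
            unsigned int shift = (LEVELS - 1 - level) * BITS_PER_LEVEL;
            unsigned int index = (key >> shift) & SLOT_MASK;
            printf("level %d: slot %u\n", level, index);  /* which of the 64 slots to follow */
        }
        return 0;
    }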

Binary Search Tree

We only care about how many keys there are. In this case there are 64 keys.

Assume each BST node uses 3 pointers; then the tree uses 64 * 3 * 4 bytes = 768 bytes.

Comparison

It looks like the radix tree is very space-inefficient: it uses roughly 85 times as much space as the binary search tree for the same set of keys! I don't understand why it is used, even in the Linux kernel.
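For reference, here is my back-of-the-envelope arithmetic as a tiny program (the 4-byte pointer size is just the assumption I used above):

    #include <stdio.h>

    int main(void)
    {
        const int slots = 64, levels = 5, keys = 64, ptr_size = 4;

        int radix_nodes = (levels - 1) * keys + 1;        /* shared root + 4 nodes per key = 257 */
        int radix_bytes = radix_nodes * slots * ptr_size; /* 257 * 64 * 4 = 65792 B */
        int bst_bytes   = keys * 3 * ptr_size;            /* 64 * 3 * 4 = 768 B */

        printf("radix: %d B, bst: %d B, ratio: %.1f\n",
               radix_bytes, bst_bytes, (double)radix_bytes / bst_bytes);  /* ratio ~85.7 */
        return 0;
    }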

Am I missing something? Thanks.


Solution

You asked for space complexity, so let's work it out.

If we consider a non-null pointer in a leaf node to be a value of interest, then it is not hard to prove by contradiction that the worst case, in space used per stored value, is a fully populated tree with one value per leaf node.

If branching is N-way (in your use case 64) and height is H (in your use case 5), there are N^(H-1) leaf nodes in this tree, storing an equal number of values. The total number of nodes is

1 + N + N^2 + ... + N^(H-1) = (N^H - 1) / (N - 1)

So the storage requirement measured in pointers is N times this amount.

(N^H - 1) * N / (N - 1)

This yields a storage efficiency of

(N^H - 1) * N / (N - 1)
-----------------------
        N^(H-1)

This is the total number of pointers divided by the count of valid data pointers.

As N gets bigger, this approaches roughly N; more precisely, for large H it tends to N^2 / (N - 1) ≈ N + 1. In your example use case it's about 65.0 (for N = 64). So we can say the storage complexity is O(NV), where V is the number of data values to be stored.
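If you want to check that number, here is a small throwaway program (nothing kernel-specific, just the formula above evaluated for N = 64, H = 5):

    #include <stdio.h>

    int main(void)
    {
        const unsigned long long N = 64, H = 5;

        unsigned long long n_pow_h = 1;                   /* N^H */
        for (unsigned long long i = 0; i < H; i++)
            n_pow_h *= N;

        unsigned long long total_nodes    = (n_pow_h - 1) / (N - 1); /* 1 + N + ... + N^(H-1) */
        unsigned long long total_pointers = N * total_nodes;
        unsigned long long values         = n_pow_h / N;  /* N^(H-1) leaves, one value each */

        printf("pointers per stored value: %.1f\n",
               (double)total_pointers / values);          /* ~65.0, i.e. roughly N + 1 */
        return 0;
    }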

Though we got here with first-principles analysis, it makes total sense. Storage for the leaf level of the complete tree dominates the rest by a factor of nearly N. The size of that storage is NV.

Of course the advantage of trees with enormous branching factors like this (and e.g. B-trees in databases) is that fewer node traversals are needed to get to the right leaf.

Moreover, when each traversal is a single array lookup as in the radix tree, you can't get much faster.

In your use case, a perfectly balanced binary search tree holding the full 2^30-key space would require up to 30 comparisons, with the attendant risk of mispredicted branches flushing the pipeline. Compared to 5 array-indexing operations, that could be much slower. Array indexing tends to be faster than comparison because it is non-branching code. But even if the per-step costs were the same, a binary tree would need only 2^5 = 32 elements to cause the same amount of lookup work as a radix tree containing 2^30 elements.

To generalize this: if key comparisons and array-index operations have the same cost, a binary tree of 2^H elements requires the same lookup effort as a radix tree of height H, which can hold up to N^H elements.
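To make that concrete, here is a rough sketch of the two lookup loops; the node layouts are simplified stand-ins of my own, not the kernel's actual struct radix_tree_node:

    #include <stddef.h>

    #define BITS_PER_LEVEL 6
    #define LEVELS         5
    #define SLOT_MASK      ((1u << BITS_PER_LEVEL) - 1)

    struct radix_node { void *slots[1 << BITS_PER_LEVEL]; };

    /* H = 5 non-branching array-indexing steps, one per 6-bit chunk of the key. */
    void *radix_lookup(struct radix_node *root, unsigned int key)
    {
        void *slot = root;
        for (int level = 0; level < LEVELS && slot != NULL; level++) {
            unsigned int shift = (LEVELS - 1 - level) * BITS_PER_LEVEL;
            slot = ((struct radix_node *)slot)->slots[(key >> shift) & SLOT_MASK];
        }
        return slot;  /* NULL if the key is absent */
    }

    struct bst_node { unsigned int key; void *value; struct bst_node *left, *right; };

    /* Up to ~30 compare-and-branch steps if the tree holds on the order of 2^30 keys. */
    void *bst_lookup(struct bst_node *node, unsigned int key)
    {
        while (node != NULL) {
            if (key == node->key)
                return node->value;
            node = (key < node->key) ? node->left : node->right;
        }
        return NULL;
    }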

As others have said, if the index bits for the top levels of the tree tend to share a few common prefixes (e.g. because they are the top bits of addresses in the same VM space), the worst-case storage behavior of the radix tree does not occur.

OTHER TIPS

Radix trees are used a lot for holding long strings with common/shared prefixes. In that case the radix tree will be much more economical.

For the sort of data you're specifying it's a different story.

Edit

A good example of long strings with shared prefixes is storing all file names, with their full paths, on your computer. With such data, a radix tree will be more economical than the alternatives and very fast for checking whether a file name exists. It might even be faster in some cases than a hash table.

Look at these 2 files:

  • "c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\include\streambuf"
  • "c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\include\string"

Their shared prefix is: "c:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\include\str", which is stored only once.

The radix tree in Linux originally appeared as a data structure to support the page cache, where the keys are file offsets and such pathological key distributions are uncommon.

(FWIW, the initial variant used a splay tree, but Linus said no :)

The radix tree is wide and shallow, so a lookup in it accesses comparatively few different cache lines, which is, obviously, quite beneficial for performance.

It also has the property that locality in page-cache accesses translates into locality in radix-tree node accesses, unlike alternative designs such as a hash table.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow