Question

Wondering what the general structure is for storing a tree in memory most efficiently (for access operation, and space).

For the access operation, it should somehow take advantage of contiguous memory (memory locality), which I am having a hard time figuring out for a tree. That way it can make better use of the cache during traversal or random subtree access. Also assume we are only concerned with the in-memory structure, not going to disk or anything. I am specifically thinking of B+-trees, but any BST would probably work; I am not sure.

Specifically wondering how to organize the pointers for efficiency.

For example, say you have this JS tree:

var tree = {
  a: {
    b: {
      c: {
        d: {
          e: 100,
          f: 200,
          g: 300
        },
        h: {
          i: 400,
          j: 500,
          k: 600
        }
      }
    }
  }
}

If you imagine that this is actually implemented as a hash table (the keys are hashed to get the value), then what is the optimal layout in memory for the tree? Each of the layers would be its own hash table, which might use an underlying array as its storage. This means each level's nodes would be in the same region as each other, but probably not adjacent.

If instead of a hash table, it was implemented as some sort of linked list, each level could have the children next to each other, but the lists themselves might be in arbitrary places in memory still. The question is, can the lists be organized so they are also close in memory somehow to make things more efficient. Or perhaps the lists have a standard structure/size range, so we can just jump to a spot in memory for faster lookups. That sort of stuff.

Update

Assume roughly these limitations: a need for a few million pieces of data and ~500 MB of space.

Was it helpful?

Solution

The optimal layout will almost certainly depend on the application. That said, anything that packs as much data that will be accessed together as possible into a single cache line (typically 64 bytes on current processors) will do the job. For a search tree, assuming you have reasonably small keys, storing a sorted list of as many keys and pointers as you can fit into a cache line is probably the best you can achieve.

Which is to say: don't arbitrarily limit yourself to a binary tree, use the available space in a cache line to minimize the number of cache lines you need to hit. Basically, the same considerations that lead to using B+ trees for on-disk structures, but with a smaller target size.
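As a concrete sketch (the field sizes here are assumptions for illustration, not from the answer): with 32-bit keys and 32-bit child indices, a node holding 7 sorted keys, 8 child links, and a count fills a 64-byte cache line exactly:

```cpp
#include <cstdint>

// Hypothetical node sized to one 64-byte cache line: 7 keys and
// 8 children at 4 bytes each, plus a 4-byte count, is exactly
// 4 + 7*4 + 8*4 = 64 bytes.
struct alignas(64) CacheLineNode
{
    uint32_t count;       // number of keys currently stored
    uint32_t keys[7];     // sorted keys
    int32_t  children[8]; // child indices; -1 in place of a null pointer
};

static_assert(sizeof(CacheLineNode) == 64,
              "node should fill exactly one cache line");
```

A lookup then scans the sorted keys within one cache line and follows a single child link, so each level of the tree costs roughly one cache miss.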

For string keys, remove the common prefix of all keys in the node, so you can fit more items in. You may need to store variable length keys, so being able to work with a variable number of items in each node is important.
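For instance, a small sketch of computing that common prefix (this helper is illustrative, not from the answer): for keys kept in sorted order, the prefix shared by the first and last key is shared by all of them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// For a node's keys in sorted order, the longest common prefix of the
// first and last key is the longest common prefix of all of them, so
// it can be stored once and only the suffixes stored per key.
std::string common_prefix(const std::vector<std::string>& sorted_keys)
{
    if (sorted_keys.empty())
        return "";
    const std::string& a = sorted_keys.front();
    const std::string& b = sorted_keys.back();
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n])
        ++n;
    return a.substr(0, n);
}
```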

Other tips

For the access operation, it should take advantage of contiguous memory (memory locality) somehow which I am having a hard time determining how that should work for a tree.

The generalized solution I use nowadays is not necessarily optimal (as pointed out, the optimal solution depends on the use case, and you can get extremely compact with further assumptions), but it keeps the assumptions minimal, works for N-ary and unbalanced trees, and achieves both locality of reference among nodes frequently accessed together and reasonable compactness. It is a two-pass approach: the first pass is just concerned with building the tree, and the second pass traverses it and optimizes for locality of reference.

The first aspect of the solution is that it stores all the nodes contiguously in one array (e.g. std::vector in C++ or ArrayList in Java) for the entire tree. The parent->child links are stored not as 64-bit pointers, which are prone to invalidation if the array is resized, but as 32-bit indices; halving the size of the links also tends to let more neighboring nodes fit into a cache line.

struct BstNode
{
    ...

    // Child links as indices into the nodes array. -1 is
    // used in place of a null pointer.
    int32_t children[2];
};

// The array storing all the nodes for the tree.
std::vector<BstNode> nodes;

Now depending on how the tree is built, the links could be pointing all over the place into the array, skipping around a lot for your common case traversal patterns.

So I have a second pass inside a copy constructor which creates a copy of the tree, and the copy constructor traverses the tree in a way that aligns with the critical execution paths and inserts the nodes in that traversal order. That has the effect, with this approach, of putting nodes frequently accessed together during a traversal right next to each other in this nodes array.
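A minimal sketch of what such a second pass might look like (the `BstNode` fields and the choice of preorder here are assumptions for illustration; the real traversal order should match your critical execution paths):

```cpp
#include <cstdint>
#include <vector>

struct BstNode
{
    int32_t key;         // payload (an assumption for illustration)
    int32_t children[2]; // indices into the nodes array; -1 = null
};

// Second-pass sketch: copy the subtree rooted at `node` from `src`
// into `dst` in preorder, so parents and children visited together
// end up adjacent in the new array. Returns the copy's index.
int32_t copy_in_traversal_order(const std::vector<BstNode>& src,
                                int32_t node,
                                std::vector<BstNode>& dst)
{
    if (node == -1)
        return -1;
    const int32_t copy = static_cast<int32_t>(dst.size());
    dst.push_back({src[node].key, {-1, -1}});
    // Recurse first, then patch the links: push_back may reallocate
    // `dst`, so avoid holding a reference across the recursive calls.
    const int32_t left  = copy_in_traversal_order(src, src[node].children[0], dst);
    const int32_t right = copy_in_traversal_order(src, src[node].children[1], dst);
    dst[copy].children[0] = left;
    dst[copy].children[1] = right;
    return copy;
}
```

After the pass, the root of the copy sits at index 0 and each node's descendants follow it immediately, regardless of how scattered the source array was.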

If the tree is prone to be updated (insertions and removals) often, and I still want to improve spatial locality (I'd need to see some hotspots to consider even doing this for a dynamic use case), then I might periodically do this copy construction pass to optimize its memory layout when the tree is marked as having been changed and swap the optimized copy with the original.

I have a little demo I whipped up for another question using this technique of optimizing linked structures for locality of reference. Again, it's not optimal, but it's easy to implement and a very generalized, "decent" solution. It shows half a million agents moving around per frame with collision detection (which involves half a million queries into the spatial index per frame) and CPU rasterization, all happening in one thread at over 100 FPS. The frame rates do drop a bit when I zoom out all the way to see all half a million agents at once, but that's due to the rasterization rather than the link traversal. Its performance is largely the result of this sort of locality-of-reference optimization, using the "links as indices into one array" approach for the spatial index used for collision detection.


Again, I could do better than this (it was just a quick 2-hour implementation for an SO question, and I deliberately wrote the code single-threaded and simple to make it easier to understand) if I made further assumptions that created a more tailored structure, but this is a quick and lazy sort of approach that works for any kind of linked structure. You can also create linked lists this way, using an array with indices as links.

The linked list storing nodes in a contiguous array: [diagram]

Removing an element from the middle in constant time: [two diagrams]

Inserting a new element: [diagram]

And you can see the spatial locality of traversal degrades when we insert to and remove from the middle of the linked list (because constant-time insertions to the middle actually append to the back of the array), but that spatial locality is quickly restored by simply copying the list: the copy traverses the original in order and inserts nodes in traversal order. It's the same approach here, with the doubly-linked list node looking like this:

struct DllNode
{
     ...

     // Using -1 in place of a null.
     int32_t prev;
     int32_t next;
};

// Stores all the nodes for the linked list in one array.
std::vector<DllNode> nodes;

// Stores an index to the first removed node in the array to reclaim
// on subsequent insertions, or -1 if there are no removed (free) nodes.
int32_t free_head;

It's somewhat like combining a free list allocator with a separate linked list implementation built on that allocator, but consolidated into one data structure with a much simpler and less error-prone implementation, and with half the size of the links on 64-bit architectures. These days I barely use custom allocators, because I find it so much simpler to consolidate the concerns of efficient memory allocation and memory layout into the data structure itself rather than fiddling with allocators. About the only reason I still use custom allocators is non-performance-related: for example, superficially imposing a limit on the amount of memory an allocator can allocate, to test that out-of-memory exceptions are handled properly.
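Putting those pieces together, here is a minimal sketch of the allocation side (the `value` field and the `alloc_node`/`free_node` names are mine, for illustration): removed slots are chained through `next` starting at `free_head`, and insertions reuse them before growing the array.

```cpp
#include <cstdint>
#include <vector>

struct DllNode
{
    int32_t value; // payload (an assumption for illustration)
    int32_t prev;  // -1 in place of a null
    int32_t next;
};

struct Dll
{
    // All nodes for the list live in one array; removed slots form a
    // free list threaded through `next`, starting at `free_head`.
    std::vector<DllNode> nodes;
    int32_t free_head = -1;

    int32_t alloc_node(int32_t value)
    {
        if (free_head != -1)
        {
            // Reuse a previously removed slot in constant time.
            const int32_t index = free_head;
            free_head = nodes[index].next;
            nodes[index] = {value, -1, -1};
            return index;
        }
        nodes.push_back({value, -1, -1});
        return static_cast<int32_t>(nodes.size()) - 1;
    }

    void free_node(int32_t index)
    {
        // Push the slot onto the free list; it is reclaimed by a later
        // alloc_node instead of shifting the rest of the array.
        nodes[index].next = free_head;
        free_head = index;
    }
};
```

Linking `prev`/`next` between live nodes works exactly as in a pointer-based doubly-linked list, just with indices.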

Same thing for like a quadtree:

struct QuadTreeNode
{
    ...
    int32_t children[4];
};

// Stores all the nodes for the tree.
std::vector<QuadTreeNode> nodes;

... though for quadtrees I prefer this representation and find I get better performance out of it (even though it can create more nodes than necessary, the nodes end up being much smaller):

struct QuadTreeNode
{
    ...
    // All four children are always stored contiguously, so we
    // only need the index of the first child.
    int32_t first_child;
};
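A sketch of how the four contiguous children might be allocated under this representation (the `alloc_children` helper is illustrative, not from the original):

```cpp
#include <cstdint>
#include <vector>

struct QuadTreeNode
{
    // -1 means "no children"; otherwise the four children occupy
    // indices first_child .. first_child + 3.
    int32_t first_child;
};

// Illustrative helper: allocate all four children contiguously at the
// back of the array and return the index of the first one. Child i of
// a node is then simply nodes[node.first_child + i].
int32_t alloc_children(std::vector<QuadTreeNode>& nodes)
{
    const int32_t first = static_cast<int32_t>(nodes.size());
    nodes.resize(nodes.size() + 4, QuadTreeNode{-1});
    return first;
}
```

Because the four children are always adjacent, a node needs only one 4-byte link instead of four, and visiting all children of a node touches one contiguous run of memory.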

And so forth. I use the same approach for any linked structure these days which needs to be reasonably efficient and avoid cache misses all over the place but not like as efficient as possible (I do use much more complex approaches for things like offline rendering which has the most critical performance requirements with its bounding volume hierarchy).

The very first thing you need is one array to store all the nodes since that's going to be a practical requirement as far as memory allocation to have any kind of spatial locality. Then you can just use indices instead of pointers. Finally when you get to that stage, you realize a copy construction pass can just insert nodes in traversal order to get nodes you frequently traversed together right next to each other in memory (in the array) for the copy.

Please consider that the effectiveness of cache memory will be affected very little by the way you organize/structure your tree, and very much (perhaps entirely) by the tree's usage scenario.

If you populate a bit and then immediately read the same elements, the cache should serve you well. If you populate, do other stuff, and then read (a far more practical scenario), probably not so much, because by then the cache will have been filled with that other stuff.

Licensed under: CC-BY-SA with attribution