searching within a Node on a CSB+ tree

https://stackoverflow.com/questions/12292213

30-06-2021
|

Question

I'm reading the paper, making B+-trees cache conscious in main memory. In Section 3.1.2, authors describe several approaches to searching within a CSB+ tree node.

Tha basic approach is to simply do a binary search using a conventional while loop.

The uniform approach is through code expansion, unfolding the while loop into if-then-else statements assuming all the keys are used.

Authors give the following example which exhibits an unfolding of the search for a node with up to 9 keys. The number in a node represents the position of the key being used in an if test

              4
            /   \
           2     6
          / \   / \
         1   3 5   8
                  / \
                 7   9

Then comes the confusing part:

If only 5 keys were actually present, we could traverse this tree with exactly 3 comparisons. On the other hand, an unfolding that put the deepest subtree at the left instead of the right would need 4 comparisons on some branches.

So why would it need more comparisons in the following tree:

              6
            /   \
           4     8
          / \   / \
         2   5 7   9
        / \
       1   3

Furthermore,

if we knew we had only five valid keys, we could hardcode a tree that, on average, used 2.67 comparisons rather than 3.

How does 2.67 come about?

Any hints would be appreciated. Also, a link directing me to code expansion knowledge would be helpful.

Actually, I'm not sure whether it's appropriate to ask a question on a paper because some key information may have been left out when transcribed here (the question may need reformatting). I just wish there could be someone who happens to have read the paper.

Thanks

La solution

The key point here is the following quotation from the same section:

we pad all the unused keys (keyList[nKeys..2d-1]) in a node with the largest possible key

Also is is important that in B+/CSB+ tree we search not for node values, but for intervals between these values. A set of possible values is split by 5 keys into 6 intervals.

Since most of the right sub-tree is filled with the largest possible key (L), we need less comparisons than usual:

              4
            /   \
           2     L
          / \   / \
         1   3 5   L
                  / \
                 L   L

Right descendant of root node has largest possible key, there is no need to check any node to the right of it. And exactly 3 comparisons are needed for every interval:

interval   comparisons
up to 1    k>4, k>2, k>1
 1..2      k>4, k>2, k>1
 2..3      k>4, k>2, k>3
 3..4      k>4, k>2, k>3
 4..5      k>4, k>L, k>5
 5..L      k>4, k>L, k>5

If we put the deepest subtree at the left, we have a tree, where non-empty nodes are placed one level deeper:

              L
            /   \
           4     L
          / \   / \
         2   5 L   L
        / \
       1   3

Search for node "1" in this tree requires to compare the key with 4 different nodes: L, 4, 2, and 1.

If we know we have only five valid keys, we have the following tree:

Here we can arrange comparisons in a way, that gives 2.67 comparisons on average:

interval   comparisons
up to 1    k>2, k>1
1..2       k>2, k>1
2..3       k>2, k>4, k>3
3..4       k>2, k>4, k>3
4..5       k>2, k>4, k>5
5..L       k>2, k>4, k>5

"Code expansion" is not a widely used term, so I cannot give you the most relevant link. I think, this is not very different from "Loop unwinding".

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow