Question

I am storing 10^9 keys in a BST. Compared to having lets say having multiple BSTs of size 10^6 containing chunk of the bigger tree? Search through all of them executing in parallel.

I am talking about only search performance here, Given that processing power is not a bottle neck.

No correct solution

OTHER TIPS

It depends entirely on your key schema.

For example, let's say your keys are surnames, equally distributed across the twenty-six English letters. If you're looking for Pax Diablo, you can immediately remove 25/26ths of your search space, looking only in the D tree (for Diablo).

With a balanced binary tree, you would have to traverse about 4.7 tree levels on average (log226 is about 4.700439718).

So, yes, it can be more efficient, provided the up-front operation has minimal complexity. In the given example, the selection of one of twenty-six tress is O(1), based on the first character of the name and an array lookup to find the tree.


In the case where the keys are actually numbers from zero to a billion as your comments indicate, you could still have the same efficiency, depending on data distribution. If they're equally distributed (or even close), you could maintain a thousand different trees (from your statement that you want trees of size one million) based on the first three digits of the number and reduce the initial search by a factor of 1000 (about ten tree levels).

Of course, the distribution is important. If all your numbers turn out to be less that a million, they'll all be in the first tree and this scheme will save you nothing (in fact it'll add a useless first step).

Consider using a hash table. Look up for such big keyset should be noticably faster. A hashmap will have constant amortized search complexity as opposed to the logarithmic of a BST.

Also as you are talking about a huge tree here maybe you should take a look at b+ trees.

I doubt the approach you try to take will be more efficient than using the suggestions above. The depth of a binary tree grows very slowly(assuming it is balanced). On the other hand with your approach synchronization when you produce the output will be cumbersome.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top