Advantages of Binary Search Trees over Hash Tables

https://stackoverflow.com/questions/4128546

29-09-2019
|

Question

What are the advantages of binary search trees over hash tables?

Hash tables can look up any element in Theta(1) time and it is just as easy to add an element....but I'm not sure of the advantages going the other way around.

Solution

Remember that Binary Search Trees (reference-based) are memory-efficient. They do not reserve more memory than they need to.

For instance, if a hash function has a range R(h) = 0...100, then you need to allocate an array of 100 (pointers-to) elements, even if you are just hashing 20 elements. If you were to use a binary search tree to store the same information, you would only allocate as much space as you needed, as well as some metadata about links.

OTHER TIPS

One advantage that no one else has pointed out is that binary search tree allows you to do range searches efficiently.

In order to illustrate my idea, I want to make an extreme case. Say you want to get all the elements whose keys are between 0 to 5000. And actually there is only one such element and 10000 other elements whose keys are not in the range. BST can do range searches quite efficiently since it does not search a subtree which is impossible to have the answer.

While, how can you do range searches in a hash table? You either need to iterate every bucket space, which is O(n), or you have to look for whether each of 1,2,3,4... up to 5000 exists. (what about the keys between 0 and 5000 are an infinite set? for example keys can be decimals)

One "advantage" of a binary tree is that it may be traversed to list off all elements in order. This is not impossible with a Hash table but is not a normal operation one design into a hashed structure.

In addition to all the other good comments:

Hash tables in general have better cache behavior requiring less memory reads compared to a binary tree. For a hash table you normally only incur a single read before you have access to a reference holding your data. The binary tree, if it is a balanced variant, requires something in the order of k * lg(n) memory reads for some constant k.

On the other hand, if an enemy knows your hash-function the enemy can enforce your hash table to make collisions, greatly hampering its performance. The workaround is to choose the hash-function randomly from a family, but a BST does not have this disadvantage. Also, when the hash table pressure grows too much, you often tend to enlargen and reallocate the hash table which may be an expensive operation. The BST has simpler behavior here and does not tend to suddenly allocate a lot of data and do a rehashing operation.

Trees tend to be the ultimate average data structure. They can act as lists, can easily be split for parallel operation, have fast removal, insertion and lookup on the order of O(lg n). They do nothing particularly well, but they don't have any excessively bad behavior either.

Finally, BSTs are much easier to implement in (pure) functional languages compared to hash-tables and they do not require destructive updates to be implemented (the persistence argument by Pascal above).

The main advantages of a binary tree over a hash table is that the binary tree gives you two additional operations you can't do (easily, quickly) with a hash table

find the element closest to (not necessarily equal to) some arbitrary key value (or closest above/below)
iterate through the contents of the tree in sorted order

The two are connected -- the binary tree keeps its contents in a sorted order, so things that require that sorted order are easy to do.

A (balanced) binary search tree also has the advantage that its asymptotic complexity is actually an upper bound, while the "constant" times for hash tables are amortized times: If you have a unsuitable hash function, you could end up degrading to linear time, rather than constant.

A hashtable would take up more space when it is first created - it will have available slots for the elements that are yet to be inserted (whether or not they are ever inserted), a binary search tree will only be as big as it needs to be. Also, when a hash-table needs more room, expanding to another structure could be time-consuming, but that might depend on the implementation.

A binary search tree can be implemented with a persistent interface, where a new tree is returned but the old tree continues to exist. Implemented carefully, the old and new trees shares most of their nodes. You cannot do this with a standard hash table.

A binary tree is slower to search and insert into, but has the very nice feature of the infix traversal which essentially means that you can iterate through the nodes of the tree in a sorted order.

Iterating through the entries of a hash table just doesn't make a lot of sense because they are all scattered in memory.

BSTs also provide the "findPredecessor" and "findSuccessor" operations (To find the next smallest and next largest elements) in O(logn) time, which might also be very handy operations. Hash Table can't provide in that time efficiency.

From Cracking the Coding Interview, 6th Edition

We can implement the hash table with a balanced binary search tree (BST) . This gives us an O(log n) lookup time. The advantage of this is potentially using less space, since we no longer allocate a large array. We can also iterate through the keys in order, which can be useful sometimes.

If you want to access the data in a sorted manner, then a sorted list has to be maintained in parallel to the hash table. A good example is Dictionary in .Net. (see http://msdn.microsoft.com/en-us/library/3fcwy8h6.aspx).

This has the side-effect of not only slowing inserts, but it consumes a larger amount of memory than a b-tree.

Further, since a b-tree is sorted, it is simple to find ranges of results, or to perform unions or merges.

It also depends on the use, Hash allows to locate exact match. If you want to query for a range then BST is the choice. Suppose you have a lots of data e1, e2, e3 ..... en.

With hash table you can locate any element in constant time.

If you want to find range values greater than e41 and less than e8, BST can quickly find that.

The key thing is the hash function used to avoid a collision. Of course, we cannot totally avoid a collision, in which case we resort to chaining or other methods. This makes retrieval no longer constant time in worst cases.

Once full, hash table has to increase its bucket size and copy over all the elements again. This is an additional cost not present over BST.

A hashmap is a set associative array. So, your array of input values gets pooled into buckets. In an open addressing scheme, you have a pointer to a bucket, and each time you add a new value into a bucket, you find out where in the bucket there are free spaces. There are a few ways to do this- you start at the beginning of the bucket and increment the pointer each time and test whether its occupied. This is called linear probing. Then, you can do a binary search like add, where you double the difference between the beginning of the bucket and where you double up or back down each time you are searching for a free space. This is called quadratic probing. OK. Now the problems in both these methods is that if the bucket overflows into the next buckets address, then you need to-

Double each buckets size- malloc(N buckets)/change the hash function- Time required: depends on malloc implementation
Transfer/Copy each of the earlier buckets data into the new buckets data. This is an O(N) operation where N represents the whole data

OK. but if you use a linkedlist there shouldn't be such a problem right? Yes, In linked lists you don't have this problem. Considering each bucket to begin with a linked list, and if you have 100 elements in a bucket it requires you to traverse those 100 elements to reach the end of the linkedlist hence the List.add(Element E) will take time to-

Hash the element to a bucket- Normal as in all implementations
Take time to find the last element in said bucket- O(N) operation.

The advantage of the linkedlist implementation is that you don't need the memory allocation operation and O(N) transfer/copy of all buckets as in the case of the open addressing implementation.

So, the way to minimize the O(N) operation is to convert the implementation to that of a Binary Search Tree where find operations are O(log(N)) and you add the element in its position based on it's value. The added feature of a BST is that it comes sorted!

Hash Tables are not good for indexing. When you are searching for a range, BSTs are better. That's the reason why most database indexes use B+ trees instead of Hash Tables

Binary search trees are good choice to implement dictionary if the keys have some total order (keys are comparable) defined on them and you want to preserve the order information.

As BST preserves the order information, it provides you with four additional dynamic set operations that cannot be performed (efficiently) using hash tables. These operations are:

Maximum
Minimum
Successor
Predecessor

All these operations like every BST operation have time complexity of O(H). Additionally all the stored keys remain sorted in the BST thus enabling you to get the sorted sequence of keys just by traversing the tree in in-order.

In summary if all you want is operations insert, delete and remove then hash table is unbeatable (most of the time) in performance. But if you want any or all the operations listed above you should use a BST, preferably a self-balancing BST.

Binary search trees can be faster when used with string keys. Especially when strings are long.

Binary search trees using comparisons for less/greater which are fast for strings (when they are not equal). So a BST can quickly answer when a string is not found. When it's found it will need to do only one full comparison.

In a hash table. You need to calculate the hash of the string and this means you need to go through all bytes at least once to compute the hash. Then again, when a matching entry is found.

main advantage of hash table is that it does almost all ops in ~=O(1). And its very easy to understand and implement. It does solve many "interview problems" efficiently. So if u want to crack a coding interview, make best friends with hash table ;-)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow