Question

I'm trying to implement a dynamic hash table using chain hashing (each element in the array is a linked list). I want to know, complexity wise, which of the following possibilities is better: 1. I should double the array size when the array is full, meaning each linked list has at least one element. 2. I should double the array size when I have N elements in total (in all of the linked lists) - where N is the array size.


Solution

There are plenty of hash table implementations in the wild to learn from, including the ones in the C++ standard library (unordered_set, unordered_map).

To answer your question: it's better to double the bucket count (the internal array) when the total element count reaches N. The other way around is harder and more time-consuming, because you would have to detect when every bucket has at least one element.

You need to keep a member that holds the element count.

You don't need to worry about users supplying a pathological hash function like { return 0; }.
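A minimal sketch of that approach, assuming a toy separate-chaining set (the class and method names are illustrative, not from any particular library): it keeps an element-count member and doubles the bucket array once the count reaches the number of buckets.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Sketch: separate-chaining hash set that doubles its bucket array
// when the element count reaches the bucket count (load factor 1).
template <typename Key, typename Hash = std::hash<Key>>
class ChainedHashSet {
public:
    explicit ChainedHashSet(std::size_t bucket_count = 8)
        : buckets_(bucket_count), size_(0) {}

    bool insert(const Key& key) {
        auto& bucket = buckets_[index_of(key, buckets_.size())];
        for (const Key& existing : bucket)
            if (existing == key) return false;   // already present
        bucket.push_back(key);
        ++size_;                                 // the element-count member
        rehash_if_needed();
        return true;
    }

    bool contains(const Key& key) const {
        const auto& bucket = buckets_[index_of(key, buckets_.size())];
        for (const Key& existing : bucket)
            if (existing == key) return true;
        return false;
    }

    std::size_t size() const { return size_; }

private:
    std::size_t index_of(const Key& key, std::size_t bucket_count) const {
        return Hash{}(key) % bucket_count;
    }

    void rehash_if_needed() {
        if (size_ < buckets_.size()) return;     // resize when count reaches N
        std::vector<std::list<Key>> bigger(buckets_.size() * 2);
        for (auto& bucket : buckets_)
            for (Key& key : bucket)
                bigger[index_of(key, bigger.size())].push_back(std::move(key));
        buckets_ = std::move(bigger);
    }

    std::vector<std::list<Key>> buckets_;
    std::size_t size_;                           // total number of stored elements
};
```

Because size_ is maintained on every insert, the resize check is a single O(1) comparison, which is exactly why this policy is cheaper than asking whether every bucket is occupied.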

Other suggestions

Complexity-wise, they're the same. Hash table complexity is given as average-case amortized O(1), because hash collisions, once you have a good hashing function, come down to a matter of luck. And the worst-case complexity of any hash table is O(N), no matter what you do.

That said, useful implementations resize based on load factor, which is the ratio between total elements and number of buckets ("array size"). Waiting until every bucket has at least one entry will cause sub-optimal behavior too often. But a load factor of 1 (N elements in N buckets) is probably too high; most implementations I've seen default to somewhere around 0.7 (7 elements for 10 buckets), and generally let the user configure the load factor (see both C++ and Java). This is a memory-versus-speed trade-off, and hash tables are often all about speed. In general, only profiling will show the right value for any given program.
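For reference, the C++ standard containers expose exactly this knob; a short example (the 0.7 value is just the ballpark figure mentioned above, not a recommendation):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> counts;

    // The standard default for max_load_factor() is 1.0; lowering it trades
    // memory for fewer collisions. 0.7 here is purely illustrative.
    counts.max_load_factor(0.7f);

    for (int i = 0; i < 1000; ++i)
        counts["key" + std::to_string(i)] = i;

    std::cout << "load factor:  " << counts.load_factor() << '\n'
              << "bucket count: " << counts.bucket_count() << '\n';
}
```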

Also, the size need not double. Typical vector implementations increase their size by 50% or 70% on each resize, because large-scale testing on real-world applications has shown that to be a better speed/memory trade-off than doubling. It stands to reason that a similar thing would apply to hash tables, although again this is subject to profiling.
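If you want to experiment with a non-doubling policy, the growth step itself is trivial; a sketch, where the 1.5x factor is purely an assumption that only profiling can justify:

```cpp
#include <cstddef>

// Illustrative growth policy: grow the bucket count by ~50% instead of
// doubling. The factor is an assumption to be validated by profiling.
std::size_t next_bucket_count(std::size_t current) {
    std::size_t grown = current + current / 2;    // 1.5x growth
    return grown > current ? grown : current + 1; // guard for very small sizes
}
```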

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow