Question

Suppose I have a set of words. For any given word, I would like to find if it is already in the set. What is some efficient data structure and/or algorithm for implementing that?

  1. For example, is the following way using a hash table a good way?

    • first store the set of words, by using some hash function and a hash table.

    • given a query word, calculate its hash value and see if it is in the hash table.

  2. In Python, is there already some data structure and/or algorithm which can be used to implement the way you recommend?

Thanks!

Was it helpful?

Solution

Python has sets. For example:

>>> foo = set()
>>> foo.add('word')
>>> 'word' in foo
True
>>> 'bar' in foo
False

OTHER TIPS

mywords = set(["this", "is", "a", "test"])

"test" in mywords       # => True
"snorkle" in mywords    # => False

Yes, python has a native dictionary data structure that is implemented using a HashTable, and so the in operator is executed in O(1) time on dictionaries. Per Allen Downey in Think Python

The in operator uses different algorithms for lists and dictionaries. For lists, it uses a search algorithm, as in Section 8.6. As the list gets longer, the search time gets longer in direct proportion. For dictionaries, Python uses an algorithm called a hashtable that has a re- markable property: the in operator takes about the same amount of time no matter how many items there are in a dictionary.

Alternatively, if you're building a large set of words overtime and the words aren't too long consider using trie.

http://en.wikipedia.org/wiki/Trie

https://pypi.python.org/pypi/PyTrie

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top