Question

How would I go about making a program where the user enters a string, and the program generates a list of words beginning with that string?

Ex:
User: "abd"
Program: abdicate, abdomen, abduct...

Thanks!


Edit: I'm using python, but I assume that this is a fairly language-independent problem.


Solution

If you're on a Debian[-like] machine:

#!/bin/bash
echo -n "Enter a word: "
read input
grep "^$input" /usr/share/dict/words

Takes all of 0.040s on my P200.

OTHER TIPS

Use a trie.

Add your list of words to a trie. Each path from the root to a leaf is a valid word. A path from a root to an intermediate node represents a prefix, and the children of the intermediate node are valid completions for the prefix.
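For illustration, here is a minimal trie sketch in Python using nested dicts with an end-of-word marker (the insert/complete helper names and the sample words are made up for this example):

def insert(trie, word):
    # walk/create one nested dict per character
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node['$'] = True  # end-of-word marker

def complete(trie, prefix):
    # walk down to the node for the prefix, then collect every word below it
    node = trie
    for ch in prefix:
        if ch not in node:
            return  # no word has this prefix
        node = node[ch]
    stack = [(node, prefix)]
    while stack:
        node, word = stack.pop()
        for key, child in node.items():
            if key == '$':
                yield word
            else:
                stack.append((child, word + key))

trie = {}
for w in ('abdicate', 'abdomen', 'abduct', 'able'):
    insert(trie, w)
print(sorted(complete(trie, 'abd')))  # ['abdicate', 'abdomen', 'abduct']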

One of the best ways to do this is to use a directed graph to store your dictionary. It takes a little bit of setting up, but once done it is fairly easy to then do the type of searches you are talking about.

The nodes in the graph correspond to a letter in your word, so each node will have one incoming link and up to 26 (in English) outgoing links.

You could also use a hybrid approach where you maintain a sorted list containing your dictionary and use the directed graph as an index into your dictionary. Then you just look up your prefix in your directed graph and then go to that point in your dictionary and spit out all words matching your search criteria.
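A rough sketch of that hybrid idea, assuming the dictionary is already a sorted Python list: a small index maps each one- and two-letter prefix to its slice of the list, and only that slice is scanned (the names and tiny word list here are illustrative):

words = sorted(('abdicate', 'abdomen', 'abduct', 'able', 'baker'))

# index: prefix (1 or 2 chars) -> [first index, one past the last index] in the sorted list
index = {}
for i, w in enumerate(words):
    for k in (1, 2):
        span = index.setdefault(w[:k], [i, i + 1])
        span[0] = min(span[0], i)
        span[1] = max(span[1], i + 1)

def lookup(prefix):
    # fall back to scanning the whole list if the prefix is not in the index
    start, end = index.get(prefix[:2], (0, len(words)))
    return [w for w in words[start:end] if w.startswith(prefix)]

print(lookup('abd'))  # ['abdicate', 'abdomen', 'abduct']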

egrep `read input && echo ^$input` /usr/share/dict/words

Oh, I didn't see the Python edit; here is the same thing in Python:

my_input = raw_input("Enter beginning of word: ")
# strip the trailing newline that readlines() leaves on each word
my_words = [w.strip() for w in open("/usr/share/dict/words").readlines()]
my_found_words = [x for x in my_words if x.startswith(my_input)]

If you really want speed, use a trie/automaton. However, here is something that will be faster than simply scanning the whole list, given that the list of words is sorted:

from itertools import takewhile, islice
import bisect

def prefixes(words, pfx):
    # binary-search to the first candidate, then take words while they still share the prefix
    start = bisect.bisect_right(words, pfx)
    return list(takewhile(lambda x: x.startswith(pfx),
                          islice(words, start, len(words))))
Note that an automaton is O(1) with regard to the size of your dictionary, while this algorithm is O(log m) for the binary search plus O(n) in the number of strings that actually start with the prefix, whereas the full scan is O(m), with n << m.

def main(script, name):
    for word in open("/usr/share/dict/words"):
        if word.startswith(name):
            print word,

if __name__ == "__main__":
    import sys
    main(*sys.argv)

If you really want to be efficient, use suffix trees or suffix arrays (see the Wikipedia article).

Your problem is what suffix trees were designed to handle. There is even an implementation for Python - here

var words = from word in dictionary
            where word.key.StartsWith("bla-bla-bla")
            select word;

Try using regex to search through your list of words, e.g. /^word/ and report all matches.
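For example, a quick sketch with Python's re module (the word list and variable names are only for illustration):

import re

words = ['abdicate', 'abdomen', 'abduct', 'able']
prefix = 'abd'

# re.escape guards against prefixes containing regex metacharacters
pattern = re.compile('^' + re.escape(prefix))
print([w for w in words if pattern.match(w)])  # ['abdicate', 'abdomen', 'abduct']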

If you need to be really fast, use a tree:

Build an array and split the words into 26 sets based on the first letter, then split each set into 26 based on the second letter, and so on.

So if your user types "abd", you would look up Array[0][1][3] and get a list of all the words starting like that. At that point your list should be small enough to pass over to the client and filter with JavaScript.
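A minimal sketch of that bucketing idea in Python, using a dict keyed by the first three letters instead of nested 26-wide arrays, and assuming the user's prefix is at least three letters long (as in the "abd" example):

from collections import defaultdict

# bucket the words by their first three letters, a dict-based analogue of Array[0][1][3]
buckets = defaultdict(list)
for w in ['abdicate', 'abdomen', 'abduct', 'able']:
    buckets[w[:3]].append(w)

prefix = 'abd'
candidates = buckets.get(prefix[:3], [])
print([w for w in candidates if w.startswith(prefix)])  # ['abdicate', 'abdomen', 'abduct']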

Most Pythonic solution

# set your list of words, whatever the source
words_list = ('cat', 'dog', 'banana')
# get the word from the user input
user_word = raw_input("Enter a word:\n")
# create a generator, so your output is flexible and stores almost nothing in memory
word_generator = (word for word in words_list if word.startswith(user_word))

# now you can do anything you want with it
# here we just list it:

for word in word_generator:
    print word

Remember that generators can only be used once, so turn it into a list (using list(word_generator)) or use the itertools.tee function if you expect to use it more than once.
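For instance, a tiny illustration of the tee option:

import itertools

word_generator = (w for w in ('cat', 'car', 'dog') if w.startswith('ca'))
first_pass, second_pass = itertools.tee(word_generator)
print(list(first_pass))   # ['cat', 'car']
print(list(second_pass))  # ['cat', 'car']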

Best way to do it:

Store the words in a database and use SQL to look up the words you need. If there are a lot of words in your dictionary, it will be much faster and more efficient.

Python has plenty of DB APIs to help you do the job ;-)

You can use str.startswith(). According to the official docs:

str.startswith(prefix[, start[, end]])

Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

like below:

user_input = input('Enter something: ')
for word in dictionary:
    if word.startswith(user_input):
        print(word)

If your dictionary is really big, I'd suggest indexing with a Python text index (PyLucene - note that I've never used the Python extension for Lucene). The search would be efficient, and you could even return a search 'score'.

Also, if your dictionary is relatively static you won't even have the overhead of re-indexing very often.

Don't use a bazooka to kill a fly. Use something simple like SQLite. There are all the tools you need for every modern language, and you can just do:

"SELECT word FROM dict WHERE word LIKE 'user_entry%'"

It's lightning fast and a baby could do it. What's more, it's portable, persistent, and easy to maintain.

Python tutorial:

http://www.initd.org/pub/software/pysqlite/doc/usage-guide.html
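A minimal sqlite3 sketch of that query, using a parameterized LIKE so the user's input is handled safely (the table name, column name, and sample words are just for illustration):

import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path instead for a persistent dictionary
conn.execute('CREATE TABLE dict (word TEXT)')
conn.executemany('INSERT INTO dict (word) VALUES (?)',
                 [('abdicate',), ('abdomen',), ('abduct',), ('able',)])

user_entry = 'abd'
rows = conn.execute('SELECT word FROM dict WHERE word LIKE ?', (user_entry + '%',))
print([row[0] for row in rows])  # ['abdicate', 'abdomen', 'abduct']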

A linear scan is slow, but a prefix tree is probably overkill. Keeping the words sorted and using a binary search is a fast and simple compromise.

import bisect
words = sorted(map(str.strip, open('/usr/share/dict/words')))
def lookup(prefix):
    # '~' sorts after all ASCII letters, so prefix + '~' is an upper bound for words sharing the prefix
    return words[bisect.bisect_left(words, prefix):bisect.bisect_right(words, prefix + '~')]

>>> lookup('abdicat')
['abdicate', 'abdication', 'abdicative', 'abdicator']

If you store the words in a .csv file, you can use pandas to solve this rather neatly, and after you have read it once you can reuse the already loaded data frame if the user should be able to perform more than one search per session.

import pandas as pd

# header=None so the first word is not consumed as a column name; column 0 then holds the words
df = pd.read_csv('dictionary.csv', header=None)
matching_words = df[0].loc[df[0].str.startswith(user_entry)]
Licensed under: CC-BY-SA with attribution