Question

To give a specific example:

  • You have a list of every first name in the USA.
  • You want to autosuggest completions in a GUI.

The obvious thing to do is use a radix tree to get a list of names for the given prefix. However, that doesn't take the frequency information into account. So, instead of the top 5 results simply being the first 5 in lexical order, I would like them to be the 5 most common names:

e.g. for the prefix dan:

 (5913, 'Daniel')
 (889, 'Danny')
 (820, 'Dana')
 (272, 'Dan')
 (60, 'Dane')

Is there a trie algorithm that I've missed? Ideally, the implementation (if one exists) would be in Python.

UPDATE: I'm generally happy with what Paddy3113 has proposed, though I will say that it blows up completely when I feed it the 2.6 GB file, which is one of the files I'm reducing. Looking into the details, the output gives some insight:

samz;Samzetta|Samzara|Samzie
samza;Samzara
samzar;Samzara
samzara;Samzara
samze;Samzetta
samzet;Samzetta
samzett;Samzetta
samzetta;Samzetta
samzi;Samzie
samzie;Samzie

# Format - PREFIX;"|".join(CHOICES).

We've got a few more days left on the bounty, so I'm still looking for the killer solution, since it's not just about the reduction but also about the lookup side of things.

Solution

Yes, we can use a trie. The most frequent names for a trie node are either (1) the name at that trie node or (2) a most frequent name for a child of the trie node. Here's some Python code to play with.

from collections import defaultdict


class trie:
    __slots__ = ('children', 'freq', 'name', 'top5')

    def __init__(self):
        self.children = defaultdict(trie)
        self.freq = 0
        self.name = None
        self.top5 = []

    def __getitem__(self, suffix):
        node = self
        for letter in suffix:
            node = node.children[letter]
        return node

    def computetop5(self):
        # A node's top 5 are the best of its own name (if any) and the
        # top-5 lists of all of its children.
        candidates = []
        for letter, child in self.children.items():
            child.computetop5()
            candidates.extend(child.top5)
        if self.name is not None:
            candidates.append((self.freq, self.name))
        candidates.sort(reverse=True)
        self.top5 = candidates[:5]

    def insert(self, freq, name):
        node = self[name]
        node.freq += freq
        node.name = name


root = trie()
with open('letter_s.txt') as f:
    for line in f:
        freq, name = line.split(None, 1)
        root.insert(int(freq.strip()), name.strip())
root.computetop5()
print(root['St'].top5)

OTHER TIPS

Without any idea about tuning, I'd start by assuming I have a list of names and their frequencies, then construct a dictionary mapping each prefix to the set of names with that prefix, and finally turn each set into a list of just the top 5 names w.r.t. frequency.

Using a list of just boys' names derived from here, massaged to create a text file where every line is an integer frequency of occurrence, some spaces, then a name, like this:

8427    OLIVER 
7031    JACK 
6862    HARRY 
5478    ALFIE 
5410    CHARLIE 
5307    THOMAS 
5256    WILLIAM 
5217    JOSHUA 
4542    GEORGE 
4351    JAMES 
4330    DANIEL 
4308    JACOB 
...

The following code constructs the dictionary:

from collections import defaultdict

MAX_SUGGEST = 5

def gen_autosuggest(name_freq_file_name):
    with open(name_freq_file_name) as f:
        name2freq = {}
        for nf in f:
            freq, name = nf.split()
            if name not in name2freq:
                name2freq[name] = int(freq)
    pre2suggest = defaultdict(list)
    for name, freq in sorted(name2freq.items(), key=lambda x: -x[1]):
        # in decreasing order of popularity
        for i, _ in enumerate(name, 1):
            prefix = name[:i]
            pre2suggest[prefix].append((name, name2freq[name]))
    # set max suggestions
    return {pre:namefs[:MAX_SUGGEST]
            for pre, namefs in pre2suggest.items()}

if __name__ == '__main__':
    pre2suggest = gen_autosuggest('2010boysnames_popularity_engwales2.txt')

If you give the dict your prefix, it will return your suggestions (together with their frequencies in this case, though those can be discarded if necessary):

>>> len(pre2suggest)
15303
>>> pre2suggest['OL']
[('OLIVER', 8427), ('OLLIE', 1130), ('OLLY', 556), ('OLIVIER', 175), ('OLIWIER', 103)]
>>> pre2suggest['OLI']
[('OLIVER', 8427), ('OLIVIER', 175), ('OLIWIER', 103), ('OLI', 23), ('OLIVER-JAMES', 16)]
>>> 

Look, no tries :-)

Time Killer

If it takes a long time to run, you might pre-compute the dict and save it to a file, then load the pre-computed values when needed using the pickle module:

>>> import pickle
>>> 
>>> savename = 'pre2suggest.pcl'
>>> with open(savename, 'wb') as f:
    pickle.dump(pre2suggest, f)


>>> # restore it
>>> with open(savename, 'rb') as f:
    p2s = pickle.load(f)


>>> p2s == pre2suggest
True
>>> 

You could essentially augment a trie implementation to store its children in popularity order instead of alphabetical order; that said, you'd also have to store the popularity in each node of the trie.

Here's an idea on how to do this:

Construct a string trie and store an integer with each node in the tree. This integer indicates the number of names that use that node. So you would increment the count at every node along a name's path when that name is inserted into the trie.

You can then determine the top names by greedily selecting the names with the highest values.

Formally it would be the same as any string trie construction algorithm but with an added step of incrementing integers.
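
To make that concrete, here is a rough, hypothetical sketch of such a counting trie. The CountingTrie name, the insert/suggest methods, and the best-first reading of "greedily selecting the highest values" are assumptions for illustration, not part of the answer above; each node's count is the sum of the frequencies of every name passing through it, so it never underestimates the best single name beneath that node:

import heapq
from collections import defaultdict


class CountingTrie:
    def __init__(self):
        self.children = defaultdict(CountingTrie)
        self.count = 0    # sum of frequencies of all names at or below this node
        self.freq = 0     # frequency of the name ending exactly here (0 if none)
        self.name = None

    def insert(self, name, freq):
        # Increment the counter at every node along the name's path.
        node = self
        node.count += freq
        for letter in name:
            node = node.children[letter]
            node.count += freq
        node.freq += freq
        node.name = name

    def find(self, prefix):
        node = self
        for letter in prefix:
            if letter not in node.children:
                return None
            node = node.children[letter]
        return node

    def suggest(self, prefix, k=5):
        # Best-first walk from the prefix node: popping the largest counts
        # first yields names in decreasing order of frequency.
        start = self.find(prefix)
        if start is None:
            return []
        results, tiebreak = [], 1
        heap = [(-start.count, 0, start)]
        while heap and len(results) < k:
            _, _, item = heapq.heappop(heap)
            if isinstance(item, tuple):          # an exact (freq, name) entry
                results.append(item[1])
                continue
            if item.name is not None:
                heapq.heappush(heap, (-item.freq, tiebreak, (item.freq, item.name)))
                tiebreak += 1
            for child in item.children.values():
                heapq.heappush(heap, (-child.count, tiebreak, child))
                tiebreak += 1
        return results


t = CountingTrie()
for freq, name in [(5913, 'Daniel'), (889, 'Danny'), (820, 'Dana'),
                   (272, 'Dan'), (60, 'Dane')]:
    t.insert(name, freq)
print(t.suggest('Dan'))  # ['Daniel', 'Danny', 'Dana', 'Dan', 'Dane']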

If you want quick lookups, the only real solution is to pre-compute the answers for any given prefix. This is fine in the case that the data doesn't change, but you need a way to keep your load time small.

I'd suggest using a DBM to store the pre-computed dictionary. This is basically a dictionary where the contents are stored on disk and looked up as you reference items. See http://docs.python.org/library/anydbm.html for details. The only downside is that values must be strings, so you'd need to store a comma-separated list of the top 5 entries, say, and split it on lookup.

This will have a much quicker start time than pickle, as the DB doesn't need to be loaded. It's also a lot simpler than using sqlite.
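
For illustration, here is a minimal sketch of that idea using Python 3's dbm module (the modern name for anydbm). The file name 'suggestions.db' and the reuse of pre2suggest from the earlier code are assumptions:

import dbm

# Build phase: store each prefix's suggestions as one comma-separated string.
# pre2suggest is assumed to be the prefix -> [(name, freq), ...] dict
# produced by gen_autosuggest() above.
with dbm.open('suggestions.db', 'c') as db:
    for prefix, namefs in pre2suggest.items():
        db[prefix] = ','.join(name for name, _ in namefs)

# Lookup phase: open the file and split the stored value on demand.
with dbm.open('suggestions.db', 'r') as db:
    raw = db.get('OL')                 # dbm values come back as bytes
    names = raw.decode().split(',') if raw else []
    print(names)                       # e.g. ['OLIVER', 'OLLIE', ...]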

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow