Question

How can I search any given txt file for anagrams and display the anagrams for every word in that file.

So far I can read the file, extract every single word and alphabetically sort every single word. I've tried making two dicts one dict containing the actual words in the text file as keys and the alphabetically sorted version of the words as values, and another dict of the dictionary file I have that is set up the same way.

Using both these dictionaries I've been unable to find an efficient way to get the following output for every word in the input list:

'eerst':  steer reste trees

If I try to loop through all the words in the given list, and inside each loop, loop inside the dictionary, looking and recording the anagrams, it takes too much time and is very inefficient. If I try the following:

for x in input_list:
    if x in dictionary:
        print dictionary[x]

I only get the first anagram of every word and nothing else. If that made any sense, any suggestions would be immensely helpful.

Was it helpful?

Solution

I'm not sure if what I'm thinking of is what you're currently doing in your code, but I can't think of anything better:

from collections import defaultdict

words = 'dog god steer reste trees dog fred steer'.split() # or words from a file

unique_words = set(words)
anagram_dict = defaultdict(list)
for word in unique_words:
    key = "".join(sorted(word))
    anagram_dict[key].append(word)

for anagram_list in anagram_dict.values():
    if len(anagram_list) > 1:
        print(*anagram_list)

This will print (in arbitrary order):

god dog
steer trees reste

If you wanted to get the dictionary key value, you could make the final loop be over the items rather than the values of anagram_dict (and you could print out words that don't have any anagrams like 'fred' in the example above, if you wanted). Note that thanks to the set, duplicates of words are not sorted multiple times.

Running time should be O(M + U*N*log(N)) where M is the number of words, U is the number of unique ones and N is their average length. Unless you're sorting an organic chemistry textbook or something else that has lots of long words, it should be pretty close to linear in the length of the input.

OTHER TIPS

Here is another way to get anagrams using itertools.groupby

from itertools import groupby
words = list_of_words

for k, g in groupby(sorted(words, key=sorted), key=sorted):
    g = list(g)
    if len(g) > 1:
        print(g)

The big-O complexity isn't quite as good as the usual dictionary of lists approach, but it's still fairly efficient and it sounds funny when you read it out loud

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top