Question

I have a large database for solving crossword puzzles, consisting of words and their descriptions. My application allows searching for words of a specific length with given characters at specific positions (currently done the hard way: go through all words and check each one), plus an optional search by description.

For instance, find the word _ _ A _ _ B (a 6-letter word with the third character A and the last character B).

I would like to index the words in such a way that searching is really fast. My first idea was to use a balanced tree structure; any other suggestions?

Solution

Okay, I am going to propose something a bit unusual. Coming from C++, I have been using Boost for a long time, and that is how I came across the MultiIndex library.

The idea of this library is to create a single collection that can be queried in many different ways. It could, in fact, model a database.

So, let's put our words in a table, and put the necessary indexes in place:

word      | length | c0 | c1 | c2 | ... | c26
----------|--------|----|----|----|-----|-----
Singapour | 9      | S  | i  | n  | ... | 0

Now the query will look like:

Select word From table Where length=9 And c2='n' And c8='r';

Easy enough, isn't it?

For maximum efficiency, the table should be partitioned on the length, and the indexes (one per cX column) should be local to each partition.

For an in-memory solution, you would have one container per word length, holding as many indexes as there are positions, each index being a hash table mapping to a sorted collection of words (which makes merging easier).

Here is a Python sketch:

import collections

class Dictionary:
    def __init__(self, length):
        self.length = length
        self.words = set()
        # Maps (position, character) -> set of words with that character there.
        self.indexes = collections.defaultdict(set)

    def add(self, word):
        if len(word) != self.length:
            raise ValueError('%s is not %d characters long' % (word, self.length))

        if word in self.words:
            raise ValueError(word + ' is already in the dictionary')

        self.words.add(word)

        for i in range(self.length):
            self.indexes[(i, word[i])].add(word)

    def search(self, constraints):
        """constraints: list of (position, character) tuples"""
        sets = [self.indexes[constraint] for constraint in constraints]
        # Intersect the smallest sets first to keep intermediate results small.
        sets.sort(key=len)
        return set.intersection(*sets)

I have deliberately parameterized the structure on word length, to keep the hash tables small and thus make searches faster. Also, the sets are sorted by size so that computing the intersection is cheaper :)
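
For instance, the query from the question, _ _ A _ _ B, would then look like this (a hypothetical usage sketch; the sample words and the 0-based positions are mine):

six = Dictionary(6)
for w in ('scarab', 'abacus', 'planes'):
    six.add(w)

# Third character 'a', last character 'b':
print(six.search([(2, 'a'), (5, 'b')]))  # {'scarab'}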

Go ahead and test it against other solutions if you wish :)

OTHER TIPS

This question: Good algorithm and data structure for looking up words with missing letters? started out exactly like the one you're asking, but was then edited into something rather different and easier. Still, you can find some ideas there.

In short, everyone recommends loading the whole dictionary into memory and dividing the words into groups based on their length. From there, you can go many different directions. The more memory you are willing to use up, the faster you can go.

One nice suggestion is to keep a hash table of lists of words of a given length that have a given letter in a given position. You can build it like this (in Python):

import collections

# Build a whole lot of sorted word lists, keyed by (length, position, letter).
# all_words is assumed to be your complete word list.
wordlists = collections.defaultdict(list)
for word in sorted(all_words):
    for position, letter in enumerate(word):
        wordlists[len(word), position, letter].append(word)

Now if you need a 6-letter word ending in B, you can just ask for wordlists[6, 5, 'B'] and you've got the complete list. When you know more than one letter, as in ..A..B, you can pick whichever list is shortest and test each word against the desired pattern. My computer's dictionary only has 21 six-letter words ending with B, of which only SCARAB matches.
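
A lookup along those lines might be sketched as follows (assuming the wordlists dict built above, a pattern that uses '.' for unknown positions, and at least one known letter; the find function is mine):

def find(pattern, wordlists):
    """pattern: e.g. '..A..B', with '.' for unknown letters."""
    known = [(i, c) for i, c in enumerate(pattern) if c != '.']
    # Pick the shortest candidate list among the known letters...
    candidates = min((wordlists[len(pattern), i, c] for i, c in known),
                     key=len)
    # ...and test every candidate against all the constraints.
    return [w for w in candidates if all(w[i] == c for i, c in known)]

print(find('..A..B', wordlists))  # e.g. ['SCARAB'] with an uppercase word list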

Since you use a database, create a Suffixes table. For example:

  Suffix        | WordID | SN
  --------------+--------+----
  StackOverflow |     10 |  1
  tackOverflow  |     10 |  2
  ackOverflow   |     10 |  3
  ckOverflow    |     10 |  4
  kOverflow     |     10 |  5
  ...

With that table it's easy to get all words that contain a particular character at a specific position, like this:

SELECT WordID FROM suffixes
WHERE suffix >= 't' AND suffix < 'u' AND SN = 2

This returns all words that have 't' at position 2 (1-based).

Update: if you want to save space and sacrifice a bit of speed, you can use a suffix array.

You can store all the words in one long string (array) with a separator between them, e.g. '$', and build a suffix array holding pointers into its characters. Now, given a character c, you can find all of its occurrences rather fast. You still have to check whether each occurrence is at the right position within its word (by checking how far it is from the preceding '$').
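
A rough in-memory sketch of that idea (the naive construction below is fine for dictionary-sized input; the sample words, helper names, and the '$' convention are illustrative):

import bisect

words = ['stack', 'overflow', 'crossword']
text = '$' + '$'.join(words) + '$'

# Naive suffix array: every index into text, sorted by the suffix it starts.
sa = sorted(range(len(text)), key=lambda i: text[i:])
firsts = [text[i] for i in sa]  # first character of each suffix, for bisect

def words_with(ch, position):
    """Words whose 0-based `position` holds `ch`."""
    lo, hi = bisect.bisect_left(firsts, ch), bisect.bisect_right(firsts, ch)
    hits = set()
    for i in (sa[k] for k in range(lo, hi)):
        start = text.rfind('$', 0, i) + 1  # distance from the preceding '$'
        if i - start == position:
            hits.add(text[start:text.find('$', i)])
    return hits

print(words_with('t', 1))  # {'stack'}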

With the above technique, the search will probably be about 10x faster than scanning every word as your original program does.

Update 2: I've used the database approach in one of my utilities where I needed to locate suffixes such as "ne", and I forgot to adjust (optimize) it for this specific problem.

Here, you can just store a single character as the suffix:

  Suffix | WordID | SN
  -------+--------+----
  S      |     10 |  1
  t      |     10 |  2
  a      |     10 |  3
  c      |     10 |  4
  k      |     10 |  5
  ...

which saves a lot of space. Now the query becomes:

SELECT WordID FROM suffixes
WHERE suffix = 't' AND SN = 2
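
In a real database you would also put an index on (Suffix, SN). Here is a minimal sqlite3 sketch of this layout (the words table and the sample data are mine):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (WordID INTEGER PRIMARY KEY, Word TEXT)')
conn.execute('CREATE TABLE suffixes (Suffix TEXT, WordID INTEGER, SN INTEGER)')
conn.execute('CREATE INDEX idx_suffix_sn ON suffixes (Suffix, SN)')

for word_id, word in enumerate(['stack', 'overflow'], start=1):
    conn.execute('INSERT INTO words VALUES (?, ?)', (word_id, word))
    for sn, ch in enumerate(word, start=1):  # SN is 1-based, as above
        conn.execute('INSERT INTO suffixes VALUES (?, ?, ?)', (ch, word_id, sn))

rows = conn.execute(
    'SELECT Word FROM words WHERE WordID IN '
    '(SELECT WordID FROM suffixes WHERE Suffix = ? AND SN = ?)', ('t', 2))
print([r[0] for r in rows])  # ['stack'] -- 't' at position 2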

You can use a Suffix Tree, or a Trie.

You could store your information in a trie of some sort (perhaps a ternary search tree). An algorithm for partial search using a trie is described in section 6 of this paper by Sedgewick and Bentley. You of course want to have different tries for the various lengths of words. The paper says that the partial search algorithm requires a time of O(n^((k-s)/k)) for s letters being specified in a trie of n k-length words.
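
For a flavor of the idea, here is a small partial-match search over a plain dict-based trie (not the ternary search tree from the paper, but the same recursion: branch on every child wherever the pattern letter is unknown; the names are mine):

def make_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word  # terminal marker stores the full word
    return root

def partial_match(node, pattern, out):
    """Append to out every word matching pattern; '.' matches any letter."""
    if not pattern:
        if '$' in node:
            out.append(node['$'])
        return
    head, rest = pattern[0], pattern[1:]
    children = node.items() if head == '.' else [(head, node.get(head))]
    for ch, child in children:
        if ch != '$' and child is not None:
            partial_match(child, rest, out)

trie = make_trie(['scarab', 'abacus', 'planes'])
hits = []
partial_match(trie, '..a..b', hits)
print(hits)  # ['scarab']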

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow