How to do a character search in any order (12 letters from which 6 should form a word) with PHP?

https://stackoverflow.com/questions/9725792

24-05-2021
|

Question

I am thinking about this all day and can't seem to figure out an memory efficient and speedy way. The problem is:

for example, I have these letters: e f j l n r r t t u w x (12 letters)

I am looking for this word TURTLE (6 letters)

How do I find all the possible words in the full range (12 words) with php? ( Or with python, if that might be a lot easier? )

Things I've tried:

Using permutations: I have made all strings possible using a permutation algorithm, put them in array (only the ones 6 chars long) and do an in_array to check if it matches one of the words in my array with valid words (in this case, containing TURTLE, but sometimes two or three words). This calculating costs a lot of memory and time, especially with 6+ characters to get permutations of.
creating a regex (I am bad at this). I wanted to create a regex to check if 6 of the 12 (input) characters are in a word from the "valid array". problem is, we don't know what letter from the 12 will be the starting position and the position of the other words.

An example of this would be: http://drawsomethingwords.net/

I hope you can help me with this problem, as I would really like to fix this. Thanks for all of your time :)

Solution

I've encountered similar problems when writing a crossword editor (e.g., find all words of length 5 with a 'B' in second position). Basically it comes down to:

Process a word list and organize words by length (i.e., a list of all words of length 2, length 3, length 4, etc). The reason is that you often know the length of the word(s) that you wish to search for. If you want to search for words of unknown length, you can repeat a search again for a different word list.
Insert each separate word list into a tertiary search tree which makes searching for words a lot faster. Each node in the tree contains a character and you can descend the tree to search for words. There are also specialized data structures such as a trie but I have not (yet) explored.

Now for your problem, you could use the search tree to write a search function such as

function findWords($tree, $letters) {
   // ...
}

where tree is the search tree containing the words of the length that you wish to search for and letters is a list of valid characters. In your example, letters would be the string efjlnrrttuwx.

The search tree allows you to search for words, one character at a time, and you can keep track of characters that you have encountered so far. As long as these characters are in the list of valid letters, you keep searching. Once you've encountered a leaf node in the search tree, you have found an existing word which you can add to the result. If you encounter a character which is not in letters (or it has already been used), you can skip that word and continue the search elsewhere in the search tree.

My crossword editor Palabra contains an implementation of the above steps (a part is done in Python but mostly in C). It works fast enough for Ubuntu's default word list containing roughly 70K words.

OTHER TIPS

There are probably better ways, but this is just off the top of my head:

I assume you have a database of words (i.e. dictionary). Add fields a-z to the database table. Write a script that sums up the count of each letter in the word and writes them in the a-z fields as an integer. I.E. for balloon, the table would look like:

id    name       a    b  ...  l  ...  n  ...  o
1     balloon    1    1       2  ...  1  ...  2

Then when the user enters a word, you calculate how many of each character are in that word and match that up with the database.

// User enters 'zqlamonrlob'
// You count the letters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 0 0 0 0 0 0 0 0 0 2 1 1 2 0 1 1 0 0 0 0 0 0 0 1

// Query the database
$sql = "SELECT `name` FROM `my_table` WHERE `a` <= {$count['a'] AND `b` <= {$count['b'] ...}";

That would get you a list of words that use some or all of the letters that the user entered.

Here's a regex, just to show it can (but not necessarily should) be done:

preg_match('/^(?:t()|u()|r()|t()|l()|e()|.)+$\1\2\3\4\5\6/i', 'efjlnrrttuwx')

matches.

How does it work? Empty capturing parentheses always match if the preceding letter matches. The backreferences at the end of the regex make sure that each of the characters has participated in the match. Therefore,

preg_match('/^(?:t()|u()|r()|t()|l()|e()|.)+$\1\2\3\4\5\6/i', 'efjlnrrtuwx')

(correctly) will not match because there is only one t in the string but the regex needs two different ts.

The problem is that of course the regex engine has to check many permutations to arrive at this conclusion. While a successful match may be quite fast (175 steps of the regex engine in the first case), an unsuccessful match attempt can be expensive (3816 steps in the second case).

I think you need to approach this problem from the opposite direction.

Loop through your list of words, testing the words with the specified number of characters, to see if the word characters are in the specified character set.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow