Question

I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.

Example:

thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz

Output:

thissentencewasmadefromenglishwords
pure

a.txt

contains the dictionary words

b.txt

contains the strings: one in every line, without spaces made from a..z chars only
Was it helpful?

Solution 3

I recommend a recursive approach. Something like this:

<?php
    $wordsToCheck = array(
        'otherword',
        'word1andother',
        'word1',
        'word1word2',
        'word1word3',
        'word1word2word3'
    );
    $wordList = array(
        'word1',
        'word2',
        'word3'
    );
    $results = array();

    function onlyListedWords($word, $wordList) {
        if (in_array($word, $wordList)) {
            return true;
        } else {
            $length = strlen($word);
            $wordTemp = $word;
            $part = '';
            for ($i=0; $i < $length; $i++) { 
                $part .= $wordTemp[$i];
                if (in_array($part, $wordList)) {
                    if ($i == $length - 1) {
                        return true;
                    } else {
                        $wordTemp = substr($wordTemp, $i + 1);
                        return onlyListedWords($wordTemp, $wordList);
                    }
                }
            }
        }
    }

    foreach ($wordsToCheck as $word) {
        if (onlyListedWords($word, $wordList))
            $results[] = $word;
    }

    var_dump($results);
?>

OTHER TIPS

Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.

The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.

Output from the search function will be a list of string matches, telling you which words match at what positions.

The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.

So, for example, given your first string:

thissentencewasmadefromenglishwords

You would get (in part):

this, 0
his, 1
sent, 4
ten, 7
etc.

Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.

Once the list is sorted by position, the first thing you do is make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), you take the length of the matched string and see if there's a string match at that position. Add that match's length and see if there's a match at the next position, etc.

If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.

It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are php implementations available. How well they work, I don't know.

In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.

This is a problem that can be solved using Dynamic Programming, based on the next formulas:

f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contais(s.substring(i-j,i) } for each j=1,...,i

First, load your file into a dictionary, then use the DP solution for the above formula.

Pseudo code is something like: (Hope I have no "off by one" for indices..)

check(word):
   f = new boolean[word.length() + 1)
   f[0] = true
   for i from 1 to word.length() + 1:
      f[i] = false
      for j from 1 to i-1:
          if dictionary.contains(word.substring(j-1,i-1)) AND f[j]:
             f[i] = true
   return f[word.length()
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top