Split strings into Dictionary words

Question 1

I recommend a recursive approach. Something like this:

<?php
    $wordsToCheck = array(
        'otherword',
        'word1andother',
        'word1',
        'word1word2',
        'word1word3',
        'word1word2word3'
    );
    $wordList = array(
        'word1',
        'word2',
        'word3'
    );
    $results = array();

    function onlyListedWords($word, $wordList) {
        // The whole string is itself a listed word.
        if (in_array($word, $wordList)) {
            return true;
        }
        $length = strlen($word);
        // Try every proper prefix that is a listed word and recurse on the
        // remainder. If the remainder cannot be split, keep going and try a
        // longer prefix instead of giving up after the first one.
        for ($i = 1; $i < $length; $i++) {
            if (in_array(substr($word, 0, $i), $wordList)
                && onlyListedWords(substr($word, $i), $wordList)) {
                return true;
            }
        }
        // No split works (this also covers the empty string).
        return false;
    }

    foreach ($wordsToCheck as $word) {
        if (onlyListedWords($word, $wordList))
            $results[] = $word;
    }

    var_dump($results);
?>
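
The function above only answers yes or no, and `in_array` scans the whole list on every call. If the split itself is wanted, one possible variant is sketched below. It is not part of the original answer: the names `splitIntoListedWords`, `$wordSet` and `$memo` are made up for illustration, and it assumes the word list has been turned into array keys so that `isset` can be used for lookups.

```php
<?php
// Sketch: return one way to split $word into listed words, or null if none exists.
// $wordSet maps each listed word to true; $memo caches results per suffix.
function splitIntoListedWords(string $word, array $wordSet, array &$memo = []): ?array
{
    if ($word === '') {
        return [];                       // nothing left to split
    }
    if (array_key_exists($word, $memo)) {
        return $memo[$word];             // this suffix was already solved
    }
    $result = null;
    $length = strlen($word);
    for ($i = 1; $i <= $length; $i++) {
        $prefix = substr($word, 0, $i);
        if (!isset($wordSet[$prefix])) {
            continue;                    // prefix is not a listed word
        }
        $rest = splitIntoListedWords(substr($word, $i), $wordSet, $memo);
        if ($rest !== null) {
            $result = array_merge([$prefix], $rest);
            break;                       // one valid split is enough
        }
    }
    $memo[$word] = $result;              // remember failures (null) as well
    return $result;
}

$wordSet = array_fill_keys(['word1', 'word2', 'word3'], true);
var_dump(splitIntoListedWords('word1word2word3', $wordSet)); // word1, word2, word3
var_dump(splitIntoListedWords('word1andother', $wordSet));   // NULL
```

Caching the result for each suffix keeps the recursion from re-exploring the same remainder many times, which is what makes a plain backtracking search slow on long strings.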

Question 2

Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.

The beauty of this approach is that building the tree is a one-time cost; you can then reuse it for every string you test. The search function runs in O(n + m) time, where n is the length of the string and m is the number of matches found. It's really quite efficient.

Output from the search function will be a list of string matches, telling you which words match at what positions.

The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper by Aho and Corasick, which is quite approachable. See "Efficient String Matching: An Aid to Bibliographic Search".

So, for example, given your first string:

thissentencewasmadefromenglishwords

You would get (in part):

this, 0
his, 1
sent, 4
ten, 7
etc.
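
To make the output above concrete, here is a compact sketch of an Aho-Corasick matcher. It is an illustration, not one of the existing PHP implementations: the function names `buildAutomaton` and `findMatches` and the node layout are assumptions, the dictionary is a small hand-picked sample, and it works on bytes, so it assumes non-empty, single-byte (ASCII) words.

```php
<?php
// Build the automaton: a trie of the dictionary plus failure links.
// Each node is ['next' => [char => node id], 'fail' => node id, 'out' => [words]].
function buildAutomaton(array $words): array
{
    $nodes = [['next' => [], 'fail' => 0, 'out' => []]];   // node 0 is the root
    foreach ($words as $word) {
        $state = 0;
        for ($i = 0, $n = strlen($word); $i < $n; $i++) {
            $ch = $word[$i];
            if (!isset($nodes[$state]['next'][$ch])) {
                $nodes[] = ['next' => [], 'fail' => 0, 'out' => []];
                $nodes[$state]['next'][$ch] = count($nodes) - 1;
            }
            $state = $nodes[$state]['next'][$ch];
        }
        $nodes[$state]['out'][] = $word;                   // a word ends here
    }
    // Breadth-first pass; the root's children keep their failure link at 0.
    $queue = array_values($nodes[0]['next']);
    for ($head = 0; $head < count($queue); $head++) {
        $state = $queue[$head];
        foreach ($nodes[$state]['next'] as $ch => $child) {
            // Follow the parent's failure links until one can consume $ch.
            $fail = $nodes[$state]['fail'];
            while ($fail !== 0 && !isset($nodes[$fail]['next'][$ch])) {
                $fail = $nodes[$fail]['fail'];
            }
            $target = $nodes[$fail]['next'][$ch] ?? 0;
            $nodes[$child]['fail'] = $target;
            // Words ending at the failure target also end at this node.
            $nodes[$child]['out'] = array_merge($nodes[$child]['out'], $nodes[$target]['out']);
            $queue[] = $child;
        }
    }
    return $nodes;
}

// Scan $text once and return [word, start position] pairs.
function findMatches(array $nodes, string $text): array
{
    $matches = [];
    $state = 0;
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        $ch = $text[$i];
        while ($state !== 0 && !isset($nodes[$state]['next'][$ch])) {
            $state = $nodes[$state]['fail'];
        }
        $state = $nodes[$state]['next'][$ch] ?? 0;
        foreach ($nodes[$state]['out'] as $word) {
            $matches[] = [$word, $i - strlen($word) + 1];
        }
    }
    return $matches;
}

$automaton = buildAutomaton(['this', 'his', 'sent', 'ten']);
foreach (findMatches($automaton, 'thissentencewasmadefromenglishwords') as [$word, $pos]) {
    echo "$word, $pos\n";   // this, 0 / his, 1 / sent, 4 / ten, 7
}
```

The automaton is built once and can be reused for every string that needs checking; only `findMatches` runs per string.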

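Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite: the matcher reports each match when it reaches the match's last character, so the list comes out ordered by where matches end rather than where they start.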

Once the list is sorted by position, first make sure there is a match at position 0. If there is not, the string fails the test. If there is, add the length of the matched word to its position and check whether there is a match starting at that new position. Keep adding lengths and checking the next position until you either land exactly on the end of the string (the string passes) or reach a position with no match. There may be several matches at the same position, so when one choice leads to a dead end you have to back up and try the others before declaring failure.

If the strings you're testing aren't very long, then you can use a brute-force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so each entry has to hold a list of matches rather than a single one. But looking to see if there is a match at a position would be very fast.
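
One way to sketch the position-indexed lookup is shown below. The function name `matchesCoverString` is made up for illustration, and the input is assumed to be a list of `[word, position]` pairs such as a multi-pattern matcher would produce. Instead of following one chain at a time, it marks every position that can be reached by whole matches, which handles multiple matches per position without backtracking.

```php
<?php
// Sketch: decide whether matches can be chained from position 0 to the end.
// $matches is a list of [word, start position] pairs; $length is the string length.
function matchesCoverString(array $matches, int $length): bool
{
    // Hash map: start position => lengths of all matches starting there.
    $byPosition = [];
    foreach ($matches as [$word, $position]) {
        $byPosition[$position][] = strlen($word);
    }
    // $reachable[$p] is true when characters 0..$p-1 are exactly covered by matches.
    $reachable = array_fill(0, $length + 1, false);
    $reachable[0] = true;
    for ($p = 0; $p < $length; $p++) {
        if (!$reachable[$p]) {
            continue;                        // no chain of matches ends here
        }
        foreach ($byPosition[$p] ?? [] as $matchLength) {
            $reachable[$p + $matchLength] = true;
        }
    }
    return $reachable[$length];
}

// 'thissentence' is 12 characters long: this (0-3) followed by sentence (4-11).
$matches = [['this', 0], ['his', 1], ['sent', 4], ['sentence', 4], ['ten', 7]];
var_dump(matchesCoverString($matches, 12)); // bool(true)
```

Because the map is keyed by position, the matches do not need to be sorted first.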

It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are PHP implementations available. How well they work, I don't know.

In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that exhibit pathologically poor runtimes, but you'd have to try really hard. And even a pathological case isn't going to take terribly long if the string is short.

Question 3

This problem can be solved with dynamic programming, based on the following recurrence, where f(i) is true when the first i characters of the string s can be split into dictionary words:

f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j, i)) } for each j = 1,...,i

First, load your word file into a dictionary (a hash set, so lookups are fast), then apply the recurrence above to each string you want to test.

Pseudocode is something like this (loop bounds are inclusive, and substring(a, b) takes the characters from index a up to, but not including, index b):

check(word):
   f = new boolean[word.length() + 1]
   f[0] = true
   for i from 1 to word.length():
      f[i] = false
      for j from 1 to i:
          if f[i-j] AND dictionary.contains(word.substring(i-j, i)):
             f[i] = true
   return f[word.length()]
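
For reference, the same dynamic programming check could look roughly like this in PHP. This is a sketch under a few assumptions: the name `canSplit` is made up, the dictionary is stored as array keys so that `isset` gives a fast lookup, and the sample words are borrowed from the first answer.

```php
<?php
// Sketch: true when $word can be split entirely into dictionary words.
// $dictionary maps each known word to true.
function canSplit(string $word, array $dictionary): bool
{
    $n = strlen($word);
    // $f[$i] is true when the first $i characters can be split into words.
    $f = array_fill(0, $n + 1, false);
    $f[0] = true;                            // the empty prefix needs no words
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $i; $j++) {
            // Last word candidate: the $j characters ending at index $i.
            if ($f[$i - $j] && isset($dictionary[substr($word, $i - $j, $j)])) {
                $f[$i] = true;
                break;                       // one valid split is enough
            }
        }
    }
    return $f[$n];
}

$dictionary = array_fill_keys(['word1', 'word2', 'word3'], true);
$wordsToCheck = ['otherword', 'word1andother', 'word1', 'word1word2', 'word1word3', 'word1word2word3'];
$results = array_values(array_filter($wordsToCheck, function ($w) use ($dictionary) {
    return canSplit($w, $dictionary);
}));
var_dump($results); // word1, word1word2, word1word3, word1word2word3
```

Note that PHP's `substr` takes a start and a length, not a start and an end, so `substring(i-j, i)` from the pseudocode becomes `substr($word, $i - $j, $j)`.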