Text segmentation: Algorithm to match input with the longest words from the dictionary

Question 1

Essentially your problem is a tree problem, where at every level all the words that form a prefix of the tree form branches. The branch that leaves no part of the string is a correct solution.

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /
                               /
                       (this,is,in,sane)

So in this example there are two solutions, but we want to select the solution using the longest words, that is we want to explore the tree from the right using a depth-first-search strategy.

So our algorithm should:

Sort the dictionary by descending length.
Find all prefixes of the current string. If there are none, return False.
Set prefix to the longest unexplored prefix.
Remove it from the string. If the string is empty, we found a solution, return a list of all prefixes.
Recurse to 2.
This branch has failed, return False.

A sample implementation of this solution:

def segment(string,wset):
    """Segments a string into words prefering longer words givens
    a dictionary wset."""
    # Sort wset in decreasing string order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop() # Remove the empty string token
        result.reverse() # Put the list into correct order
        return result
    else:
        raise Exception("No possible segmentation!")

def tokenize(string, wset, token):
    """Returns either false if the string can't be segmented by 
    the current wset or a list of words that segment the string
    in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Find all possible prefixes
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string.replace(pref, '', 1), wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False

print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
print segment("asignas", ['as', 'sign', 'a'])                 # a sign as

Question 2

Just do a recursive scan, each time adding a single letter to the last word (if it's in the dictionary), and also trying letting it start a new word. This means that at each call, you have either 1 or 2 hypotheses to test (whether there's a space or not). When you reach the end of the input, and you have a valid set of words, save this solution, if the first word in it is longer than the best solution you've found so far.

Example code:

words=['share','as','a','sale','bla','other','any','sha','sh']
wdict={}
best=[]

def scan(input,i,prevarr):
    global best
    arr=list(prevarr)
    # If array is empty, we automatically add first letter
    if len(arr)<1:
        arr.append(input[0:1])
        return scan(input,i+1,arr)
    # If no more input is available, evaluate the solution
    if i>=len(input):
        # Is the last word a valid word
        if wdict.has_key(arr[-1]):
            # Is there a current best solution?
            if len(best)==0:
                best=arr      # No current solution so select this one
            elif len(arr[0])>len(best[0]):
                best=arr      # If new solution has a longer first word
        return best
    # If the last word in the sequence is a valid word, we can add a space and try
    if wdict.has_key(arr[-1]):
        arr.append(input[i:i+1])
        scan(input,i+1,arr)
        del arr[-1]
    # Add a letter to the last word and recurse
    arr[-1]=arr[-1]+input[i:i+1]
    return scan(input,i+1,arr)

def main():
    for w in words:
        wdict[w]=True
    res=scan('shareasasale',0,[])
    print res

if __name__ == '__main__':
    main()

Question 3

The algorithm I would use would be something like:

Have your dictionary sorted by decreasing length
Build a function prefix(s) which yields, in order, dictionary words which start s
- ie. prefix("informativecow") yields first "informative", then "inform", then "info", then "in"
Build a recursive function test(s) which:
- if s is the empty string, returns True
- for each prefix, remove the prefix from s and call test() with that string. If you find one that's true, return true. If you don't find any that are true, return False.

Question 4

For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the memoization approach.

In particular, score illustrates the segmentation approach:

def score(word, prev=None):
    "Score a `word` in the context of the previous word, `prev`."

    if prev is None:
        if word in unigram_counts:

            # Probability of the given word.

            return unigram_counts[word] / 1024908267229.0
        else:
            # Penalize words not found in the unigrams according
            # to their length, a crucial heuristic.

            return 10.0 / (1024908267229.0 * 10 ** len(word))
    else:
        bigram = '{0} {1}'.format(prev, word)

        if bigram in bigram_counts and prev in unigram_counts:

            # Conditional probability of the word given the previous
            # word. The technical name is *stupid backoff* and it's
            # not a probability distribution but it works well in
            # practice.

            return bigram_counts[bigram] / 1024908267229.0 / score(prev)
        else:
            # Fall back to using the unigram probability.

            return score(word)

If you replaced the score function so that it returned higher scores for longer matches in your dictionary then it would do as you want.