Checking if word segmentation is possible

https://stackoverflow.com/questions/10272417

02-06-2021
|

Question

This is a follow up question to this response and the pseudo-code algorithm that the user posted. I didn't comment on that question because of its age. I am only interested in validating whether or not a string can be split into words. The algorithm doesn't need to actually split the string. This is the response from the linked question:

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate

S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).

I'm translating this algorithm into simple python code, but I'm not sure if I'm understanding it properly. Code:

def is_all_words(a_string, dictionary)):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, str_len):
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S

I have two related questions. 1) Is this code a proper translation of the linked algorithm into Python, and if it is, 2) Now that I have S, how do I use it to tell if the string is only comprised of words? In this case, is_word is a function that simply looks a given word up in a list. I haven't implemented it as a trie yet.

UPDATE: After updating the code to include the suggested change, it doesn't work. This is the updated code:

def is_all_words(a_string, dictionary)):
    str_len = len(a_string)
    S = [False] * str_len
    S[0] = is_word(a_string[0], dictionary)
    for i in range(1, str_len):
        check = is_word(a_string[0:i], dictionary)
        if (check):
            S[i] = check
        else:
            for j in range(1, i): #THIS LINE WAS UPDATED
                check = (S[j - 1] and is_word(a_string[j:i]), dictionary)
                if (check):
                    S[i] == True
                    break
    return S

a_string = "carrotforever"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints FALSE

a_string = "hello"
S = is_all_words(a_string, dictionary)
print(S[len(S) - 1]) #prints TRUE

It should return True for both of these.

Solution

Here is a modified version of your code that should return good results. Notice that your mistake was simply in the translation from pseudocode array indexing (starting at 1) to python array indexing (starting at 0) therefore S[0] and S[1] where populated with the same value where S[L-1] was actually never computed. You can easily trace this mistake by printing the whole S values. You will find that S[3] is set true in the first example where it should be S[2] for the word "car". Also you could speed up the process by storing the index of composite words found so far, instead of testing each position.

def is_all_words(a_string, dictionary):
    str_len = len(a_string)
    S = [False] * (str_len)
# I replaced is_word function by a simple list lookup, 
# feel free to replace it with whatever function you use. 
# tries or suffix tree are best for this.
    S[0] = (a_string[0] in dictionary) 
    for i in range(1, str_len):
        check = a_string[0:i+1] in dictionary # i+1 instead of i
        if (check):
            S[i] = check
    else:
        for j in range(0,i+1): # i+1 instead of i
            if (S[j-1] and (a_string[j:i+1] in dictionary)): # i+1 instead of i
            S[i] = True
            break


    return S

a_string = "carrotforever"
S = is_all_words(a_string, ["a","car","carrot","for","eve","forever"])
print(S[len(a_string)-1]) #prints TRUE

a_string = "helloworld"
S = is_all_words(a_string, ["hello","world"])
print(S[len(a_string)-1]) #prints TRUE

OTHER TIPS

For a real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the recursive approach. By modifying the score function you can prioritize longer matches.

Installation is easy with pip:

$ pip install wordsegment

And segment returns a list of words:

>>> import wordsegment
>>> wordsegment.segment('carrotfever')
['carrot', 'forever']

1) at first glance, looks good. One thing: for j in range(1, str_len): should be for j in range(1, i): I think

2) if S[str_len-1]==true, then the whole string should consist of whole words only.

After all S[i] is true iff

the whole string from 0 to i consists of a single dictionary word
OR there exists a S[j-1]==true with j<i, and the string[j:i] is a single dictionaryword

so if S[str_len-1] is true, then the whole string is composed out of dictionary words

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow