Essentially your problem is a tree problem, where at every level all the words that form a prefix of the tree form branches. The branch that leaves no part of the string is a correct solution.
thisisinsane
|
|
(this)isinsane
/ \
/ \
(this,i)sinsane (this,is)insane
/ / \
/ / \
(this,i,sin)ane (this,is,in)sane (this,is,insane)
/
/
(this,is,in,sane)
So in this example there are two solutions, but we want to select the solution using the longest words, that is we want to explore the tree from the right using a depth-first-search strategy.
So our algorithm should:
- Sort the dictionary by descending length.
- Find all prefixes of the current string. If there are none, return
False
. - Set
prefix
to the longest unexplored prefix. - Remove it from the string. If the string is empty, we found a solution, return a list of all prefixes.
- Recurse to 2.
- This branch has failed, return
False
.
A sample implementation of this solution:
def segment(string,wset):
"""Segments a string into words prefering longer words givens
a dictionary wset."""
# Sort wset in decreasing string order
wset.sort(key=len, reverse=True)
result = tokenize(string, wset, "")
if result:
result.pop() # Remove the empty string token
result.reverse() # Put the list into correct order
return result
else:
raise Exception("No possible segmentation!")
def tokenize(string, wset, token):
"""Returns either false if the string can't be segmented by
the current wset or a list of words that segment the string
in reverse order."""
# Are we done yet?
if string == "":
return [token]
# Find all possible prefixes
for pref in wset:
if string.startswith(pref):
res = tokenize(string.replace(pref, '', 1), wset, pref)
if res:
res.append(token)
return res
# Not possible
return False
print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as']) # share a sale
print segment("asignas", ['as', 'sign', 'a']) # a sign as