Question

I'm currently trying to solve the hard Challenge #151 on reddit with a unuasual method, a genetic algorithm.

In short, after seperating a string to consonants and vowels and removing spaces I need to put it together without knowing what character comes first.

hello world is seperated to hllwrld and eoo and needs to be put together again. One solution for example would be hlelworlod, but that doesn't make much sense. The exhaustive approach that takes all possible solutions works, but isn't feasible for longer problem sets.

What I already have

  • A database with the frequenzy of english words
  • An algorithm that constructs a relative cost database using Zipf's law and can consistently seperate words from sentences without spaces (borrowed from this question/answer
  • A method that puts consonants and vowels into a stack and randomly takes a character from either one and encodes this in a string that consists of 1 and 2, effectively encoding the construction in a gene. The correct gene for the example would be 1211212111
  • A method that mutates such a string, randomly swapping characters around

What I tried

Generating 500 random sequences, using the infer_spaces() method and evaluating fitness with the cost of all the words, taking the best 25% and mutate 4 new from those, works for small strings, but falls into local minima very often, especially for longer sequences. Hello World is found already in the first generation, thisisnotworkingverygood (which is correctly seperated and has a cost of 41.223) converges to th iss n ti wo or king v rye good (270 cost) already in the second generation.

What I need

Clearly, using the calculated cost as a evaluation method does only work for the separation of sentences that are grammatically correct, not for for this genetic algorithm. Do you have better ideas I could try? Or is another part of solution, for example the representation of the gene, the problem?

Was it helpful?

Solution

I would simplify the problem into two parts,

  1. Finding candidate words to split the string into (so hllwrld => hll wrld)
  2. How to then expand those words by adding vowels.

I would first take your dictionary of word frequencies, and process it to create a second list of words without vowels, along with a list of the possible vowel list for each collapsed word (and the associated frequency). You technically don't need a GA to solve this (and I think it would be easier to solve without one), but as you asked, I will provide 2 answers:

Without GA: you should be able to solve the first problem using a depth first search, matching substrings of the word against that dictionary, and doing so with the remaining word parts, only accepting partitions of the word into words (without vowels) where all words are in the second dictionary. Then you have to substitute in the vowels. Given that second dictionary, and the partition you already have, this should be easy. You can also use the list of vowels to further constrain the partitioning, as valid words in the partitions can only be made whole using vowels from the vowel list that is input into the algorithm. Starting at the left hand side of the string and iterating over all valid partitions in a depth first manner should solve this problem relatively quickly.

With GA: To solve this with a GA, I would create the dictionary of words without vowels. Then using the GA, create binary strings (as your chromosomes) of the same length as the input string of consonants, where a 1 = split a word at that position, and 0 = leave unchanged. These strings will all be the same length. Then create a fitness function that returns the proportion of words obtained after performing a split using the chromosome that are valid words without vowels, according to that dictionary. Create a second fitness function that takes the valid no-vowel words, and computes the proportion of overlap between the vowels missing in all these valid no-vowel words, and the original vowel list. Combine both fitness functions into one by multiplying the value from the first one by ten (assuming the second one returns a value between 0 and 1). That will force the algorithm to focus on the segmentation problem first and the vowel insertion problem second, and will also favor segmentations that are of the same quality, but preferring those that have a closer set of missing vowels to the original list. I would also include cross over in the solution. As all your chromosomes are the same length, this should be trivial. Once you have a solution that scores perfectly on the fitness function, then it should be trivial to recreate the original sentence given that dictionary of words without vowels (provided you maintain a second dictionary that list the possible missing vowel set for each non-vowel word - there could be multiple for each, as some vowel-less words will be the same with the vowels removed.

OTHER TIPS

Let's say you have several generations and you plot the cost for the best specimen in each generation (we consider long sentence). Does this graph go down or converges after 2-3 generations to a specific value (let the algorithm run for example for 10 generations)? Can you run your algorithm several times with various initial conditions (random sequences) and see whether you get good results sometimes or not?

Depending of the results, you may try the following (this graph is a really good tool to improve the performance):

1) If you have a graph that goes up and down too much all the time - you have too much mutation (average number of swaps per gene for example), try to decrease it.

2) If you stuck up in a local minimum (cost of the best specimen doesn't change much after some time) try to increase mutation or run several isolated populations (3-4) of let's say 100 species at the beginning of your algorithm for a few generations. Then select the best population (that's closer to global minimum) and try to improve it as much as possible through mutation

PS: By the way interesting problem, I tried to figure out on how to use crossover to improve the algorithm but haven't figured it out

The fitness function is the key to the success of GA algorithm ( Which I kind of agree is suitable here ).

I agree with @Simon that the vowel non-vowel separation is not that important. just trip your text corpus to remove the vowels.

what is important in the fitness:

  1. matched word frequency ( frequent words better )
  2. grammar - structure of the sentence ( which you might need to use NLTK to get related infomation )

and don't forget to update the end result ^^

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top