How do you encode the genes when you don't know the length?

https://softwareengineering.stackexchange.com/questions/339509

genetic-algorithms

05-01-2021
|

Question

I think I've got the hang of writing a GA when you know the number of genes in a chromosome. For example, if you're searching for a string, and you know the length, you can just generate your initial population of random strings, then breed and mutate until you (hopefully) converge on a solution. Similarly, in the travelling salesman probelm, you know how many cities there are, and so are only involved with changing the order.

However, you don't always know the number of inputs. Say you want to solve the problem of given the digits 0 through 9 and the operators +, -, * and /, find a sequence that will represent a given target number. The operators will be applied sequentially from left to right as you read (taken from this page). You don't know in advance how long the answer will be, it could be as simple as a single digit, or a complex string of additions, multiplications, etc. For that matter, any given target will have multiple representations (eg 8 * 3 is the same as 2 * 2 * 6 which is the same as 2 * 4 + 4 * 2 and so on).

How would you write a GA for this? You don't know how long to make the gene string in the initial population. You could generate strings of varying length, but then how do you know how far to go? Maybe a good solution would be just one character longer?

I wondered about introducing an extra character into the vocabulary to represent a null place, so the solution "8 * 3" would be represented as "8 * 3 null null null..." but at least two immediate problems with this are a) you still need to pick a maximum length, and b) You would penalise shorter solutions, as they would only be found if you hit a long string of nulls at the end.

Please can anyone explain how you would approach such a probelm.

Solution

Well, after spending some time thinking more about it, I realised that the answer is, in principle at least, not that hard. Apologies in advance that this is going to be a bit long, but it's not that complex.

Generating the initial population

For each chromosome to be created, pick a random number for the length of the gene string. This would be somewhere between 1 (as an empty string wouldn't be very useful) and an arbitrarily chosen maximum (see below for why that turned out not to be a problem after all).

In order to make sure your gene strings can be decoded, you then multiply this by the number of bits required to encode one character.

Thus, your initial population will have a randomly distributed gene string length, each of which can be decoded into a valid string.

Breeding new chromosomes

When crossing two gene strings, pick two crossover points, one for each string. This has the advantage of allowing new lengths to appear.

As we now allow baby chromosomes to have a longer gene string length than either of their parents, the fitness function will ensure that if the target string has a longer length than any of the current chromosomes, then any baby chromosomes who have a longer gene string length than their parents will be favoured, and the average gene string length will grow. This does mean that if the gene string limit imposed when creating the initial population was significantly lower than the target string's length then the algorithm will take longer to converge, as it will spend more time working its way up to the right string length before working on the right characters, but it should still get there in the end.

My initial thought was that mutation would occur as normal, although see my comments right at the end for what happened.

Modifying the fitness function

In order to encourage the GA to converge to the right string length, you add a weighting factor to the fitness function that disadvantages strings of the wrong length. For example, if the regular fitness function for this problem would simply count the number of incorrect bits in the gene string, then you can penalise incorrect string lengths by adding on the difference between the current gene string and the target gene string.

If you generate an initial population of several times the population size used for breeding, then when you pick the best n chromosomes from the initial population, you are more likely to get chromosomes whose gene strings are in the right length region.

The added weight of incorrect length encourages the GA to produce chromosomes with the right gene string length over producing ones with more correct characters. This gets the population to the right length fairly quickly, after which the algorithm spends more time modifying characters to find the right solution.

I tried this out, and it seemed to work very well. For short strings, it converged about as quickly as the fixed-length version, mainly because the decision to produce a large initial population and only pass the best ones to the breeding function meant that the majority of the first generation chromosomes had a gene string length close to the target.

OK, so it wasn't quite that easy!

The only problem I hit is that when I tried it for longer target strings, I often ran into the problem that the crossover produces gene strings whose length is not a multiple of the number of bits required to encode one character (5 in my case). This gave an error when trying to decode the genes. I'm not sure why this wasn't an error with shorter strings.

If I insisted that crossover can only take place at multiples of 5, then I lost the benefit of binary comparison (see the first comment by amon to my question here). It got to within one character of the target string very quickly, then failed to find the final solution at all.

I tried allowing crossover at any point, and then if the length of the new gene string wasn't a multiple of 5, adding some random bits to make it up to the right length, but I hit the problem of getting quick convergence to a sequence that had the right n-1 characters, but that was one character short of the target length.

Finally, I hit upon the simple idea of allowing mutation to increase the gene string length as well. Although convergence wasn't fast, it found the solution every time with this modification. Here are some sample strings, with the average number of generations needed to converge...

____________________________________________________________
| "banana"                                         |    30 |
| "squashy bananas"                                |   120 |
| "i don't like squashy bananas"                   |  1800 |
| "i really don't like squashy bananas very much!" | 20000 |
------------------------------------------------------------

As you can see, the number of generations needed to converge increases pretty rapidly as the string length increases. Not sure how I can improve on that.

The modified code

In case it helps, here are the modified functions. First the Breed() function with the associated MutateGenes() function...

private static List<Chromosome> Breed(List<Chromosome> population, double crossover,
                                           double mutationRate) {
  List<Chromosome> nextGeneration = new List<Chromosome>();
  for (int nChromosome = 0; nChromosome < population.Count() / 2; nChromosome++) {
    Chromosome daddy = Roulette(population);
    Chromosome mummy = Roulette(population);
    int crossoverGeneDaddy = (int)(daddy.Genes.Length * crossover);
    int crossoverGeneMummy = (int)(mummy.Genes.Length * crossover);
    string babyGenes = daddy.Genes.Substring(0, crossoverGeneDaddy) 
                     + mummy.Genes.Substring(crossoverGeneMummy);
    // If the crossover point isn't at a multiple of 5, we will end up with gene strings
    // that can't be decoded into characters. Need to fix that
    while (babyGenes.Length % 5 != 0) {
      // Increase the gene length to be a multiple of 5.
      babyGenes += RandomBit();
    }
    string mutatedGenes = MutateGenes(babyGenes, mutationRate);
    Chromosome baby = new Chromosome(mutatedGenes);
    baby.Fitness = Fitness(baby);
    nextGeneration.Add(baby);
  }
  nextGeneration = nextGeneration
    .Union(population)
    .OrderBy(p => p.Fitness)
    .Take(population.Count())
    .ToList();
  return nextGeneration;
}

private static string MutateGenes(string babyGenes, double mutationRate) {
  string mutatedGenes = "";
  foreach (char gene in babyGenes) {
    mutatedGenes += P() < mutationRate ? (gene == '1' ? '0' : '1') : gene;
  }
  if (P() < mutationRate) {
    for (int i = 0; i < 5; i++) {
      mutatedGenes += RandomBit();
    }
  }
  return mutatedGenes;
}

As before, P() returns a random number between 0 and 1, Roulette() picks a chromosome based on fitness, and RandomBit() returns either "0" or "1" at random.

The fitness function now looks like this...

 private static int FitnessByBit(Chromosome c) {
   int fitness = 0;
   int length = Math.Min(c.Genes.Length, encodedTarget.Length);
   for (int i = 0; i < length; i++) {
     if (c.Genes[i] != encodedTarget[i]) {
       fitness++;
     }
   }
   // This line is the addition that includes the incorrect
   // string length as a weighting factor
   fitness += Math.Abs(c.Genes.Length - encodedTarget.Length);
   return fitness;
 }

I hope this is useful to someone.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange