Question

When encoding our chromosome's characteristics (for want of a better word), binary seems to be the favoured method. I understand that this gives the maximum possibilities for crossover and mutation, but it also seems to have a serious limitation.

For example, suppose I am trying to solve the problem described here: given the digits 1 through 9 and the operators +, -, * and /, find a sequence that will represent a given target number. The operators are applied sequentially from left to right as you read. This requires the nine digits plus the four operators, giving 13 characters to be encoded. Thus, I need a binary representation of length 4, with a total of 16 possible binary strings.

Now, for a sequence to be valid in that problem, it would need to be of the form...

d o d o d ... o d

...where d means a digit and o means an operator. Suppose you are looking at a sequence of length 5 (eg 1 + 2 * 3). There are 9 binary representations that are valid for digits (ie probability 0.5625) and 4 that are valid for operators (probability 0.25). Thus, there is only a probability of 0.5625 * 0.25 * 0.5625 * 0.25 * 0.5625 = 0.011124 of a random binary string being a valid sequence. In other words, only about 1% of the strings will be valid.
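
Here is a quick Python sketch just to double-check that arithmetic:

    # With 4 bits per symbol, 9 of the 16 codes decode to a digit
    # and 4 of the 16 decode to an operator.
    p_digit = 9 / 16   # 0.5625
    p_op = 4 / 16      # 0.25

    # A valid length-5 sequence must match the pattern d o d o d.
    p_valid = p_digit * p_op * p_digit * p_op * p_digit
    print(p_valid)     # 0.0111236572265625, i.e. about 1.1%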

This seems hugely inefficient. Crossover and mutation are very likely to invalidate existing valid strings, so I don't see how the GA would ever converge.

Related to this is the question of how to handle invalid binary strings. Suppose you've crossed and mutated, and you end up with an invalid string. Do you just assign a huge fitness value, so it will be discarded as soon as possible, or do you throw it away and try to find a valid child chromosome? The former sounds inefficient, as you would have very few valid chromosomes in your population, and the latter sounds just as inefficient, as you would spend ages searching for valid binary strings.
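
To make the first option concrete, here is a minimal Python sketch of what I mean (decode and evaluate are hypothetical placeholders for the decoding and scoring steps):

    def fitness(bits, decode, evaluate):
        # Penalty approach: invalid chromosomes stay in the population
        # but score so badly that selection removes them quickly.
        # decode() is assumed to return None for an invalid bit string;
        # evaluate() is assumed to return the distance from the target
        # number (lower is better).
        expr = decode(bits)
        if expr is None:
            return float('inf')   # the "huge fitness value" option
        return evaluate(expr)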

Forgive me if this is a dumb question, but I'm still quite new to GAs, and am struggling to understand what you would do in a case like this.


Solution

Picking the right way to represent the genotype is quite important when using a genetic algorithm. There are many ways to do it, binary being one of them.

The reason you might think binary is the most used is that it is the simplest to implement and is often used in academic settings. In the real world, though, a lot of work goes into designing a proper genotype representation, precisely to solve the problems you are describing.

There is also some historical baggage. Binary can be made quite space-efficient, so it was a natural pick in the days when memory was hard to come by, which is pretty much when genetic algorithms were first explored and computers became more academically available. That is not really a problem now: you have access to gigabytes of memory, and the main bottleneck is usually the time it takes to calculate the fitness, not how much memory the genotype takes.

There are some more details on Wikipedia:

Finding a suitable representation of the problem domain for a chromosome is an important consideration, as a good representation will make the search easier by limiting the search space; similarly, a poorer representation will allow a larger search space. The mutation operator and crossover operator employed by the genetic algorithm must also take into account the chromosome's design.

OTHER TIPS

As others have said, binary is useful for efficient manipulation of the representation. It has also been pointed out that how you encode the representation is important.

To bring these two together, I'd like to offer a representation of your length-5 example that should give you a flavour of how the choice of representation affects efficiency.

So, we are looking for a binary representation for d o d o d where d is in (0-9) and o is in (+ - x /).

We start by noting that there are 4 operators in o, so we could encode each in 2 bits.

Secondly, we are looking for a Binary Coded Decimal (BCD) representation for d. We note that the sequence contains 3 decimal digits. If we encode these together, we have 10^3 = 1000 combinations, which can be encoded in 10 bits (2^10 = 1024). There are well-known BCD encodings for 3 digits, e.g. Chen-Ho encoding, that do indeed use 10 bits.

So, for this example representation, we could encode the sequence as 3 decimal digits using Chen-Ho plus two 2-bit operators, 14 bits in all. This gives an encoding efficiency of (1000/1024) × 1 × 1 ≈ 97.7%, somewhat better than the roughly 1% of the candidate encoding.
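
To illustrate, here is a minimal Python sketch of such a packing. It uses plain binary packing of the three digits into 10 bits rather than Chen-Ho proper, but it achieves the same 1000/1024 density:

    OPS = '+-*/'

    def encode(d1, o1, d2, o2, d3):
        # Pack 'd o d o d' into 14 bits: 10 bits hold the three digits
        # as a single integer 0-999, plus two 2-bit operator codes.
        digits = d1 * 100 + d2 * 10 + d3      # 0..999 fits in 10 bits
        return (digits << 4) | (o1 << 2) | o2

    def decode(genome):
        digits = genome >> 4
        if digits > 999:                      # only 24 of 1024 codes are invalid
            return None
        o1 = (genome >> 2) & 0b11
        o2 = genome & 0b11
        d1, d2, d3 = digits // 100, (digits // 10) % 10, digits % 10
        return f'{d1} {OPS[o1]} {d2} {OPS[o2]} {d3}'

    print(decode(encode(1, 0, 2, 2, 3)))      # prints '1 + 2 * 3'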

This encoding is not perfect, of course. In particular it doesn't trivially scale to arbitrary sequences. But hopefully it gives a sense of how the encoding representation can markedly affect the encoding efficiency.

We used ASCII to solve a number of problems in my AI class, so it's not like we always use binary.

Using binary encoding is just a cheap way to keep the odds of producing meaningless nonsense chromosomes to a minimum. If you have 15 symbols, 4 random bits have a 15/16 chance of being meaningful; 8 bits have only a 15/256 chance.

If you don't like having to decode the binary, that whole problem can be avoided by being a little smarter about what you randomize (mutate) in the first place. Do this right and you can work in ASCII almost as efficiently as binary.
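
One way to do that, sketched in Python, is to mutate each position only within its valid alphabet, so a d o d o d chromosome can never become invalid:

    import random

    DIGITS = '123456789'
    OPS = '+-*/'

    def mutate(chromosome, rate=0.1):
        # Even positions hold digits, odd positions hold operators,
        # so a mutated symbol is always drawn from the right alphabet.
        out = []
        for i, ch in enumerate(chromosome):
            if random.random() < rate:
                ch = random.choice(DIGITS if i % 2 == 0 else OPS)
            out.append(ch)
        return ''.join(out)

    print(mutate('1+2*3'))   # e.g. '1+7*3' -- still a valid sequence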

Some explanations of genetic algorithms stick to binary just because they don't want to distract you with the ASCII encoding shenanigans. There are many ways to encode. There is no reason to think perfectly packed binary is always best. O(1/2 n) is still just O(n). Beware of micro-optimizations.

Binary encoding is:

  1. (oftentimes) the most memory-efficient, as most genes can be expressed as a single boolean,
  2. the most efficient for the most popular crossover algorithms, since they require very few boolean algebra operations (see the sketch below).
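
For instance, uniform crossover on packed integer genomes comes down to a handful of mask operations (a Python sketch, assuming fixed-width integer genomes):

    import random

    def uniform_crossover(a, b, width):
        # Each child bit is taken from parent a or b according to a
        # random bit mask; three AND/OR operations per child.
        mask = random.getrandbits(width)
        inv = mask ^ ((1 << width) - 1)   # complement within `width` bits
        return (a & mask) | (b & inv), (b & mask) | (a & inv)

    c1, c2 = uniform_crossover(0b10110010, 0b01101100, 8)
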
Licensed under: CC-BY-SA with attribution