Question

I am currently trying to design a Genetic Programming algorithm that analyses a sequence of characters and assigns a value to that sequence. Below I have made up an example set; every line represents a data point. The values to be trained are real-valued. Example: for the word ABCDE the algorithm should return 1.0.

Example dataset:

ABCDE : 1

ABCDEF : 10

ABCDEGH : 3

ABCDELKA : 50

AASD : 3

The dataset can be as large as needed, since this is all just made up. Let's assume the rule that the GP should figure out is not too complicated and that it is explained by the data.

What I would like the algorithm to do is approximate the values from my dataset when given an input sequence. My problem is that each sequence can consist of a different number of characters. If possible, I would prefer not to have to write fancy descriptors myself.

How can I train my GP (preferably using tinyGP or Python) to build this model?

Since there was so much discussion here, a diagram says a thousand words (schematic: data point → unknown function → result). What I want to do is simply put a data point into a function and get back a value, which is my result. Unfortunately I do not know this function; I just have a dataset with some examples (maybe 1000 of them, just as an example). Now I use the Genetic Programming algorithm to find an algorithm that is able to convert my data point into a result. This is my model. The problem I have in this case is that the data points are of differing lengths. For a fixed length I could simply specify each character in the string as an input parameter, but it beats me what to do with a varying number of input parameters.

Disclaimer: I have run into this problem multiple times during my studies, but we could never work out a solution that performed well (using a window, descriptors, etc.). I would like to use a GP because I like the technique and would like to try it out, but at university we also tried this with ANNs and the like, to no avail. The problem of the variable input size remains.


Solution 2

Traditional genetic programming is not suited to variable-length input.

It occurs to me that some model of evaluation is presupposed in the question.

Consider, for example, that you encode your variable-length input as a single arbitrary-precision value, e.g. for an alphabet of ten symbols:

ABCD = 1234; ABCDEF = 123456

or

ABCD = 0.1234; ABCDEF = 0.123456
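
As a quick illustration, here is a minimal sketch of the fractional encoding in Python, assuming a 9-symbol alphabet so that every symbol maps to a single decimal digit (A=1 ... I=9); the name encode_fraction is just illustrative, and a true arbitrary-precision version would use fractions.Fraction or decimal.Decimal instead of a float:

def encode_fraction(s, alphabet="ABCDEFGHI"):
    # Each symbol becomes one decimal digit, so a string of any length
    # packs into a single value in [0, 1).
    digits = "".join(str(alphabet.index(c) + 1) for c in s)
    return int(digits) / 10 ** len(digits)

print(encode_fraction("ABCD"))    # 0.1234
print(encode_fraction("ABCDEF"))  # 0.123456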

However, if this encoding is not natural for the problem domain, it will be quite hard to evolve a program that deals with such input well.

You could also suppose that the problem can be adequately represented by a genetically derived finite state machine:

F(F(F(F(init(), A), B), C), D) = 1234
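
In code, this view is just a left fold of an evolved transition function over the input string. A toy sketch (the names and the particular transition are made up here to reproduce the example above):

from functools import reduce

def evaluate(init, F, s):
    # Start from init() and fold the transition F(state, symbol)
    # over the input, one symbol at a time.
    return reduce(F, s, init())

value = evaluate(lambda: 0,
                 lambda state, c: state * 10 + ("ABCDEFGHI".index(c) + 1),
                 "ABCD")
print(value)  # 1234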

That is a separate field of study from genetic programming; search around and read research papers, and perhaps you can find a package that does what you want.

Then again, your problem may be best represented by yet another transformation, e.g. the frequency of bigrams; such a transform has a fixed length:

# bigrams
# ABCDE => 1
"AA": 0
"AB": 0.25
"AC": 0
"AD": 0
"AE": 0
"BA": 0
"BC": 0.25
"BD": 0
#... up to end of alphabet ...

(0, 0.25, 0, 0, 0, 0, 0.25, 0, ...., 0, ...) => 1      # ABCDE
(0, 0.20, 0, 0, 0, 0, 0.20, 0, ...., 0.20, ...) => 10  # ABCDEF
# input length N^2

# trigrams
(0, 0.33, 0, 0, ..., 0, ...) => 1      # ABCDE
(0, 0.25, 0, 0, ..., 0.25, ...) => 10  # ABCDEF
# input length N^3
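
A minimal sketch of the bigram transform, assuming the alphabet is known up front; whatever its length, every string becomes a fixed-length vector over all N^2 ordered symbol pairs:

from itertools import product

def bigram_frequencies(s, alphabet="ABCDE"):
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = {p: 0 for p in pairs}
    for a, b in zip(s, s[1:]):   # each adjacent pair of symbols
        counts[a + b] += 1
    total = max(len(s) - 1, 1)   # number of bigrams in the string
    return [counts[p] / total for p in pairs]

print(bigram_frequencies("ABCDE"))  # "AB", "BC", "CD", "DE" each at 0.25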

Bigrams, trigrams, etc. are surprisingly good predictors:

  • capture markov information ("ab" vs "ac")
  • capture relative position ("ab" && "bc" vs "ed" && "bc")
  • capture non-linear semantics ("abab" != "ab" * 2)
  • resistant to shuffled input ("buy new spam" vs "buy spam it's new")

These are often used in natural language problems such as topic detection, author detection, and spam filtering, and in biotech for DNA and RNA sequences, among others.

However, there is no guarantee this approach is applicable to your problem; it truly depends on your problem domain. For example, consider the three-symbol alphabet "1", "0", "+" in an arithmetic domain: the following two inputs have identical bigram frequencies, yet yield different results:

10000+10000 = 20000
1000+100000 = 101000

In this case you need something like a register machine:

init: tmp = 0; res = 0
"0": tmp *= 10
"1": tmp *= 10; tmp += 1
"+": res += tmp; tmp = 0
end: res += tmp
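
That pseudocode translates almost directly into Python; a sketch for the three-symbol alphabet "0", "1", "+":

def run_register_machine(s):
    # Two registers: tmp accumulates the current number, res the total.
    tmp, res = 0, 0
    for c in s:
        if c == "0":
            tmp *= 10
        elif c == "1":
            tmp = tmp * 10 + 1
        elif c == "+":
            res, tmp = res + tmp, 0
    return res + tmp

print(run_register_machine("10000+10000"))  # 20000
print(run_register_machine("1000+100000"))  # 101000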

OTHER TIPS

Since you do not have an explicit fitness function, you will need to treat the genetic algorithm as if it were a classifier, so you will need to come up with a way to evaluate a single chromosome. As others have suggested, this is a pure supervised learning problem rather than an optimization one, but if you still want to go ahead with a GA, here are some steps for an initial approach:

You will need:

1. Description of (how to encode) a valid chromosome

To work with genetic algorithms, all solutions must have the same length (there are more advanced approaches with variable-length encodings, but I won't go into them here). Given that, you will need to find a good encoding. Knowing that your input is a variable-length string, you can encode your chromosome as a lookup table (a dictionary in Python) over your alphabet. However, a dictionary will give you some problems when you apply crossover or mutation operations, so it is better to keep the alphabet and the chromosome encoding split. Borrowing from language models, you can use n-grams, and your chromosome will have one value per entry of the (unigram, bigram, or trigram) alphabet:

.. Unigrams

alphabet = "ABCDE"
chromosome1 = [1, 2, 3, 4, 5]
chromosome2 = [1, 1, 2, 1, 0]

.. Bigrams

alphabet = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE"]
chromosome = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

.. Trigrams

alphabet = ["ABC", "ABD", "ABE"...]
chromosome = as above, a value for each combination
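
A small sketch of this split encoding, assuming integer genes drawn from a fixed range (the helper names are illustrative; the bigram keys use unordered pairs, as in the listing above):

import random
from itertools import combinations

alphabet = "ABCDE"
unigram_keys = list(alphabet)                                  # "A", "B", ...
bigram_keys = ["".join(p) for p in combinations(alphabet, 2)]  # "AB", "AC", ...

def random_chromosome(keys, low=0, high=10):
    # One integer gene per alphabet entry; the keys themselves stay
    # outside the chromosome, so crossover/mutation only touch the genes.
    return [random.randint(low, high) for _ in keys]

print(random_chromosome(unigram_keys))  # e.g. [3, 0, 7, 2, 9]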

2. Decode a chromosome to evaluate a single input

Your chromosome represents an integer value for each element of your alphabet. So, to compute the value of one of your inputs (a variable-length string) under a given chromosome, you will need to try some evaluation functions; the simplest is the sum of the value of each letter:

alphabet = "ABC"
chromosome = [1, 2, 1]
input = "ABBBC"

# acc = accumulated value
value = reduce(lambda acc, x: acc + chromosme[alphabet.index(x)], input, 0)
# Will return ABBBC = 1+2+2+2+1 = 8

3. Fitness function

Your fitness function is just a simple error function. You can use the sum of absolute errors, the sum of squared errors, and so on. A simple evaluation function for a single chromosome:

from functools import reduce

def fitnessFunction(inputs, results, alphabet, chromosome):
    error = 0

    for i in range(len(inputs)):
        # Value of this input string under the chromosome (sum of letter values)
        value = reduce(lambda acc, x: acc + chromosome[alphabet.index(x)], inputs[i], 0)
        diff = abs(results[i] - value)
        error += diff  # or diff**2 if you want squared error

    return error

# A simple call -> INPUTS, EXPECTED RESULTS, ALPHABET, CURRENT CHROMOSOME
fitnessFunction(["ABC", "ABB", "ABBC"], [1, 2, 3], "ABC", [1, 1, 0])
# returned error will be:
# A+B+C   = 1 + 1 + 0     -- expected value = 1 --> error += 1
# A+B+B   = 1 + 1 + 1     -- expected value = 2 --> error += 1
# A+B+B+C = 1 + 1 + 1 + 0 -- expected value = 3 --> error += 0
# This chromosome has an error of 2

Now, using whatever crossover and mutation operators you want (e.g. one-point crossover and bit-flip mutation, sketched below), find the chromosome that minimizes that error.
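
A minimal sketch of those two operators for integer chromosomes, where "bit flip" is generalized to redrawing a gene from its allowed range:

import random

def one_point_crossover(a, b):
    # Cut both parents at the same random point and swap the tails.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate=0.1, low=0, high=10):
    # With probability `rate`, replace a gene with a fresh random value.
    return [random.randint(low, high) if random.random() < rate else g
            for g in chromosome]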

Things you can try to improve the model:

  • Using bigrams or trigrams
  • Changing the evaluation method (currently it is the sum of lookup-table values; it could be a product or something more complex)
  • Using real values in the chromosomes instead of just integers
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow