Question

I am currently trying to design a Genetic Programming algorithm that analyses a sequence of characters and assigns a value to that sequence. Below I have made up an example set; every line represents a data point. The values to be trained are real-valued. Example: for the word ABCDE the algorithm should return 1.0.

Example dataset:

ABCDE : 1

ABCDEF : 10

ABCDEGH : 3

ABCDELKA : 50

AASD : 3

The dataset can be as large as needed, since this is all just made up. Let's assume the rule that the GP should figure out is not too complicated and that it is explained by the data.

What I would like the algorithm to do is approximate the values from my dataset when given an input sequence. My problem is that each sequence can consist of a different number of characters. If possible, I would prefer not to have to write fancy descriptors myself.

How can I train my GP (preferably using tinyGP or Python) to build this model?

Since there was so much discussion here, a diagram says a thousand words (schematic: data point → unknown function → result). What I want to do is simply put a data point into a function and get back a value, which is my result. Unfortunately I do not know this function; I just have a dataset with some examples (maybe 1000 of them, just as an example). Now I use the Genetic Programming algorithm to find an algorithm that is able to convert my data point into a result. This is my model. The problem I have in this case is that the data points are of differing lengths. For a fixed length I could simply specify each character in the string as an input parameter, but it beats me what to do with a varying number of input parameters.

Disclaimer: I have run into this problem multiple times during my studies, but we could never work out a solution that performed well (using a window, descriptors, etc.). I would like to use a GP because I like the technique and would like to try it out, but at university we also tried this with ANNs and the like, to no avail. The problem of the variable input size remains.


Solution 2

Traditional genetic programming is not suited to variable-length input.

It occurs to me that some model of evaluation is presupposed in the question.

Consider, for example, that you encode your variable-length input as a single arbitrary-precision value, e.g. for an alphabet of ten symbols:

ABCD = 1234; ABCDEF = 123456

or

ABCD = 0.1234; ABCDEF = 0.123456
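
As a quick illustration, here is a minimal sketch of the fractional encoding in Python, assuming a 9-symbol alphabet so that every symbol maps to a single decimal digit (A=1 ... I=9); the name encode_fraction is just illustrative, and a true arbitrary-precision version would use fractions.Fraction or decimal.Decimal instead of a float:

def encode_fraction(s, alphabet="ABCDEFGHI"):
    # Each symbol becomes one decimal digit, so a string of any length
    # packs into a single value in [0, 1).
    digits = "".join(str(alphabet.index(c) + 1) for c in s)
    return int(digits) / 10 ** len(digits)

print(encode_fraction("ABCD"))    # 0.1234
print(encode_fraction("ABCDEF"))  # 0.123456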

However, if this encoding is not natural for the problem domain, it will be quite hard to evolve a program that deals with such input well.

You could also suppose that the problem can be adequately represented by a genetically derived finite state machine:

F(F(F(F(init(), A), B), C), D) = 1234
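
In code, this view is just a left fold of an evolved transition function over the input string. A toy sketch (the names and the particular transition are made up here to reproduce the example above):

from functools import reduce

def evaluate(init, F, s):
    # Start from init() and fold the transition F(state, symbol)
    # over the input, one symbol at a time.
    return reduce(F, s, init())

value = evaluate(lambda: 0,
                 lambda state, c: state * 10 + ("ABCDEFGHI".index(c) + 1),
                 "ABCD")
print(value)  # 1234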

That is a separate field of study from genetic programming; search around and read research papers, and perhaps you can find a package that does what you want.

Then again, your problem may be best represented by yet another transformation, e.g. the frequency of bigrams; such a transform has a fixed length:

# bigrams
# ABCDE => 1
"AA": 0
"AB": 0.25
"AC": 0
"AD": 0
"AE": 0
"BA": 0
"BC": 0.25
"BD": 0
#... up to end of alphabet ...

(0, 0.25, 0, 0, 0, 0, 0.25, 0, ...., 0, ...) => 1      # ABCDE
(0, 0.20, 0, 0, 0, 0, 0.20, 0, ...., 0.20, ...) => 10  # ABCDEF
# input length N^2

# trigrams
(0, 0.33, 0, 0, ..., 0, ...) => 1      # ABCDE
(0, 0.25, 0, 0, ..., 0.25, ...) => 10  # ABCDEF
# input length N^3
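
A minimal sketch of the bigram transform, assuming the alphabet is known up front; whatever its length, every string becomes a fixed-length vector over all N^2 ordered symbol pairs:

from itertools import product

def bigram_frequencies(s, alphabet="ABCDE"):
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = {p: 0 for p in pairs}
    for a, b in zip(s, s[1:]):   # each adjacent pair of symbols
        counts[a + b] += 1
    total = max(len(s) - 1, 1)   # number of bigrams in the string
    return [counts[p] / total for p in pairs]

print(bigram_frequencies("ABCDE"))  # "AB", "BC", "CD", "DE" each at 0.25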

Bigrams, trigrams, etc. are surprisingly good predictors:

  • capture markov information ("ab" vs "ac")
  • capture relative position ("ab" && "bc" vs "ed" && "bc")
  • capture non-linear semantics ("abab" != "ab" * 2)
  • resistant to shuffled input ("buy new spam" vs "buy spam it's new")

These are often used in natural language problems such as topic detection, author detection, and spam filtering, and in biotech for DNA and RNA sequences, among others.

However, there is no guarantee this approach is applicable to your problem; it truly depends on your problem domain. For example, consider the three-symbol alphabet "1", "0", "+" in an arithmetic domain: the following two inputs have identical bigram frequencies, yet yield different results:

10000+10000 = 20000
1000+100000 = 101000

In this case you need something like a register machine:

init: tmp = 0; res = 0
"0": tmp *= 10
"1": tmp *= 10; tmp += 1
"+": res += tmp; tmp = 0
end: res += tmp
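
That pseudocode translates almost directly into Python; a sketch for the three-symbol alphabet "0", "1", "+":

def run_register_machine(s):
    # Two registers: tmp accumulates the current number, res the total.
    tmp, res = 0, 0
    for c in s:
        if c == "0":
            tmp *= 10
        elif c == "1":
            tmp = tmp * 10 + 1
        elif c == "+":
            res, tmp = res + tmp, 0
    return res + tmp

print(run_register_machine("10000+10000"))  # 20000
print(run_register_machine("1000+100000"))  # 101000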

OTHER TIPS

Since you do not have an explicit fitness function, you will need to treat the genetic algorithm as if it were a classifier, so you will need to come up with a way to evaluate a single chromosome. As others have suggested, this is a pure supervised learning problem rather than an optimization one, but if you still want to go ahead with a GA, here are some steps for an initial approach:

You will need:

1. Description of (how to encode) a valid chromosome

To work with genetic algorithms, all solutions must have the same length (there are more advanced approaches with variable-length encodings, but I won't go into them here). Given that, you will need to find a good encoding. Knowing that your input is a variable-length string, you can encode your chromosome as a lookup table (a dictionary in Python) over your alphabet. However, a dictionary will give you some problems when you apply crossover or mutation operations, so it is better to keep the alphabet and the chromosome encoding split. Borrowing from language models, you can use n-grams, and your chromosome will have one value per entry of the (unigram, bigram, or trigram) alphabet:

.. Unigrams

alphabet = "ABCDE"
chromosome1 = [1, 2, 3, 4, 5]
chromosome2 = [1, 1, 2, 1, 0]

.. Bigrams

alphabet = ["AB", "AC", "AD", "AE", "BC", "BD", "BE", "CD", "CE", "DE"]
chromosome = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

.. Trigrams

alphabet = ["ABC", "ABD", "ABE"...]
chromosome = as above, a value for each combination
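
A small sketch of this split encoding, assuming integer genes drawn from a fixed range (the helper names are illustrative; the bigram keys use unordered pairs, as in the listing above):

import random
from itertools import combinations

alphabet = "ABCDE"
unigram_keys = list(alphabet)                                  # "A", "B", ...
bigram_keys = ["".join(p) for p in combinations(alphabet, 2)]  # "AB", "AC", ...

def random_chromosome(keys, low=0, high=10):
    # One integer gene per alphabet entry; the keys themselves stay
    # outside the chromosome, so crossover/mutation only touch the genes.
    return [random.randint(low, high) for _ in keys]

print(random_chromosome(unigram_keys))  # e.g. [3, 0, 7, 2, 9]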

2. Decode a chromosome to evaluate a single input

Your chromosome represents an integer value for each element of your alphabet. So, to compute the value of one of your inputs (a variable-length string) under a given chromosome, you will need to try some evaluation functions; the simplest is the sum of the value of each letter:

alphabet = "ABC"
chromosome = [1, 2, 1]
input = "ABBBC"

# acc = accumulated value
value = reduce(lambda acc, x: acc + chromosme[alphabet.index(x)], input, 0)
# Will return ABBBC = 1+2+2+2+1 = 8

3. Fitness function

Your fitness function is just a simple error function. You can use the sum of absolute errors, the sum of squared errors, and so on. A simple evaluation function for a single chromosome:

from functools import reduce

def fitnessFunction(inputs, results, alphabet, chromosome):
    error = 0

    for i in range(len(inputs)):
        # Value of this input string under the chromosome (sum of letter values)
        value = reduce(lambda acc, x: acc + chromosome[alphabet.index(x)], inputs[i], 0)
        diff = abs(results[i] - value)
        error += diff  # or diff**2 if you want squared error

    return error

# A simple call -> INPUTS, EXPECTED RESULTS, ALPHABET, CURRENT CHROMOSOME
fitnessFunction(["ABC", "ABB", "ABBC"], [1, 2, 3], "ABC", [1, 1, 0])
# returned error will be:
# A+B+C   = 1 + 1 + 0     -- expected value = 1 --> error += 1
# A+B+B   = 1 + 1 + 1     -- expected value = 2 --> error += 1
# A+B+B+C = 1 + 1 + 1 + 0 -- expected value = 3 --> error += 0
# This chromosome has an error of 2

Now, using whatever crossover and mutation operators you want (e.g. one-point crossover and bit-flip mutation, sketched below), find the chromosome that minimizes that error.
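
A minimal sketch of those two operators for integer chromosomes, where "bit flip" is generalized to redrawing a gene from its allowed range:

import random

def one_point_crossover(a, b):
    # Cut both parents at the same random point and swap the tails.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate=0.1, low=0, high=10):
    # With probability `rate`, replace a gene with a fresh random value.
    return [random.randint(low, high) if random.random() < rate else g
            for g in chromosome]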

Things you can try to improve the model:

  • Using bigrams or trigrams
  • Changing the evaluation method (currently it is the sum of lookup-table values; it could be a product or something more complex)
  • Using real values in the chromosomes instead of just integers
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow