Creating a generic mathematical formula using a genetic algorithm

https://datascience.stackexchange.com/questions/75193

11-12-2020
|

题

Assuming all of the following;

I have 4 known numbers, all within a 0-400 range, like this:

Variable1   Variable2   Variable3   Variable4
0-400       0-400       0-400       0-400

I know that there is a mathematical relationship between the numbers.
I would like to use a genetic algorithm (computer code) to estimate/approximiate Variable2 and Variable3 based on Variable1 and Variable4.
Also, importantly, assume that there are many input samples and that each sample will differ slightly. Thus, a genetic algorithm optimization of "a mathematical formula/algorithm" to estimate/approximate Variable2 and Variable3 in all cases becomes possible.
(In other words, the genetic algorithm will be able optimize the mathematical formula towards the already-known Variable2 and Variable3 across many input samples, each with a similar though slightly different mathematical formula.)

How can I then write the following into a genetic algorithm:

Variable2=?
Variable3=?

Where ? could be any mathematical function (+/-/*/:/√/^2/cos/sin/tan/etc.) involving Variable2 and Variable3

In other words; I would like the genetic algorithm to build a generic mathematical formula.

How can I define Variable2 and Variable3 as the outcome of a mathematical formula so that estimation by a computer algorithm becomes possible?

I am not sure how to approach this. The genetic algorithm software I use can use as many variables as is needed, and they can be in disparate ranges.

So for example, I could write my algorithm like this easily;

Variable2=Variable1(op)Variable4
Variable3=Variable1(op)Variable4

Where Variable1 is the first variable for the genetic algorithm, with a range of 0-400, and Variable4 is the second variable for the genetic algorithm, with a range of 0-400, and finally (op) is the third variable for the genetic algorithm, for example with a range of 1-4 where 1 stands for +, 2 for -, 3 for *, 4 for : etc.

However, the complexity of this algorithm is very limited and crude; it is not optimized towards a nice and complex real estimation algorithm. Also, as soon as a secondary operator is introduced, for example:

Variable2=[Variable1 or Variable4](op)[Variable1 or Variable4](op)[Variable1 or Variable4]
Variable3=[Variable1 or Variable4](op)[Variable1 or Variable4](op)[Variable1 or Variable4]

The coding complexity for this would start to increase quickly, and there may be a need to use ( and ) to prioritize mathematical calculations, etc. The coding complexity for even more complex calculations becomes seemingly unmanageable.

Is there a better and more straightforward way to let the genetic algorithm approximate/estimate Variable2 and Variable3 based on Variable1 and Variable4 into an overall optimized generic mathematical formula/algorithm?

解决方案

The function approximation method is called "Symbolic Regression" or "Gene Expression Programming" (GEP). If you don't have to re-invent the wheel, there are few shareware (Eureqa, GeneXproTools, DTReg) and one python library.

You can get the trial version of Eureqa from here. However, the academic version is free and fully functional. You can also get the DEMO version of GeneXproTools, from here, or DTReg from here. I used to use them for few years, but I recommend to use Eureqa. It is very easy to use and can be used for both classification and regression. All the genetic algorithm settings are set behind the scene so you only need to select the functions and mathematical operations you need. You can also define your own custom fitness function. In addition, it is multi-threaded and also can be run on AWS cloud, and it converges very quickly.

GeneXproTools and DTReg are for advanced users, but I personally don't recommend them, especially the GeneXproTools. It is single-threaded and its development seems to be stopped since 2015.

If you can code in python, you can also use geppy library. The Github repo is here. I've never used it because I have done whatever I needed with Eureqa.

Final note: If you don't have to solve your problem programmatically, go for Eureqa only.

其他提示

I would suggest to use a sequence of operations of the form op1(var) op2 op1(var) op2 op1(var) ... where op1 refers to a unary operation and op2 refers to a binary operation. Making unary operations mandatory gets rid of a distinction of cases. If you don't want a unary operation in front of a variable, just use the identity function.

I wrote two implementations for unary composition of a list/tuple of unary operations. One in Haskell, one in Python:

square:: Num a => a -> a
square x = x * x

compose_multiple:: [a -> a] -> (a -> a)
compose_multiple []     = (\x -> x)                    -- for the empty case, use the identity
compose_multiple (f:fs) = f . compose_multiple fs  -- recursive case

This gives us 256 for

compose_multiple [square, square, square] 2

In Python:

def compose(op1, op2):
    def result(x):
        return op1(op2(x))
    return result

def compose_multiple(operations: tuple):
    if len(operations) == 1:
            return operations[0]
    return compose(compose_multiple(operations[:-1]), operations[-1])

This gives us 256 for

def square(x):
  return x * x
compose_multiple((square, square, square))(2)

An implementation for the composition of binary operations should work on the lines of the unary case.

The complexity of the final code can be reduced by using infix notation (which is easy to do in Haskell using parenthesis () in the function definition). I have no experience using infix notation in Python. Maybe https://pypi.org/project/infix/ might help.

许可以下： CC-BY-SA 和归因

不隶属于 datascience.stackexchange