Using genetic programming to estimate probability

https://stackoverflow.com/questions/13010338

13-07-2021
|

Question

I would like to use a genetic program (gp) to estimate the probability of an 'outcome' from an 'event'. To train the nn I am using a genetic algorithm.

So, in my database I have many events, with each event containing many possible outcomes.

I will give the gp a set of input variables that relate to each outcome in each event.

My questions is - what should the fitness function be in the gp be ????

For instance, right now I am giving the gp a set of input data (outcome input variables), and a set of target data (1 if outcome DID occur, 0 if outcome DIDN'T occur, with the fitness function being the mean squared error of the outputs and targets). I then take the sum of each output for each outcome, and divide each output by the sum (to give the probability). However, I know for sure that this is not the right way to be doing this.

For clarity, this is how I am CURRENTLY doing this:

I would like to estimate the probability of 5 different outcomes occurring in an event:

Outcome 1 - inputs = [0.1, 0.2, 0.1, 0.4] 
Outcome 1 - inputs = [0.1, 0.3, 0.1, 0.3] 
Outcome 1 - inputs = [0.5, 0.6, 0.2, 0.1] 
Outcome 1 - inputs = [0.9, 0.2, 0.1, 0.3] 
Outcome 1 - inputs = [0.9, 0.2, 0.9, 0.2]

I will then calculate the gp output for each input:

Outcome 1 - output = 0.1 
Outcome 1 - output = 0.7 
Outcome 1 - output = 0.2 
Outcome 1 - output = 0.4 
Outcome 1 - output = 0.4

The sum of the outputs for each outcome in this event would be: 1.80. I would then calculate the 'probability' of each outcome by dividing the output by the sum:

Outcome 1 - p = 0.055 
Outcome 1 - p = 0.388 
Outcome 1 - p = 0.111 
Outcome 1 - p = 0.222 
Outcome 1 - p = 0.222

Before you start - I know that these aren't real probabilities, and that this approach does not work !! I just put this here to help you understand what I am trying to achieve.

Can anyone give me some pointers on how I can estimate the probability of each outcome ? (also, please note my maths is not great)

Many thanks

Solution

I understand the first part of your question: What you described is a classification problem. You're learning if your inputs relate to whether an outcome was observed (1) or not (0).

There are difficulties with the second part though. If I understand you correctly you take the raw GP output for a certain row of inputs (e.g. 0.7) and treat it as a probability. You said this doesn't work, obviously. In GP you can do classification by introducing a threshold value that splits your classes. If it's bigger than say 0.3 the outcome should be 1 if it's smaller it should be 0. This threshold isn't necessarily 0.5 (again it's just a number, not a probability).

I think if you want to obtain a probability you should attempt to learn multiple models that all explain your classification problem well. I don't expect you have a perfect model that explains your data perfectly, respectively if you have you wouldn't want a probability anyway. You can bag these models together (create an ensemble) and for each outcome you can observe how many models predicted 1 and how many models predicted 0. The amount of models that predicted 1 divided by the number of models could then be interpreted as a probability that this outcome will be observed. If the models are all equally good then you can forget weighing between them, if they're different in quality of course you could factor these into your decision. Models with less quality on their training set are less likely to contribute to a good estimate.

So in summary you should attempt to apply GP e.g. 10 times and then use all 10 models on the training set to calculate their estimate (0 or 1). However, don't force yourself to GP only, there are many classification algorithms that can give good results.

As a sidenote, I'm part of the development team of a software called HeuristicLab which runs under Windows and with which you can run GP and create such ensembles. The software is open source.

OTHER TIPS

AI is all about complex algorithms. Think about it, the downside is very often, that these algorithms become black boxes. So the counterside to algoritms, such as NN and GA, are they are inherently opaque. That is what you want if you want to have a car driving itself. On the other hand this means, that you need tools to look into the black box.

What I'm saying is that GA is probably not what you want to solve your problem. If you want to solve AI types of problems, you first have to know how to use standard techniques, such as regression, LDA etc.

So, combining NN and GA is usually a bad sign, because you are stacking one black box on another. I believe this is bad design. An NN and GA are nothing else than non-linear optimizers. I would suggest to you to look at principal component analysis (PDA), SVD and linear classifiers first (see wikipedia). If you figure out to solve simple statistical problems move on to more complex ones. Check out the great textbook by Russell/Norvig, read some of their source code.

To answer the questions one really has to look at the dataset extensively. If you are working on a small problem, define the probabilities etc., and you might get an answer here. Perhaps check out Bayesian statistics as well. This will get you started I believe.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow