Question

I am trying to solve a classification problem using the Matlab GPTIPS framework. So far I have managed to build a reasonable data representation and fitness function, and I get an average per-class accuracy near 65%.

What I need now is some help with two difficulties:

  1. My data is imbalanced. I am solving a binary classification problem, and only 20% of the data belongs to class 1 while the other 80% belongs to class 0. I used prediction accuracy as my fitness function at first, but it performed really badly. The best I have now is

    Fitness = 0.5*(PositivePredictiveValue + NegativePredictiveValue) - const*ComplexityOfSolution

Please advise how I can improve this function to correct for the class imbalance. (A sketch of how I compute the current fitness appears after this list.)

  2. The second problem is overfitting. I divided my data into three parts: training (70%), testing (20%), and validation (10%). I train each chromosome on the training set, then evaluate its fitness on the testing set. This routine allows me to reach a fitness of 0.82 on the test data for the best individual in the population, but the same individual's result on the validation data is only 60%. I added a validation check for the best individual each time before a new population is generated: I compare its fitness on the validation set with its fitness on the test set, and if the difference is more than 5%, I increase the penalty for solution complexity in my fitness function (sketched below, after this list). But it didn't help. I could also evaluate all individuals on the validation set during each generation and simply remove the overfitted ones, but then I don't see any difference between my test and validation data. What else can be done here?
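
For clarity, here is a minimal sketch of how I compute the fitness from question 1 (illustrative Matlab, not actual GPTIPS code; yTrue and yPred are 0/1 vectors, complexity is the node count of the evolved tree, and lambda is the hand-tuned constant):

    % Minimal sketch of the class-balanced fitness from question 1.
    % yTrue, yPred: 0/1 column vectors; complexity: node count of the
    % evolved tree; lambda: hand-tuned penalty constant.
    function f = balancedFitness(yTrue, yPred, complexity, lambda)
        tp = sum(yPred == 1 & yTrue == 1);   % true positives
        fp = sum(yPred == 1 & yTrue == 0);   % false positives
        tn = sum(yPred == 0 & yTrue == 0);   % true negatives
        fn = sum(yPred == 0 & yTrue == 1);   % false negatives
        ppv = tp / max(tp + fp, 1);          % guard against division by zero
        npv = tn / max(tn + fn, 1);          % when a class is never predicted
        f = 0.5 * (ppv + npv) - lambda * complexity;
    end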
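
And here is a sketch of the validation check from question 2 (again illustrative; best, fitnessOn, testSet, and valSet are hypothetical placeholder names, not GPTIPS API):

    % Sketch of the adaptive complexity penalty from question 2.
    % fitnessOn, best, testSet, valSet are hypothetical placeholders.
    % After each generation, compare the best individual's fitness on the
    % test and validation sets and raise the complexity penalty when the
    % gap suggests overfitting.
    gap = fitnessOn(best, testSet) - fitnessOn(best, valSet);
    if gap > 0.05                % fitness drops more than 5% on unseen data
        lambda = lambda * 1.5;   % strengthen the parsimony pressure
    end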

UPDATE:

For my second question I found a great article, "Experiments on Controlling Overfitting in Genetic Programming". Along with the authors' own ideas on dealing with overfitting in GP, it contains an impressive review with references to many different approaches to the issue, so now I have a lot of new ideas to try on my problem. Unfortunately, I still can't find anything on selecting a proper fitness function that takes the unbalanced class proportions in my data into account.


Solution

65% accuracy is very bad when the baseline (classifying everything as the majority class) would already give 80%. You need to at least reach that baseline in order to have a better model than the naive one.

I would not penalize complexity. Rather, limit the tree size (if possible). You could also identify simpler models during the run, for example by storing a Pareto front of models with quality and complexity as the two fitness values.
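
As an illustration, here is a minimal sketch of such a Pareto filter in plain Matlab (not HeuristicLab code; it assumes higher quality and lower complexity are better):

    % Keep only non-dominated models over (quality, complexity) pairs.
    % quality: higher is better; complexity: lower is better.
    function keep = paretoFront(quality, complexity)
        n = numel(quality);
        keep = true(n, 1);
        for i = 1:n
            for j = 1:n
                % j dominates i if it is no worse in both objectives
                % and strictly better in at least one.
                if quality(j) >= quality(i) && complexity(j) <= complexity(i) ...
                        && (quality(j) > quality(i) || complexity(j) < complexity(i))
                    keep(i) = false;
                    break;
                end
            end
        end
    end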

In HeuristicLab we have integrated GP-based classification that can do these things. There are several options: you can choose to use MSE or R² as the objective for classification. In the latest trunk build there is also an evaluator that optimizes accuracy directly (strictly speaking, it optimizes the classification penalties). Optimizing MSE means each class is assigned a numeric value (1, 2, 3, ...) and the mean squared error from that value is minimized. This may not seem optimal at first, but it works; optimizing accuracy directly may lead to faster overfitting. There is also a formula simplifier that lets you prune and shrink your formula (and view the effects of doing so).
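
A minimal sketch of that MSE-based scheme (illustrative Matlab, not HeuristicLab code; yHat is the raw real-valued output of the evolved model):

    % Map classes to numeric targets, minimize squared error during
    % evolution, then classify by the nearest class value.
    target = yTrue + 1;                      % class 0 -> 1, class 1 -> 2
    mseFitness = mean((yHat - target).^2);   % the fitness to minimize
    yPred = yHat > 1.5;                      % nearest class value, as 0/1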

Also, does it need to be GP? Have you tried Random Forest classification or Support Vector Machines as well? Random forests are pretty fast and usually work quite well.
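
For reference, both are quick to try with Matlab's Statistics and Machine Learning Toolbox in recent versions (a sketch, assuming X is an n-by-p feature matrix and y a 0/1 label vector; the cost matrix is one way to counter the 20/80 imbalance):

    % Random forest baseline (200 trees).
    rf = TreeBagger(200, X, y, 'Method', 'classification');
    yRF = str2double(predict(rf, X));   % predict returns a cell array of labels

    % SVM baseline with an RBF kernel. The cost matrix penalizes missing a
    % class-1 sample four times as much as a false alarm (Cost(i,j) is the
    % cost of predicting class j when the true class is i).
    svm = fitcsvm(X, y, 'KernelFunction', 'rbf', 'Standardize', true, ...
                  'Cost', [0 1; 4 0]);
    ySVM = predict(svm, X);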

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow