Predictive analysis of rare events

https://datascience.stackexchange.com/questions/8646

16-10-2019

Question

I'm trying to predict rare events, where fewer than 1% of cases are positive. Specifically, I want to predict whether a subject will have 0, 1, 2, ..., 6, or more than 6 failures (there are cases in all of those categories).

I've tried several algorithms:

decision trees
random forest
AdaBoost
grouping using k-means clustering and finding associations with failures (which group has the most failures)

In every case, the model either always predicts no failure or has too much variance (leading to poor results on the cross-validation set).

Do you know any machine learning algorithms which are better suited for rare events?

Or are such bad results surprising for these algorithms, which would suggest that my feature list is the problem?

Thanks a lot.

Solution

When your data set is unbalanced, the algorithm weights its success on each data point equally, so the majority class ends up counting for much more than the minority class. The typical solution is to downsample the majority class until it is the same size as the minority class. A similar alternative is to adjust the cost function so that errors on the minority class are weighted more heavily.
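
As a rough illustration of both ideas, here is a minimal sketch in plain Python. The helper names (`downsample_to_minority`, `balanced_class_weights`) and the toy data are made up for this example; the weighting formula is the common inverse-frequency heuristic, which is one reasonable choice rather than the only one.

```python
import random
from collections import Counter, defaultdict


def downsample_to_minority(rows, labels, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    # Size of the rarest class: every class is cut down to this many rows.
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy data (hypothetical): 990 subjects with no failure, 10 with one failure.
labels = [0] * 990 + [1] * 10
rows = [[i] for i in range(len(labels))]

small_rows, small_labels = downsample_to_minority(rows, labels)
weights = balanced_class_weights(labels)

print(Counter(small_labels))  # class counts after downsampling
print(weights)                # per-class weights for a weighted cost function
```

With more than two classes (0, 1, ..., 6, > 6 failures), downsampling to the rarest class can throw away most of the data, so the weights may be the more practical option. Many libraries accept such weights directly, either per class or per sample.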

Licensed under: CC-BY-SA with attribution
