Question

The scikit-learn documentation at http://scikit-learn.org/0.9/modules/feature_selection.html contains this warning: "Beware not to use a regression scoring function with a classification problem."

I am trying to find the best features for a regression problem, using f_regression as the scoring function. However, it is extremely memory hungry: my 8 GB machine hangs and I eventually get a memory error.

I have used chi2 as the scoring function for the same problem and it runs very fast. Is the reverse of the warning also true? If not, can I use chi2 as a scoring function for a regression problem?

Solution

The χ² test builds a contingency table of n_classes times n_features. In a regression model, there is no notion of n_classes. The only way to make it work would be to bin your y values, do feature selection, then train a regression model on the original y and the reduced feature set. There is no support for this in scikit-learn, so you'll have to program it yourself.
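The bin, select, then regress workaround described above could be sketched roughly as follows. This is only an illustration on synthetic data: the number of bins, the value of k, the quantile-based binning, and the choice of LinearRegression are all assumptions, not something the answer prescribes. Note also that chi2 expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Synthetic, non-negative count features (chi2 requires X >= 0).
X = rng.poisson(lam=5.0, size=(500, 20)).astype(float)
# In this toy setup only the first three features drive the continuous target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=500)

# Step 1: bin y into pseudo-classes using quantile edges (n_bins is arbitrary).
n_bins = 5
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
y_binned = np.digitize(y, edges)  # integer labels 0 .. n_bins - 1

# Step 2: chi2 feature selection against the binned target (k is arbitrary).
selector = SelectKBest(chi2, k=5).fit(X, y_binned)
X_reduced = selector.transform(X)
selected = selector.get_support(indices=True)

# Step 3: train the regression model on the ORIGINAL y and the reduced features.
model = LinearRegression().fit(X_reduced, y)
print("selected feature indices:", selected)
```

How well this works depends heavily on the binning: too few bins discards information about y, while too many leaves sparse contingency counts.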

OTHER TIPS

No, you should not use the chi2 scoring function, as it has no proven guarantee of being accurate for a regression model. Either investigate the memory problem in your f_regression solution or use another approach, such as recursive feature elimination or PCA (Principal Component Analysis):

http://en.wikipedia.org/wiki/Principal_component_analysis

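I would personally advise PCA; in my experience it gives very robust results. Keep in mind that PCA replaces the original features with linear combinations of them rather than selecting a subset.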

I'd suggest you use Lasso if your problem is regression. Lasso is just linear regression with an L1 regularization penalty built in; the penalty has the effect of driving many feature weights to exactly zero, so the features with non-zero weights are the selected ones.

scikit-learn has an implementation of Lasso (sklearn.linear_model.Lasso).
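As a minimal sketch of using Lasso for feature selection, the following fits sklearn.linear_model.Lasso on synthetic data and keeps the features whose coefficients are non-zero. The data, the alpha value, and the standardization step are assumptions made for illustration; in practice alpha would usually be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic data: 30 features, only the first two influence y.
X = rng.normal(size=(200, 30))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# alpha controls the strength of the L1 penalty (0.1 is an arbitrary choice).
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Features with non-zero weights are the ones Lasso kept.
kept = np.flatnonzero(lasso.coef_)
print("features with non-zero weight:", kept)
```

A larger alpha zeroes out more weights; a smaller one keeps more features.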

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow