Question

I'm a beginner to using statsmodels & I'm also open to using other Python based methods of solving my problem:

I have a data set with about 85 features, some of which are highly correlated. When I run the OLS method I get a helpful 'strong multicollinearity problems' warning, as I might expect.

I've previously run this data through Weka, which as part of the regression classifier has an eliminateColinearAttributes option.

How can I do the same thing, that is, get the model to choose which attributes to use instead of having them all in the model? Thanks!

Solution

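Note that scipy.stats.linregress fits only a single explanatory variable, so it will not handle a regression with many features; for that you need a multiple-regression routine such as the statsmodels OLS you are already running. Check out this nice example, which has a good explanation.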

The eliminateColinearAttributes option in Weka is simply an algorithm that Weka implements to deal with this problem. Here you need to implement an iterative procedure yourself: among a set of highly correlated variables, drop the one with the highest p-value, run the regression again, and repeat until the multicollinearity is gone.
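A minimal sketch of that loop, using only NumPy. It is not what Weka does internally; the function names, the 0.9 correlation threshold and the toy data are all assumptions made for illustration. Within a single fit every coefficient shares the same residual degrees of freedom, so the variable with the smallest |t| is the one with the highest p-value, which lets the sketch skip computing p-values. With statsmodels you could read the fitted results' `pvalues` attribute instead.

```python
import numpy as np


def abs_t_statistics(X, y):
    """Fit OLS with an intercept and return |t| for each column of X."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)          # residual variance
    cov = sigma2 * np.linalg.pinv(A.T @ A)        # coefficient covariance
    se = np.sqrt(np.diag(cov))
    return np.abs(beta / se)[1:]                  # skip the intercept


def drop_collinear(X, y, names, threshold=0.9):
    """Repeatedly drop one variable from the most correlated pair.

    `threshold` is an assumed cut-off on |correlation|, not a standard value.
    Assumes no column of X is constant.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < threshold:
            break                                  # no strongly correlated pair left
        t = abs_t_statistics(X[:, keep], y)
        # Smaller |t| means larger p-value within the same fit.
        worst = i if t[i] < t[j] else j
        del keep[worst]
    return [names[c] for c in keep]


# Toy data (made up): "a_copy" is nearly a duplicate of "a".
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
a_copy = a + 0.01 * rng.normal(size=n)
X = np.column_stack([a, a_copy, b])
y = 2.0 * a + 3.0 * b + rng.normal(size=n)

kept = drop_collinear(X, y, ["a", "a_copy", "b"])
print(kept)
```

Pairwise correlation only catches collinearity between two variables at a time; a variable that is a combination of several others would need a different check.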

There is no single right way to do this; several techniques exist. It is also good practice to look at each group of highly correlated variables and choose by hand which ones to omit, so that the resulting model still makes sense.
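To choose by hand you first need to see which variables move together. The following is a small sketch that lists pairs whose absolute correlation exceeds a cut-off; the function name, the 0.9 threshold and the column names are hypothetical.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, correlation) for strongly correlated column pairs."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Made-up data: the same height recorded in two units, plus an unrelated column.
rng = np.random.default_rng(1)
height_cm = rng.normal(170.0, 10.0, size=100)
height_in = height_cm / 2.54
weight_kg = rng.normal(70.0, 8.0, size=100)
X = np.column_stack([height_cm, height_in, weight_kg])

pairs = correlated_pairs(X, ["height_cm", "height_in", "weight_kg"])
for name_i, name_j, r in pairs:
    print(name_i, name_j, round(r, 3))
```

From a list like this you can decide, using what you know about the data, which member of each pair is more meaningful to keep.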

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow