Improve a regression model and feature selection

https://datascience.stackexchange.com/questions/9505

16-10-2019

Question

I am working in Azure ML Studio and trying to create a regression model to predict a numerical value. I will describe my features and what I have done so far.

My data has about 3 million rows:

Features:

8 integer features from 1 to 25
2 boolean features with 0 and 1
3 integer features from 1 to 10
2 integer features, from 0 to 500.000 and from 0 to 1.000.000 respectively, with about 4.500 unique values
1 integer feature from 20 to 50
1 integer feature from 1 to 15
1 integer feature from 0 to 100

Label:

Integer from 10.000 to 100.000.000 with about 5.000 unique values

What I have done:

Split the dataset into 80% (train) and 20% (test). Then I split the training dataset again into 60% (actual train) and 40% (validation).
Normalize the features with many unique values (4th bullet in the above list)
Train a model of Boosted Decision Tree Regression.
Use the Sweep Parameters module to find the best combination
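
To make the nested split above concrete, here is a minimal sketch in plain Python. It is not how Azure ML Studio implements its split module; the helper name split_rows, the seeds, and the 1000-row stand-in dataset are assumptions made only for illustration.

```python
import random

def split_rows(rows, first_fraction, seed=0):
    """Shuffle a copy of rows and cut it into two lists.

    first_fraction is the share of rows that goes into the first list.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the real data: row ids only (the real set has ~3 million rows).
rows = list(range(1000))

# First split: 80% train, 20% test.
train_full, test = split_rows(rows, 0.8, seed=1)

# Second split of the training part: 60% actual train, 40% validation.
train, validation = split_rows(train_full, 0.6, seed=2)

print(len(train), len(validation), len(test))
```

With this scheme only 0.8 x 0.6 = 48% of all rows are used to fit the model, 32% go to validation, and 20% are held out for testing.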

I also tried Neural Networks and Bayesian Linear Regression, but BDTR gave the best score.

I tried excluding columns, starting with only a few (based on what I think will affect the model) and then adding more columns one by one.

However, the lowest MSE I could achieve was 1.500.000 (and many of the scored values were negative).

So I am wondering what other techniques I could use to improve the model.

Solution

I agree with @Hoap. The number of features you have may be low for the number of training observations. Instead of excluding columns, see whether you are missing features: think Feature Engineering rather than Feature Selection.
However, if you are looking for Feature Selection, then Azure ML has a Feature Selection Module with the option to specify how many features you'd like to keep.
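
As a rough illustration of what "keep the k best features" can mean, here is a small filter-style sketch that ranks features by the absolute Pearson correlation with the label and keeps the top k. This is a swapped-in, simplified technique and not the Azure module's own code; the column names, the toy values, and the helper top_k_features are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def top_k_features(columns, label, k):
    """Return the k column names with the largest |correlation| to the label."""
    scores = {name: abs(pearson(values, label)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data with hypothetical column names.
columns = {
    "feat_a": [1, 2, 3, 4, 5, 6],
    "feat_b": [3, 1, 4, 1, 5, 9],
    "feat_c": [1, 0, 0, 1, 0, 1],
}
label = [12, 19, 31, 42, 48, 61]

kept, scores = top_k_features(columns, label, k=2)
print(kept)
```

A correlation filter like this only sees linear, one-feature-at-a-time relationships, so a feature that matters only in combination with another can be ranked low.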

Some simple verifications to do before you jump into modeling:

Visualize your dataset for any non-linear relationships.
You could also perform a simple correlation analysis to check for multi-collinearity.
I also think that normalizing all of your data to the range 0 to 1 would help, so that values are consistent and comparable between features.
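
The last two checks can be sketched in a few lines of plain Python: min-max scaling of every column to the range 0 to 1, and a pairwise correlation scan that flags strongly correlated feature pairs. The 0.9 threshold, the column names, and the toy values are assumptions chosen for the example, not recommendations.

```python
import math
from itertools import combinations

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(columns, threshold=0.9):
    """List feature pairs whose |correlation| is at or above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Toy data with hypothetical column names and very different ranges.
columns = {
    "small_range": [1, 2, 3, 4, 5],
    "large_range": [200000, 410000, 590000, 800000, 1000000],
    "flag": [0, 1, 0, 1, 0],
}

scaled = {name: min_max_scale(values) for name, values in columns.items()}
pairs = collinear_pairs(scaled)
print(pairs)
```

Pearson correlation does not change under min-max scaling, so the scan gives the same result on raw or scaled columns; the scaling matters for models that are sensitive to feature magnitude.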

Hopefully one of these will show some unexpected pattern in your data. I apologize if you've already performed these checks. Just wanted to put them out there.

Looks like you pretty much used every regression model in the Azure ML library.

OTHER TIPS

I think your next option is to add more features. You have a huge number of training examples, which is good, but the number of features is very low. Adding more features is one of the most commonly used ways to improve performance in machine learning.

Furthermore, it would be good to understand how your features affect your model. Imagine you have a linear model like y = theta1*feat1 + theta2*feat2 + theta3*feat3. If theta3 is close to 0, and the features are on comparable scales, then feat3 has little effect on the model.
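
Here is a small sketch of that idea, assuming a plain least-squares fit without an intercept, solved through the normal equations. The data is synthetic: the label is built as 3*feat1 - 2*feat2 + 0*feat3, so the fitted theta3 should come out close to 0. The helpers solve and fit_linear are illustrative names and not part of any library.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(rows, y):
    """Least-squares thetas (no intercept) from X^T X theta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: three features already on the same 0 to 1 scale.
rng = random.Random(0)
rows = [[rng.random() for _ in range(3)] for _ in range(50)]

# Assumed ground truth for the example: feat3 does not influence y at all.
y = [3.0 * f1 - 2.0 * f2 + 0.0 * f3 for f1, f2, f3 in rows]

thetas = fit_linear(rows, y)
print([round(t, 6) for t in thetas])
```

Comparing coefficients this way is only meaningful when the features share a scale; a feature measured in the hundreds of thousands can have a tiny theta and still dominate the prediction.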

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange