How to choose best model for Regression?

https://datascience.stackexchange.com/questions/73193

10-12-2020
|

Question

I'm building a model to predict the flight delay. My dataset contains the following columns:

FL_DATE (contains months(1-12)), OP_CARRIER (One hot encoded data of Carrier names), ORIGIN(One hot encoded data of Origin Airport), Dest(one-hot encoded data of Dest Airport), CRS_DEP_TIME(Intended time of departure ex: 1015), DEP_TIME(Actual time of departure ex: 1017),DEP_DELAY(the difference between crs-dep ex: -2), ARR_DELAY(arrival delay ex: -2)

My target variable is ARR_DELAY. After checking my data, I have decided it is a regression problem. However, I'm not sure what method do I need to use for selecting the appropriate columns. On the other hand, I was plotting each column with ARR_DELAY to check their relation and got something like this: FL_TIME vs ARR_DELAY

In such scenarios, if I have to build a model for such data which regression technique should I use?

PS: I'm new to Machine Learning. Please correct me If I'm heading in the wrong direction

Solution

One of my favorite tools for feature selection is the Random Forest. Consider giving it a try:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Random Forests are generally classifiers but they have the added advantage of being able to find and measure the variable's importance. It can be a long process but helpful. By using 'Gini importance' as a measure, RF can provide a bar chart that gives a relative comparison of the features with respect to which feature is best at separating signal from the noise. Check out this further discussion:
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#workings

OTHER TIPS

I think a better option will be that you try different regression techniques (it will be good for your learning curve) and see advantages and disadvantage.

For subset selection purposes few options are:

1- Best subset selection : in which you start with a null model and fill all possible combination of models and select best model on the basis of AIC, BIC or adjusted R2.

2-Forward selection: You start with a null model as well and add feature one by one to see which one gives lowest RSS or highest R2. Then add another feature again until you have the best model.

3-Backward selection: you start with a model with all p predictors remove one predictor evaluate adjusted R2 or AIC, BIC. do it again and again until you have a satisfactory model.

You don't have to do it manually both R and python have library that will help you in this.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange