Question

I am doing an exercise for a Machine Learning System module in Python. It takes a dataset of cars (cylinders, year, consumption, ...) and asks for a model whose target variable is gasoline consumption. Since the dataset has three categorical variables, I generated dummy variables for them.

[image: the dataset after generating the dummy variables]

In the exercise I need to eliminate the variables with multicollinearity, so I used the method shown in my course notes:

import pandas as pd
from sklearn.linear_model import LinearRegression

def calculateVIF(data):
    # Compute the VIF of every column by regressing it on all the other columns
    features = list(data.columns)
    model = LinearRegression()

    # One-row table that will hold the VIF of each feature
    result = pd.DataFrame(index=['VIF'], columns=features, dtype=float)

    for y_feature in features:
        # Predictors: every column except the one currently used as target
        x_features = [f for f in features if f != y_feature]

        model.fit(data[x_features], data[y_feature])

        # VIF = 1 / (1 - R^2); a perfectly collinear column gives R^2 = 1
        r2 = model.score(data[x_features], data[y_feature])
        result[y_feature] = 1 / (1 - r2) if r2 < 1 else float('inf')

    return result

When I call the function, it calculates a VIF for each variable:

[image: table of the resulting VIF values]
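
For reference, the quantity the function computes for feature $j$ can be written as below, where $R_j^2$ is the coefficient of determination obtained by regressing feature $j$ on all the other features (this is what `model.score` returns for a `LinearRegression`):

$$VIF_j = \frac{1}{1 - R_j^2}$$

So $VIF_j = 5$ corresponds to $R_j^2 = 0.8$ and $VIF_j = 10$ to $R_j^2 = 0.9$.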

My course notes say:

  • $VIF>5$ is a high value.
  • $VIF>10$ is a very high value.
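
To make the two thresholds concrete, here is a minimal sketch that labels each feature of a one-row VIF table shaped like the one returned by `calculateVIF`. The VIF numbers and the helper name `flag_vif` are made up for illustration; they are not the values from my dataset.

```python
import pandas as pd

# Hypothetical VIF values, shaped like the one-row table from calculateVIF
vif = pd.DataFrame(
    {"year": [1.8], "weight": [7.2], "cylinders_8": [14.5]},
    index=["VIF"],
)

def flag_vif(vif_table, high=5.0, very_high=10.0):
    """Label each feature using the two thresholds from the course notes."""
    row = vif_table.loc["VIF"].astype(float)
    labels = pd.Series("ok", index=row.index)
    labels[row > high] = "high"            # VIF > 5
    labels[row > very_high] = "very high"  # VIF > 10 overrides "high"
    return labels

labels = flag_vif(vif)
print(labels)
```

The thresholds are parameters, so a different rule of thumb can be tried without touching the function body.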

What should I do? Do I need to remove the variables with $VIF>10$ before fitting the model?

The problem I see is that, for my categorical variable cylinders, only cylinders_5 has a VIF under 10. Should I remove the other dummies and keep only cylinders_5?
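
One thing that may be relevant here, stated as an assumption because the real data are not shown: if every level of a categorical variable gets its own dummy column, the dummies sum to 1 in every row, so each one is an exact linear combination of the others plus the intercept, and its VIF blows up. The sketch below reproduces that effect on synthetic data (the `weight` and `cylinders` values are invented) and compares it with `pd.get_dummies(..., drop_first=True)`. It uses a NumPy least-squares fit instead of `LinearRegression` only to keep the example dependency-light; `vif_series` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def vif_series(data):
    """VIF of each column: regress it (with intercept) on all other columns."""
    X = data.to_numpy(dtype=float)
    out = {}
    for j, name in enumerate(data.columns):
        y = X[:, j]
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        # Treat a numerically perfect fit as an infinite VIF
        out[name] = np.inf if 1.0 - r2 < 1e-10 else 1.0 / (1.0 - r2)
    return pd.Series(out, name="VIF")

# Synthetic data: one numeric feature and one 3-level categorical feature
rng = np.random.default_rng(0)
weight = pd.Series(rng.normal(3000, 500, size=60), name="weight")
cylinders = pd.Series(np.tile([4, 6, 8], 20), name="cylinders")

# All levels kept: the three dummies always sum to 1
full = pd.concat(
    [weight, pd.get_dummies(cylinders, prefix="cylinders", dtype=float)], axis=1
)
# One level dropped: the remaining dummies are no longer perfectly collinear
reduced = pd.concat(
    [weight,
     pd.get_dummies(cylinders, prefix="cylinders", dtype=float, drop_first=True)],
    axis=1,
)

vif_full = vif_series(full)
vif_reduced = vif_series(reduced)
print(vif_full)
print(vif_reduced)
```

In this toy setup the high VIFs on the dummies come from the encoding rather than from the cars themselves, so it may be worth checking how the dummies were generated before dropping columns one by one.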

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange