Extremely high MSE/MAE for Ridge Regression (sklearn) when the label is directly calculated from the features

datascience.stackexchange https://datascience.stackexchange.com/questions/69852

Question

Edit: Removing TransformedTargetRegressor and adding more info as requested.

Edit 2: There were 18K rows where the relation did not hold. I'm sorry :(. After removing those rows, and on @Ben Reiniger's advice, I used LinearRegression and the metrics look much saner. The new metrics are pasted below.
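A minimal sketch of the kind of consistency check that surfaces such rows (assuming X is a pandas DataFrame with the two feature columns shown below and y is the grossProfit Series; illustrative only, not the exact code used):

import numpy as np

# Flag rows where the identity totalRevenue - costOfRevenue = grossProfit does
# not hold; np.isclose tolerates floating-point rounding in the source data.
mask = np.isclose(X["totalRevenue"] - X["costOfRevenue"], y)
print("Rows violating the relation:", (~mask).sum())

# Keep only the consistent rows before fitting
X_clean, y_clean = X[mask], y[mask]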

Original Question:

Given totalRevenue and costOfRevenue, I'm trying to predict grossProfit. Since it's a simple formula, totalRevenue - costOfRevenue = grossProfit, I was expecting the following code to work. Is it a matter of hyperparameter optimization, or have I missed some data cleaning? I have tried all the scalers and other regressors in sklearn, but I don't see any big difference.

# X(107002 rows × 2 columns)
+--------------+---------------+
| totalRevenue | costOfRevenue |
+--------------+---------------+
| 2.256510e+05 | 2.333100e+04  |
| 1.183960e+05 | 2.857000e+04  |
| 2.500000e+05 | 1.693000e+05  |
| 1.750000e+05 | 8.307500e+04  |
| 3.905000e+09 | 1.240000e+09  |
+--------------+---------------+

# y
+--------------+
| 2.023200e+05 |
| 8.982600e+04 |
| 8.070000e+04 |
| 9.192500e+04 |
| 2.665000e+09 |
+--------------+
Name: grossProfit, Length: 107002, dtype: float64

# Training


import numpy as np
import sklearn.metrics  # import the submodule explicitly so sklearn.metrics.* below resolves

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

x_scaler = StandardScaler()

pipe_l = Pipeline([
        ('scaler', x_scaler),
        ('regressor', Ridge())
        ])


regr = pipe_l

regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)

print('R2 score: {0:.2f}'.format(sklearn.metrics.r2_score(y_test, y_pred)))
print('Mean Absolute Error:', sklearn.metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', sklearn.metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(sklearn.metrics.mean_squared_error(y_test, y_pred)))


print("Scaler Mean:",x_scaler.mean_)
print("Scaler Var:", x_scaler.var_)
print("Estimator Coefficient:",regr.steps[1][1].coef_)

Output of the above metrics after training (old metrics, with the 18K rows that did not conform to the relation):

R2 score: 0.69
Mean Absolute Error: 37216342513.01034
Mean Squared Error: 7.601569571667974e+23
Root Mean Squared Error: 871869805169.7842
Scaler Mean: [1.26326695e+13 2.14785735e+14]
Scaler Var: [1.24609190e+31 2.04306993e+32]
Estimator Coefficient: [1.16354874e+15 2.59046205e+09]

Ridge (after removing the 18K bad rows)


R2 score: 1.00
Mean Absolute Error: 15659273.260432156
Mean Squared Error: 8.539990125466045e+16
Root Mean Squared Error: 292232614.97420245
Scaler Mean: [1.57566809e+11 9.62274405e+10]
Scaler Var: [1.20924187e+25 5.95764210e+24]
Estimator Coefficient: [ 3.47663586e+12 -2.44005648e+12]

LinearRegression (after removing the 18K rows)

R2 score: 1.00
Mean Absolute Error: 0.00017393178061611583
Mean Squared Error: 4.68109129068828e-06
Root Mean Squared Error: 0.0021635829752261132
Scaler Mean: [1.57566809e+11 9.62274405e+10]
Scaler Var: [1.20924187e+25 5.95764210e+24]
Estimator Coefficient: [ 3.47741552e+12 -2.44082816e+12]

Solution

(To summarize the comment thread into an answer)

Your original scores:

Mean Absolute Error: 37216342513.01034
Root Mean Squared Error: 871869805169.7842

are based on the original-scale target variable, and at between $10^{10}$ and $10^{12}$ they are at least significantly smaller than the means of the features (and presumably of the target). So these aren't automatically bad scores, although for an exact relationship we should hope for better. Furthermore, an R2 of 0.69 is pretty low, no scale-consciousness needed.
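As a rough illustration of the scale argument (a hypothetical check, assuming the y_test and y_pred from the question's code are in scope; not part of the original post):

import numpy as np
from sklearn.metrics import mean_absolute_error

# Express the MAE relative to the typical magnitude of the target, so a raw
# error of ~3.7e10 can be judged against targets on the order of 1e11-1e12.
relative_mae = mean_absolute_error(y_test, y_pred) / np.mean(np.abs(y_test))
print("MAE as a fraction of the mean |target|: {:.2%}".format(relative_mae))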

That both of the model's coefficients came out positive is the most worrisome point. I'm glad you identified the culprit rows; I don't know how I would have diagnosed that from here.

Your new ridge regression still has "large" errors, but they are significantly smaller than before, and quite small compared to the feature/target scale. And now the coefficients have different signs. (I think if you'd left the TransformedTargetRegressor in, you'd get largely the same results, but with less penalization.)
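For reference, wrapping the same pipeline in a TransformedTargetRegressor would look roughly like this (a sketch of the setup I assume the original code used, not a recommendation):

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Features are scaled inside the pipeline; the target is scaled separately by
# the transformer, and predictions are inverse-transformed back to the original units.
model = TransformedTargetRegressor(
    regressor=Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())]),
    transformer=StandardScaler(),
)
# model.fit(X_train, y_train) and model.predict(X_test) as in the question's code.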

Finally, when such an exact relationship is the truth, it makes sense not to penalize the regression. Your coefficients here are a little bit larger, and the errors drop away to nearly nothing, especially considering the scale of the target.
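Concretely, dropping the penalty just means swapping the estimator in the pipeline (this mirrors the LinearRegression run quoted in the question; X_train and y_train are assumed to come from the same split):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same pipeline, but with unpenalized ordinary least squares; with an exact
# linear relation in the (cleaned) data, the residuals shrink to essentially zero.
pipe_ols = Pipeline([("scaler", StandardScaler()), ("regressor", LinearRegression())])
pipe_ols.fit(X_train, y_train)
print(pipe_ols.named_steps["regressor"].coef_)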

OTHER TIPS

It seems you are using the standard scaler twice: once in your pipeline and once more in the TransformedTargetRegressor. Besides that, you are only fitting the scaler, never actually scaling (i.e. transforming) the inputs.
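To illustrate the fit-versus-transform distinction (a minimal sketch, assuming an X_train like the one in the question): fit only computes the scaling statistics; the data are rescaled only when transform (or fit_transform) is called, which a Pipeline does for you automatically.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean_ and var_; X_train itself is unchanged
X_train_scaled = scaler.transform(X_train)  # this step actually rescales the data

# Inside a Pipeline, calling fit() on the pipeline runs fit_transform on the
# scaler before fitting the regressor, so no manual transform is needed there.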

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange