Question

I'm working with a dataset that has 400 observations, 34 features, and quite a few outliers, some of them extreme. Given the nature of my data, these outliers need to stay in the model.

I started by doing a 75-25 split on my data and leaving those 25% aside.
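
(The train_test call in the code below is just that 75-25 split; a rough sketch of it, where the target column name and the fixed seed are placeholders, not the real ones:)

from sklearn.model_selection import train_test_split

def train_test(df, target_col='target'):
    # plain 75-25 split; 'target' and the seed are placeholder choices
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return train_test_split(X, y, test_size=0.25, random_state=42)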

With the training set, I used GridSearchCV with a RepeatedKFold of 10 folds and 7 repeats, which returned my best_estimator_ results; looking in .cv_results_, that score is the mean_test_score metric, so I called it my "Cross Validation score". Then, with this fitted model, I ran grid.score(X_test, y_test) on the test set and called that my Test score.


import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RepeatedKFold


def rmse(y_true, y_pred):
    # custom RMSE helper (one possible definition; sklearn has no built-in RMSE scorer)
    return np.sqrt(mean_squared_error(y_true, y_pred))


def rf(df, score):

    # train_test is my own helper that performs the 75-25 split described above
    X_train, X_test, y_train, y_test = train_test(df)

    params = {'n_estimators': [400, 700, 1000],
              'max_features': ['sqrt', 'auto'],
              'min_samples_split': [2, 3],
              'min_samples_leaf': [1, 2, 3],
              'max_depth': [50, 100, None],
              'bootstrap': [True, False]}

    # error metrics use greater_is_better=False, so GridSearchCV stores them as negatives
    scorers = {'RMSE': make_scorer(rmse, greater_is_better=False),
               'MAE': make_scorer(mean_absolute_error, greater_is_better=False),
               'R2': make_scorer(r2_score)}

    cv = RepeatedKFold(n_splits=10, n_repeats=7)


    # pass the seed directly: random.seed(42) returns None, so the forest wasn't actually seeded
    grid = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                        param_grid=params,
                        verbose=1,
                        cv=cv,
                        n_jobs=-1,
                        scoring=scorers,
                        refit=score)

    grid = grid.fit(X_train, y_train)

    print('Parameters used:', grid.best_params_)

    # best_score_ is the mean CV score of the refit metric (my "Cross Validation score");
    # the error scorers are stored negated, hence the -1
    if score == 'RMSE':
        print('RMSE score on train:', round(-1*grid.best_score_,4))
        print('RMSE score on test: ', round(-1*grid.score(X_test, y_test),4))

    elif score == 'R2':
        print('R Squared score on train:', round(grid.best_score_,4))
        print('R Squared score on test: ', round(grid.score(X_test, y_test),4))

    elif score == 'MAE':
        print('MAE score on train:', round(-1*grid.best_score_,4))
        print('MAE score on test: ', round(-1*grid.score(X_test, y_test),4))

When I set my metric to RMSE (the most important one), this is what it outputs:

RMSE score on train: 8.489
RMSE score on test: 5.7952

Have I done this correctly? Can I consider this discrepancy acceptable? With Random Forest, for example, if I deliberately ignore the grid search parameters and set min_samples_leaf to something like 10, my RMSE goes all the way up to 12, but the CV score and the test score become very similar. I'm seeing similar results with SVR and MLP algorithms.
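
(To illustrate that manual check, a minimal sketch of what I mean, reusing the train_test and rmse helpers from the code above; min_samples_leaf=10 and the CV setup mirror the grid search, everything else is left at the defaults:)

from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.metrics import make_scorer

X_train, X_test, y_train, y_test = train_test(df)

model = RandomForestRegressor(min_samples_leaf=10, random_state=42)

# CV RMSE on the training set (scores come back negated, hence the minus sign)
cv = RepeatedKFold(n_splits=10, n_repeats=7)
cv_rmse = -cross_val_score(model, X_train, y_train, cv=cv,
                           scoring=make_scorer(rmse, greater_is_better=False)).mean()

# RMSE on the held-out 25%
model.fit(X_train, y_train)
test_rmse = rmse(y_test, model.predict(X_test))

print('CV RMSE:  ', round(cv_rmse, 4))
print('Test RMSE:', round(test_rmse, 4))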

This is part of my thesis, and my supervisor is now telling me I should be using all my data for cross-validation, which I don't think is correct.
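
(If I understand the suggestion, "using all my data for cross-validation" would look something like this sketch, with no held-out test set; X and y here stand for the full feature matrix and target, and the rmse helper is the one from above:)

from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.metrics import make_scorer

cv = RepeatedKFold(n_splits=10, n_repeats=7)
scores = -cross_val_score(RandomForestRegressor(random_state=42), X, y, cv=cv,
                          scoring=make_scorer(rmse, greater_is_better=False))
# only the repeated k-fold estimate gets reported
print('CV RMSE: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))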

My conclusion is that, given the outliers and without more observations, a discrepancy in results is to be expected. However, I don't know whether this conclusion is right or whether I'm doing something wrong here.

Running my model on a somewhat similar dataset with fewer outliers gives scores that are closer to one another:

RMSE score on train: 5.9731
RMSE score on test: 6.9164
