Question

I just want to make sure I am on the right lines, so please correct me if wrong. I am testing which hyperparameters are best for logistic regression on my data X, y, where X is the features and y is the target. X and y are made from my training set. I also have a test set.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# split train into target and features
y = Train['target']
X = Train.drop(['target'], axis=1)
X = pd.get_dummies(X)

# split test data into target and features
y_test = Test['target']
X_test = Test.drop(['target'], axis=1)
X_test = pd.get_dummies(X_test)
# note: dummy columns from Train and Test may differ and need aligning,
# e.g. X_test = X_test.reindex(columns=X.columns, fill_value=0)

logistic = LogisticRegression()  # initialize the model

# create regularization strength search space
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

clf = GridSearchCV(logistic, param_grid=param_grid, cv=5)

best_model = clf.fit(X, y)

# view best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

I will now use these hyperparameters and 'train' the model on my training data. Just so I'm sure: when we say train, do I then take best_model and train it on the whole X data? Then I use X_test, which is my hold-out data, on this newly trained model:

bestLog = best_model.best_estimator_
trained_model = bestLog.fit(X, y)
predicted = trained_model.predict(X_test)

Then use the output above as my final model to test?


Solution

As far as I understand (disclaimer: I'm not very familiar with Python), your approach is correct: the selected hyper-parameters are evaluated on the hold-out test set, which is different from the training set. This way there is no data leakage, and you can evaluate the true performance of your model before applying it to new, unseen data.

For analysis purposes, it can be useful to compare the performance of the best model on X (the training set) and X_test (the hold-out set) in order to check for overfitting.
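A minimal sketch of that overfitting check, using `accuracy_score` on synthetic data as a stand-in for the real Train/Test DataFrames (the data here is generated only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in for the real training and hold-out data
features, target = make_classification(n_samples=500, n_features=10, random_state=0)
X, X_test, y, y_test = train_test_split(features, target, test_size=0.3, random_state=0)

clf = GridSearchCV(LogisticRegression(max_iter=1000),
                   param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
                   cv=5)
best_model = clf.fit(X, y)
bestLog = best_model.best_estimator_  # with refit=True (the default), already retrained on all of X, y

# compare accuracy on the training set vs the hold-out set
train_acc = accuracy_score(y, bestLog.predict(X))
test_acc = accuracy_score(y_test, bestLog.predict(X_test))
print(f'train accuracy:    {train_acc:.3f}')
print(f'hold-out accuracy: {test_acc:.3f}')
# a large gap between the two numbers suggests overfitting
```

Note that with GridSearchCV's default `refit=True`, `best_estimator_` has already been refit on the full training set, so the explicit `bestLog.fit(X, y)` step in the question is redundant but harmless.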

Note that in a case like this, where you directly select the best hyper-parameters, I would consider it acceptable to skip the evaluation on the hold-out set; however, you would then not know the true performance of your model (so, for instance, you would not be able to check whether it overfits). To be clear: I don't think you should do this; it's just a remark to show the difference with and without a hold-out set.

Licensed under: CC-BY-SA with attribution