What is GridSearchCV doing after it finishes evaluating the performance of parameter combinations that takes so long?

datascience.stackexchange https://datascience.stackexchange.com/questions/45810

Question

I'm running GridSearchCV to tune some parameters. For example:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

xgbc = XGBClassifier()

params = {
    'max_depth': [18, 21]
}

gscv = GridSearchCV(
    xgbc,
    params,
    scoring='roc_auc',
    verbose=50,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
)

gscv.fit(df.drop('LAPSED', axis=1), df.LAPSED)
print('best score:', gscv.best_score_, 'best params:', gscv.best_params_)

All fine. Because I've specified some verbosity, it outputs some stuff about what it's doing, like this:

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] max_depth=18 ....................................................
[CV] ........... max_depth=18, score=0.9453140690301272, total= 8.2min
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.3min remaining:    0.0s
[CV] max_depth=18 ....................................................
[CV] ........... max_depth=18, score=0.9444119097669363, total= 7.9min
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 16.3min remaining:    0.0s
[CV] max_depth=21 ....................................................
[CV] ........... max_depth=21, score=0.9454705777130412, total= 8.4min
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 24.8min remaining:    0.0s
[CV] max_depth=21 ....................................................
[CV] ........... max_depth=21, score=0.9443863821843195, total= 8.3min
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 33.2min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 33.2min finished

However, once it has finished running all the folds, it takes a very long time (at least as long as fitting and evaluating one fold for one parameter combination) before the final print('best score:', gscv.best_score_, 'best params:', gscv.best_params_) line returns, even though I could easily compute the best score and parameters by hand from the fold scores it prints during fitting. I presume the algorithm is busy doing something else after it finishes fitting and evaluating the candidate models, but I'm not sure what that might be.
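The "I could calculate that manually" point holds up: picking the winner only requires averaging the per-fold scores of each candidate. A quick sketch using the fold scores from the log above (score values copied from the verbose output):

```python
# Per-candidate fold scores as printed in the verbose log above.
fold_scores = {
    18: [0.9453140690301272, 0.9444119097669363],
    21: [0.9454705777130412, 0.9443863821843195],
}

# Mean score per candidate, then pick the max, mirroring what
# GridSearchCV does internally (it also stores these means in
# cv_results_['mean_test_score']).
mean_scores = {depth: sum(s) / len(s) for depth, s in fold_scores.items()}
best_depth = max(mean_scores, key=mean_scores.get)
print(best_depth, mean_scores[best_depth])
```

This step is essentially free; it cannot account for the extra minutes observed after the last fold finishes.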

n.b. Actually, the thought just occurred to me that this might be time spent retraining the model on the full dataset with the parameters identified as best, so that the fitted estimator is available to the .predict() etc. methods. I'm checking that now by passing refit=False to prevent that from happening, and if that turns out to be it, I'll answer my own question.
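A minimal sketch of that check, using a decision tree on synthetic data as a stand-in for the XGBoost model and the real dataset. With refit=False the search returns as soon as the last fold finishes; best_params_ is still populated for single-metric scoring, but best_estimator_ is not, so the final model has to be refit manually:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=500, random_state=42)

gscv = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [18, 21]},
    scoring='roc_auc',
    refit=False,  # skip the final fit on the full dataset
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
)
gscv.fit(X, y)

# best_params_ is available, best_estimator_ is not; refit manually
# if a usable model is needed afterwards.
final_model = DecisionTreeClassifier(
    random_state=0, **gscv.best_params_
).fit(X, y)
```

The manual refit at the end costs one full fit, which is exactly the work refit=True would have done silently after the search.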

No correct solution

Licensed under: CC-BY-SA with attribution