Track underlying observation when using GridSearchCV and make_scorer

https://datascience.stackexchange.com/questions/57922

02-11-2019

Question

I'm doing a GridSearchCV, and I've defined a custom function (called custom_scorer below) to optimize for. So the setup is like this:

from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(estimator=some_classifier,
                  param_grid=some_grid,
                  cv=5,  # for concreteness
                  scoring=make_scorer(custom_scorer))

gs.fit(training_data, training_y)

This is a binary classification problem. So during the grid search, for each combination of hyperparameters, the custom score is computed on each of the 5 left-out folds after training on the other 4 folds.
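To make the per-fold evaluation concrete, here is a rough sketch of the loop the search performs. It is not scikit-learn's actual implementation. It assumes toy data from make_classification, a LogisticRegression as a stand-in for some_classifier, a one-parameter grid, and a negative Brier score as a placeholder metric; none of these come from my real setup. For a classifier with an integer cv, the splitter is a StratifiedKFold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, StratifiedKFold

# Toy stand-ins (assumptions, not the real data or grid).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
some_grid = {"C": [0.1, 1.0]}
cv = StratifiedKFold(n_splits=5)

fold_scores = {}
for params in ParameterGrid(some_grid):
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        # Train on 4 folds, score on the left-out fold.
        clf = LogisticRegression(max_iter=1000, **params)
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])[:, 1]  # P(class 1)
        # Placeholder metric: negative Brier score on the left-out fold.
        scores.append(-float(np.mean((proba - y[test_idx]) ** 2)))
    fold_scores[params["C"]] = scores
```

In this hand-written loop, test_idx holds exactly the row positions of the left-out fold, which is the information that is hidden once the loop lives inside GridSearchCV.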

custom_scorer is a scalar-valued function with 2 inputs: an array $y$ containing ground truths (i.e., 0's and 1's), and an array $y_{pred}$ containing predicted probabilities (of being 1, the "positive" class):

def custom_scorer(y, y_pred):
    """
    (1) y contains ground truths, but only for the left-out fold
    (2) Similarly, y_pred contains predicted probabilities, but only for the left-out fold
    (3) So y and y_pred are each of length ~len(training_y)/5
    """
    # ... computation of scalar_value from y and y_pred omitted ...
    return scalar_value

But suppose the scalar_value returned by custom_scorer depends not only on $y$ and $y_{pred}$, but also on knowledge of which observations were assigned to the left-out fold. If I have only $y$ and $y_{pred}$ (again: the ground truths and predicted probabilities for the left-out fold, respectively) when custom_scorer is called, I don't know which rows belong to this fold. I need a way to track which rows of training_data are assigned to the left-out fold at the point when custom_scorer is called, e.g. via the indices of the rows.

Any ideas on the easiest way to do this? Please let me know if clarification is needed. Thank you!
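One possible direction, offered as a sketch rather than a confirmed solution: GridSearchCV also accepts a plain callable with the signature scorer(estimator, X, y) as scoring. If training_data is a pandas DataFrame, the X handed to that callable is the left-out slice with its original index intact, so the row labels can be read from X.index. The callable also calls predict_proba itself, which avoids relying on make_scorer, whose default is to pass the output of predict rather than probabilities. The toy data, the LogisticRegression, the grid, the seen_folds log and the placeholder metric below are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data; the DataFrame index stands in for the row identifiers.
X_arr, y_arr = make_classification(n_samples=100, n_features=5, random_state=0)
training_data = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
training_y = pd.Series(y_arr)

seen_folds = []  # hypothetical log of the row labels seen by each scorer call


def index_aware_scorer(estimator, X, y):
    """Callable scorer: gets the fitted estimator and the left-out fold."""
    fold_rows = X.index.to_numpy()  # original row labels of this fold
    seen_folds.append(fold_rows)
    proba = estimator.predict_proba(X)[:, 1]  # P(class 1)
    # Placeholder metric (negative Brier score); a real metric could also
    # use fold_rows, e.g. to look up per-row information.
    return -float(np.mean((proba - np.asarray(y)) ** 2))


gs = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    cv=5,
    scoring=index_aware_scorer,
    n_jobs=1,  # single process, so the seen_folds list is actually filled
)
gs.fit(training_data, training_y)
```

With n_jobs greater than 1 the scorer runs in worker processes, so appending to a module-level list would not be visible in the parent process; in that case the indices would have to be used inside the scorer itself rather than logged.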

No correct solution

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange