I'm doing a GridSearchCV, and I've defined a custom function (called custom_scorer below) to optimize for. So the setup is like this:

gs = GridSearchCV(estimator=some_classifier,
                  param_grid=some_grid,
                  cv=5,  # for concreteness
                  scoring=make_scorer(custom_scorer, needs_proba=True))  # scorer expects probabilities, not labels

gs.fit(training_data, training_y)

This is a binary classification problem. So during the grid search, for each combination of hyperparameters, the custom score is computed on each of the 5 left-out folds after training on the other 4 folds.
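As a side note on where those folds come from (relevant to the question below): for a classifier, passing `cv=5` makes GridSearchCV use an unshuffled `StratifiedKFold(n_splits=5)`, so the same train/test index splits can be regenerated outside the search. A minimal sketch, with made-up `X` and `y` standing in for my actual data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up stand-ins for training_data / training_y
y = np.array([0, 1] * 50)            # 100 binary labels
X = np.arange(200).reshape(100, 2)   # 100 rows of features

# For a classifier, cv=5 in GridSearchCV means an unshuffled StratifiedKFold(5),
# so these splits match the ones generated inside the grid search.
skf = StratifiedKFold(n_splits=5)
fold_of_row = np.empty(len(y), dtype=int)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    fold_of_row[test_idx] = fold  # each row lands in exactly one test fold
```
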

custom_scorer is a scalar-valued function of 2 inputs: an array $y$ containing ground truths (i.e., 0's and 1's), and an array $y_{pred}$ containing predicted probabilities (of being 1, the "positive" class):

def custom_scorer(y, y_pred):
    """
    (1) y contains ground truths, but only for the left-out fold
    (2) Similarly, y_pred contains predicted probabilities, but only for the left-out fold
    (3) So y and y_pred are each of length ~len(training_y)/5
    """
    # ... compute scalar_value from y and y_pred ...
    return scalar_value

But suppose the scalar_value returned by custom_scorer depends not only on $y$ and $y_{pred}$ but also on which observations were assigned to the left-out fold. With only $y$ and $y_{pred}$ in hand (again: the ground truths and predicted probabilities for the left-out fold) when custom_scorer is called, I can't tell which rows belong to that fold. I need a way to track which rows of training_data were assigned to the left-out fold at the point when custom_scorer is called, e.g. their row indices.
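One direction that looks promising: `scoring` also accepts a plain callable with signature `(estimator, X, y)` instead of a `make_scorer` object, and in that form the scorer receives the actual held-out slice of the data. If training_data is a pandas DataFrame, the slice keeps its original row index, so `X.index` identifies exactly which rows are in the fold. A hedged sketch, where `index_aware_scorer`, the toy data, and the placeholder score are all made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-ins for training_data / training_y
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y = pd.Series(rng.integers(0, 2, size=100))

seen_indices = []  # records which rows land in each left-out fold

def index_aware_scorer(estimator, X_fold, y_fold):
    # X_fold is the held-out fold; because it is sliced from a DataFrame,
    # X_fold.index gives the original row labels of the training data.
    seen_indices.append(X_fold.index.to_numpy())
    proba = estimator.predict_proba(X_fold)[:, 1]
    # ... compute the custom scalar from y_fold, proba, and the indices ...
    return float(np.mean((proba > 0.5) == y_fold))  # placeholder score

gs = GridSearchCV(estimator=LogisticRegression(),
                  param_grid={"C": [0.1, 1.0]},
                  cv=5,
                  scoring=index_aware_scorer)
gs.fit(X, y)
```

After fitting, `seen_indices` holds one index array per (candidate, fold) pair, so the fold membership is available inside the scorer itself.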

Any ideas on the easiest way to do this? Please let me know if clarification is needed. Thank you!
