GridSearchCV has a cv parameter that takes a cross-validation object; this must be an iterable that yields a pair of index arrays train_index, test_index. The standard KFold behaves as follows:
>>> from sklearn.cross_validation import KFold
>>> threefold = KFold(n=10, n_folds=3)
>>> for train, test in threefold:
...     print("train: %r" % train)
...     print("test: %r" % test)
...
train: array([4, 5, 6, 7, 8, 9])
test: array([0, 1, 2, 3])
train: array([0, 1, 2, 3, 7, 8, 9])
test: array([4, 5, 6])
train: array([0, 1, 2, 3, 4, 5, 6])
test: array([7, 8, 9])
So you have to mimic this somehow by implementing a class:
import numpy as np

class CustomCV(object):
    def __init__(self, ids, n_folds):
        """Pass an array of phenomenon ids"""
        self.ids = ids
        self.n_folds = n_folds

    def __iter__(self):
        for i in range(self.n_folds):
            # Placeholder: build a boolean mask selecting fold i's training samples
            train = make_a_boolean_mask_for_the_training_set()
            test = np.logical_not(train)
            # Convert the boolean masks into the index arrays that cv expects
            yield np.where(train)[0], np.where(test)[0]
where you'll have to fill in the logic of make_a_boolean_mask_for_the_training_set yourself. If it helps, I have a variant for sequence data online.
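As a rough illustration of what that logic might look like, here is a minimal sketch that keeps all samples sharing a phenomenon id in the same fold. The class name GroupedCV, the round-robin assignment of ids to folds, and the sample ids are assumptions made for this example, not part of scikit-learn; it only needs NumPy, and the splitting rule should be adapted to your data.

```python
import numpy as np


class GroupedCV(object):
    """Hypothetical splitter: every sample with a given id lands in one fold."""

    def __init__(self, ids, n_folds):
        self.ids = np.asarray(ids)
        self.n_folds = n_folds

    def __len__(self):
        # Report the number of (train, test) pairs this object yields.
        return self.n_folds

    def __iter__(self):
        unique_ids = np.unique(self.ids)  # sorted unique ids
        for i in range(self.n_folds):
            # Assumption: ids are dealt to folds round-robin in sorted order.
            test_ids = unique_ids[i::self.n_folds]
            # Boolean mask: True where a sample's id belongs to test fold i.
            test = np.isin(self.ids, test_ids)
            train = np.logical_not(test)
            yield np.where(train)[0], np.where(test)[0]


# Made-up ids: seven samples drawn from four phenomena.
ids = ["a", "a", "b", "c", "c", "c", "d"]
folds = list(GroupedCV(ids, n_folds=2))
for train, test in folds:
    print("train: %r" % train)
    print("test: %r" % test)
```

An instance such as GroupedCV(ids, n_folds=2) could then be passed as the cv argument. Note that the folds can end up with quite different sizes when some ids have many more samples than others.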
Be sure to also set the GridSearchCV parameter iid to False, or each fold's score will be weighted by its test-set size and you'll get skewed results when the folds are unequal.