Is there an option to prevent cross-validation (and GridSearchCV) from randomizing the rows of the dataset?

StackOverflow https://stackoverflow.com/questions/22151294

Question

Does anyone know if there is a way to prevent the grid search function GridSearchCV in scikit-learn from randomizing the records of my dataset?

I have groups of rows that correspond to the same phenomenon, and I would like to randomize on the phenomenon ID rather than on the individual rows. I have already randomized on the phenomenon with SQL; now I would just like GridSearchCV not to re-randomize before separating the dataset into train and test sets.

Example of my dataset:

id time feature1 feature2 feature3 feature4 
A 1 b c s a
A 2 b a s t
A 3 q w o j
B 1 l o j f
B 2 9 k l h
C 1 o k h u
C 2 o k h i
C 3 p j g d
D 1 l l d s
D 2 ...
D 3 ...
D 4 ...
D 5 ...

I don't want rows that share an ID to be split between the training and test datasets.

Is there an option which could help me?

Thank you for your help.

Solution

GridSearchCV has a cv parameter that takes a cross-validation object; this must be an iterable that yields a pair of index arrays train_index, test_index. The standard KFold behaves as follows:

>>> from sklearn.cross_validation import KFold
>>> threefold = KFold(n=10, n_folds=3)
>>> for train, test in threefold:
...    print("train: %r" % train)
...    print("test:  %r" % test)
...     
train: array([4, 5, 6, 7, 8, 9])
test:  array([0, 1, 2, 3])
train: array([0, 1, 2, 3, 7, 8, 9])
test:  array([4, 5, 6])
train: array([0, 1, 2, 3, 4, 5, 6])
test:  array([7, 8, 9])

So you have to mimic this somehow by implementing a class:

import numpy as np

class CustomCV(object):
    def __init__(self, ids, n_folds):
        """Pass an array of phenomenon ids"""
        self.ids = ids
        self.n_folds = n_folds

    def __iter__(self):
        for i in range(self.n_folds):
            # Placeholder: must return a boolean array, True for training rows
            train = make_a_boolean_mask_for_the_training_set()
            test = np.logical_not(train)
            # Convert the boolean masks to integer index arrays
            yield np.where(train)[0], np.where(test)[0]

where you'll have to fill in the logic of make_a_boolean_mask_for_the_training_set yourself. If it helps, I have a variant for sequence data online.
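
One possible way to fill in that logic is sketched below. It is not the answerer's variant: the class name GroupOrderedFolds and its fold-sizing rule are assumptions made for illustration. It uses plain Python lists so it has no dependencies (wrap the results in np.asarray if index arrays are needed). The ids are assigned to folds in order of first appearance, without shuffling, and all rows of one id always land on the same side of a split.

```python
class GroupOrderedFolds:
    """Hypothetical helper: yield (train, test) row-index lists per fold,
    keeping every row of a given id on the same side of the split."""

    def __init__(self, ids, n_folds):
        self.ids = list(ids)
        self.n_folds = n_folds
        # Distinct ids in order of first appearance (no shuffling).
        self.unique_ids = list(dict.fromkeys(self.ids))
        if n_folds > len(self.unique_ids):
            raise ValueError("n_folds cannot exceed the number of distinct ids")

    def __len__(self):
        return self.n_folds

    def __iter__(self):
        n_ids = len(self.unique_ids)
        start = 0
        for i in range(self.n_folds):
            # Contiguous blocks of ids; the first (n_ids % n_folds) folds
            # receive one extra id so that every id is tested exactly once.
            size = n_ids // self.n_folds + (1 if i < n_ids % self.n_folds else 0)
            test_ids = set(self.unique_ids[start:start + size])
            start += size
            test = [row for row, g in enumerate(self.ids) if g in test_ids]
            train = [row for row, g in enumerate(self.ids) if g not in test_ids]
            yield train, test


# The id column of the example dataset from the question.
ids = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D"]
folds = list(GroupOrderedFolds(ids, n_folds=2))
for train, test in folds:
    print("train:", train)
    print("test: ", test)
```

With n_folds=2 the first test fold holds the rows of A and B and the second holds the rows of C and D, so no phenomenon is ever split between training and test data.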

Be sure to also set the GridSearchCV parameter iid to False, or you'll get skewed results.

OTHER TIPS

Configuring something that does several steps at once is hard (and I'm not sure it's possible here). Maybe the specialized GridSearchCV isn't right for you if you want to do something different. So I suggest that you split it up into separate steps, which aren't too complicated:

  1. Split the data for a cross validation to your liking, maybe using one of the sklearn.cross_validation methods.
  2. Do the grid search, maybe using sklearn.grid_search.ParameterGrid.
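
The two steps above could look roughly like the following. This is a minimal, dependency-free sketch rather than scikit-learn code: parameter_grid, manual_grid_search and toy_fit_and_score are hypothetical names, itertools.product stands in for ParameterGrid, and the scoring function is a placeholder for fitting a real estimator on the training rows and scoring it on the test rows.

```python
from itertools import product


def parameter_grid(grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def manual_grid_search(folds, grid, fit_and_score):
    """Return (best_params, best_mean_score) over all combinations.

    folds is a list of (train_indices, test_indices) pairs prepared in
    step 1; fit_and_score(params, train, test) must return a number
    where higher is better.
    """
    best_params, best_score = None, float("-inf")
    for params in parameter_grid(grid):
        scores = [fit_and_score(params, train, test) for train, test in folds]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_fit_and_score(params, train, test):
    # Hypothetical stand-in: a real version would fit a model on the
    # train rows and evaluate it on the test rows.
    return -abs(params["C"] - 1.0)


# Step 1: splits chosen by hand, so nothing is re-randomized.
folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]
# Step 2: exhaustive search over the grid using those fixed splits.
grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
best_params, best_score = manual_grid_search(folds, grid, toy_fit_and_score)
print(best_params, best_score)
```

Because the folds are passed in explicitly, how the rows are grouped and ordered stays entirely under your control.
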
Licensed under: CC-BY-SA with attribution