Is there an option to prevent cross-validation (and GridSearchCV) from randomizing the rows of the dataset?

StackOverflow https://stackoverflow.com/questions/22151294

Question

Does anyone know if there is a way to prevent the grid search function GridSearchCV in scikit-learn from randomizing the records of my dataset?

I have groups of rows that correspond to the same phenomenon, and I would like to randomize on the phenomenon ID rather than on individual rows. I have already randomized on the phenomenon with SQL; now I just want GridSearchCV not to re-randomize before separating the dataset into training and test sets.

Example of my dataset:

id time feature1 feature2 feature3 feature4 
A 1 b c s a
A 2 b a s t
A 3 q w o j
B 1 l o j f
B 2 9 k l h
C 1 o k h u
C 2 o k h i
C 3 p j g d
D 1 l l d s
D 2 ...
D 3 ...
D 4 ...
D 5 ...

I don't want rows that share an ID to be split between the training and test sets.

Is there an option which could help me?

Thank you for your help.

Solution

GridSearchCV has a cv parameter that takes a cross-validation object; this must be an iterable that yields a pair of index arrays train_index, test_index. The standard KFold behaves as follows:

>>> from sklearn.cross_validation import KFold
>>> threefold = KFold(n=10, n_folds=3)
>>> for train, test in threefold:
...    print("train: %r" % train)
...    print("test:  %r" % test)
...     
train: array([4, 5, 6, 7, 8, 9])
test:  array([0, 1, 2, 3])
train: array([0, 1, 2, 3, 7, 8, 9])
test:  array([4, 5, 6])
train: array([0, 1, 2, 3, 4, 5, 6])
test:  array([7, 8, 9])

So you have to mimic this somehow, by implementing a class:

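import numpy as np

class CustomCV(object):
    def __init__(self, ids, n_folds):
        """Pass an array of phenomenon ids"""
        self.ids = ids
        self.n_folds = n_folds

    def __iter__(self):
        for i in range(self.n_folds):
            # Placeholder: build a boolean mask over the rows that is True
            # for the rows belonging to the training set of fold i.
            train = make_a_boolean_mask_for_the_training_set()
            test = np.logical_not(train)
            # Convert the boolean masks to integer index arrays.
            yield np.where(train)[0], np.where(test)[0]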

where you'll have to fill in the logic of make_a_boolean_mask_for_the_training_set yourself. If it helps, I have a variant for sequence data online.
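
One possible way to fill in that logic is sketched below. It is only an illustration under some assumptions: the class name GroupedCV, the choice to cut the ids into contiguous chunks in order of first appearance, and the added __len__ method are not part of the answer above. The idea is to split the unique phenomenon ids, not the rows, so that all rows of one id land on the same side of each split and no shuffling takes place.

```python
import numpy as np


class GroupedCV(object):
    """Yield (train_index, test_index) pairs that keep each id in one fold."""

    def __init__(self, ids, n_folds):
        self.ids = np.asarray(ids)
        self.n_folds = n_folds

    def __len__(self):
        # Assumption: some callers ask the cv object for its number of folds.
        return self.n_folds

    def __iter__(self):
        # Unique ids in order of first appearance, so an ordering prepared
        # upstream (for example in SQL) is kept and nothing is re-shuffled.
        _, first_seen = np.unique(self.ids, return_index=True)
        unique_ids = self.ids[np.sort(first_seen)]
        # Split the ids (not the rows) into n_folds nearly equal chunks.
        for test_ids in np.array_split(unique_ids, self.n_folds):
            test = np.isin(self.ids, test_ids)
            train = np.logical_not(test)
            yield np.where(train)[0], np.where(test)[0]


# The id column from the example dataset in the question.
ids = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D"]
folds = list(GroupedCV(ids, n_folds=2))
for train_index, test_index in folds:
    print("train: %r" % train_index)
    print("test:  %r" % test_index)
```

An instance such as GroupedCV(ids, n_folds=3) could then be handed to GridSearchCV through its cv parameter. Note that folds are balanced by number of ids, not by number of rows, so fold sizes can differ when ids have different row counts.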

Be sure to also set the GridSearchCV parameter iid to False, or you'll get skewed results.

Other tips

Configuring a single call to do several steps at once is hard (and I'm not sure it's possible here). Maybe the specialized GridSearchCV isn't right for you if you want to do something differently. So I suggest splitting the work into separate steps, which aren't too complicated:

  1. Split the data for a cross validation to your liking, maybe using one of the sklearn.cross_validation methods.
  2. Do the grid search, maybe using sklearn.grid_search.ParameterGrid.
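
The two steps could look roughly like the sketch below, which uses only the Python standard library so that it runs on its own. Everything in it is a hypothetical stand-in: group_folds plays the role of step 1, param_grid imitates what ParameterGrid would provide in step 2, and evaluate is a toy scoring function where a real estimator would be fitted on the training rows and scored on the test rows.

```python
from itertools import product


def group_folds(ids, n_folds):
    """Yield (train_rows, test_rows) so that no id is split across the two."""
    unique_ids = list(dict.fromkeys(ids))  # keeps first-appearance order
    for k in range(n_folds):
        test_ids = set(unique_ids[k::n_folds])  # every n_folds-th id
        test = [i for i, g in enumerate(ids) if g in test_ids]
        train = [i for i, g in enumerate(ids) if g not in test_ids]
        yield train, test


def param_grid(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def evaluate(params, train, test):
    """Hypothetical placeholder: a real version fits on `train`, scores on `test`."""
    return -abs(params["C"] - 1.0) - 0.1 * params["degree"]


ids = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D", "D"]
folds = list(group_folds(ids, n_folds=2))      # step 1: split by id
grid = {"C": [0.1, 1.0, 10.0], "degree": [1, 2]}

results = []
for params in param_grid(grid):                # step 2: manual grid search
    scores = [evaluate(params, train, test) for train, test in folds]
    results.append((sum(scores) / len(scores), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params)
```

Because the folds are computed once up front and reused for every parameter combination, all candidates are compared on exactly the same id-respecting splits.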

License: CC-BY-SA with attribution