How to split data into 3 parts in Python - training (70%), validation (15%) and test (15%) - so that each part has a similar target rate?
08-12-2020
Question
I'm working on a company project in which I need to partition the data into 3 parts: train, validation, and test (holdout).
Does anyone know how I can split the data into the 3 parts above so that each part has a similar response variable (target rate), that is, similar accuracy for classification and a similar mean of the response for regression?
I know how to split the data into 3 parts using the train_test_split function from scikit-learn:
from sklearn.model_selection import train_test_split
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)
But this does not give a similar target rate. Can someone help me?
Solution
For classification you can use the stratify parameter:
stratify: array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as the class labels.
See sklearn.model_selection.train_test_split. For example:
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, stratify=labels)
This will ensure the class distribution is similar between train and test data.
(Side note: I have dropped the train_size parameter, since it is determined automatically from test_size.)
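To get the three parts asked for in the question, the stratified split can be applied twice. The following is a minimal sketch, assuming a 70/15/15 target and a binary label; the synthetic data and the variable names (X_rest, y_rest and so on) are illustrative and not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data purely for illustration (roughly 20% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# Step 1: hold out 15% of the data as the test set, stratified on the labels.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Step 2: split the remaining 85% so that validation is about 15% of the full data.
# 0.15 / 0.85 is the share of the remainder that corresponds to 15% overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=y_rest, random_state=42
)

# Report the size and target rate of each part.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, target rate={part.mean():.3f}")
```

Note that the second call stratifies on y_rest, not on the original labels, because the arrays passed to stratify must match the data being split.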
For regression there is, to my knowledge, no current implementation in scikit-learn. But you can find a discussion and a manual implementation here and here with regard to cross-validation.
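One common workaround for regression, which is not necessarily what the linked discussions implement, is to bin the continuous target into quantiles and stratify on the bin ids. The sketch below assumes 10 bins and synthetic data; n_bins, y_bins and bins_rest are hypothetical names chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, skewed continuous target purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.gamma(shape=2.0, scale=10.0, size=1000)

# Bin the target into quantile bins; the bin ids act as pseudo-classes.
n_bins = 10  # assumption: choose so that every bin keeps enough samples
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior edges only
y_bins = np.digitize(y, edges)  # integer bin ids in 0 .. n_bins - 1

# Step 1: hold out 15% as test, stratified on the bins.
# The bins are split alongside the data so they can be reused in step 2.
X_rest, X_test, y_rest, y_test, bins_rest, _ = train_test_split(
    X, y, y_bins, test_size=0.15, stratify=y_bins, random_state=42
)

# Step 2: carve validation (about 15% of the full data) out of the remainder.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=bins_rest, random_state=42
)

# Report the size and mean response of each part.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, mean target={part.mean():.2f}")
```

More bins track the target distribution more closely, but each bin needs at least a few samples for the stratified split to work.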
OTHER TIPS
Splitting off a validation set is not implemented directly in sklearn, but you can do it in two steps:
1) In the first step, split X and y into a train set and a test set.
2) In the second step, split the train set from the previous step into a validation set and a smaller train set.
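from sklearn.model_selection import train_test_split

# Step 1: hold out 15% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=123)
# Step 2: take about 15% of the full data (0.15 / 0.85 of the remainder) as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=123)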
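After a two-step split like this, it may be worth checking that the parts have the intended sizes and comparable target rates. The helper below, summarize_splits, is a hypothetical name introduced here for illustration; the data is synthetic and the split is unstratified, so the rates will only be roughly similar.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def summarize_splits(splits):
    """Map each split name to (share of all rows, mean of the target)."""
    total = sum(len(t) for t in splits.values())
    return {name: (len(t) / total, float(np.mean(t))) for name, t in splits.items()}


# Synthetic binary data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# Two-step split without stratification, aiming at 70/15/15.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=123
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=123
)

summary = summarize_splits({"train": y_train, "val": y_val, "test": y_test})
for name, (share, rate) in summary.items():
    print(f"{name}: share={share:.3f}, target rate={rate:.3f}")
```

If the reported rates differ more than is acceptable, passing stratify to both calls is the usual remedy for classification targets.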