How to split data into 3 parts in Python - training (70%), validation (15%) and test (15%) - so that each part has a similar target rate?

https://datascience.stackexchange.com/questions/67489

08-12-2020

Question

I'm working on a company project in which I need to partition the data into 3 parts: Train, Validation, and Test (holdout).

Does anyone know how I can split the data into these 3 parts so that each part has a similar response variable distribution (target rate): a similar class balance for classification and a similar mean response for regression?

I know how to split data into 3 parts by calling the train_test_split function from scikit-learn twice:

from sklearn.model_selection import train_test_split

x, x_test, y, y_test = train_test_split(xtrain,labels,test_size=0.2,train_size=0.8)
x_train, x_cv, y_train, y_cv = train_test_split(x,y,test_size = 0.25,train_size =0.75)

But this does not give a similar target rate in each part. Can someone help me?

Solution

For classification you can use the stratify parameter:

stratify: array-like or None (default=None)

If not None, data is split in a stratified fashion, using this as the class labels.

See sklearn.model_selection.train_test_split. For example:

x, x_test, y, y_test = train_test_split(xtrain,labels,test_size=0.2, stratify=labels)

This ensures that the class distribution is similar in the train and test data. (Side note: I dropped the train_size parameter, since it is determined automatically from test_size.)
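
To get all three parts from the question (70% / 15% / 15%), the stratified call can be applied twice. The following is a minimal sketch, not a definitive recipe: the synthetic data, the variable names, and the random_state values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)  # roughly 20% positives

# Step 1: hold out 15% of the data as the test set, stratified on the labels.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Step 2: take 15% of the ORIGINAL data as the validation set.
# The remainder is 85% of the data, so the fraction to request is 0.15 / 0.85.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=y_rest, random_state=42
)

# Compare the size and target rate of each part.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, target rate={part.mean():.3f}")
```

Because of rounding inside train_test_split, the validation and test sets may differ in size by one sample.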

For regression there is, to my knowledge, no built-in implementation in scikit-learn. But you can find a discussion and a manual implementation here and here, in the context of cross-validation.
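
One common workaround for a continuous target is to bin it into quantiles and stratify on the bin index. The sketch below illustrates that idea only; it is not the implementation referred to above, and the synthetic data, the choice of 10 bins, and all variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with a skewed continuous target (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.gamma(shape=2.0, scale=10.0, size=1000)

# Bin the target into 10 quantile bins and use the bin index as a pseudo-class.
n_bins = 10  # assumed value; fewer bins may be needed for small data sets
edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior edges
y_bins = np.digitize(y, edges)  # integers in 0 .. n_bins - 1

# Step 1: hold out 15% as the test set, stratified on the bins.
# Passing y_bins as a third array keeps the bins aligned with the remainder.
X_rest, X_test, y_rest, y_test, bins_rest, _ = train_test_split(
    X, y, y_bins, test_size=0.15, stratify=y_bins, random_state=42
)

# Step 2: 15% of the original data (0.15 / 0.85 of the remainder) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=bins_rest, random_state=42
)

# Compare the size and mean response of each part.
for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(part)}, mean target={part.mean():.2f}")
```

Stratifying on bins makes the target distributions similar, not identical, so the means will still differ somewhat, especially when the target has a heavy tail.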

OTHER TIPS

scikit-learn has no single function for a three-way train/validation/test split, but you can get one by calling train_test_split twice:

1) First, split X and y into a training set and a test set.

2) Then, split the training set from the previous step into a validation set and a smaller training set.

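from sklearn.model_selection import train_test_split

# Step 1: hold out 15% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=123)

# Step 2: take 15% of the original data (0.15 / 0.85 of the remainder) as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=123)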

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange