Question

I am trying to train an XGBoost model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43, stratify=y)

When I use train_test_split and pass X_train, y_train to fit and X_test, y_test as the eval_set, the model seems to be very good.
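For reference, the fit call in the first setup looks roughly like this (exact parameters omitted; eval_metric here is just an example):

from xgboost import XGBClassifier

model = XGBClassifier(eval_metric="logloss")  # placeholder parameters
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)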

Confusion matrix example: [image]

But when I manually split the dataset:

splitValidationIndex = round(dataset.shape[0] * 0.6)  # end of the 60% training chunk
splitTestIndex = round(dataset.shape[0] * 0.8)        # end of the following 20% validation chunk
X_train = X[:splitValidationIndex]
y_train = y[:splitValidationIndex]

Pass them to fit.

X_val = X[splitValidationIndex:splitTestIndex]  # next 20% for validation
y_val = y[splitValidationIndex:splitTestIndex]

Pass them to eval_set.

X_test = X[splitTestIndex:]  # final 20% for testing
y_test = y[splitTestIndex:]

Then check the model's predictions on that.

This produced a much worse model.
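Putting the second setup together, the training step looks roughly like this (same placeholder parameters as above):

model = XGBClassifier(eval_metric="logloss")  # placeholder parameters
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
y_pred = model.predict(X_test)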

Example: [image]

What am I missing/doing wrong?


Solution

Everything looks correct, but there are some details you should know about.

train_test_split splits arrays or matrices into random train and test subsets. If you do not specify random_state, you get a different result each time you run your code: the train and test sets contain different rows on every run.

If you fix a specific value, like random_state=42, the train and test sets keep the same rows across runs.
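A toy illustration of this behavior (synthetic data, not yours):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0, 1] * 10)

# same seed -> identical split on every run
a1, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
a2, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(a1, a2))  # True

# no seed -> the split can change between runs
b1, _, _, _ = train_test_split(X, y, test_size=0.2)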

So in the first way of splitting, the model learns on a specific (always the same) chunk of data, which can give good results, while the second way trains it on a different chunk, so the results can be worse. (In fact, the good results of the first split may just be overfitting to that particular chunk.)

In general, whichever way you choose, you should use cross-validation to build a well-performing model.
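For example, with stratified k-fold cross-validation (a minimal sketch, assuming the X and y from your question and placeholder model parameters):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

model = XGBClassifier(eval_metric="logloss")  # placeholder parameters
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# the mean and spread over 5 folds give a more stable estimate
# than any single train/test split
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())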

OTHER TIPS

Three things to check:

  1. train_test_split by default splits the dataset randomly, while your manual split takes contiguous chunks of indices. If your data is not shuffled and there is some meaning to the index order (e.g. the rows are time-ordered, especially if there is concept drift in your data), then these two splits can produce very different results.

  2. You have set stratify=y, so train_test_split will split the dataset to have roughly equal proportions of each class in each fold. As above, if your dataset has a higher concentration of one class early on than late, this will affect the testing significantly; see the sketch after this list for a quick check.

  3. Maybe your data and/or model is fairly unstable, so that different splits just produce rather different results; in this case, rerunning everything after a new train_test_split (without fixing the seed) should also give noticeably different results.
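As a concrete check for points 1 and 2, compare the class proportions in your contiguous chunks against the full dataset (a sketch, assuming y is a 1-D array of labels):

import numpy as np

n = len(y)
i_val, i_test = round(n * 0.6), round(n * 0.8)

# large differences between the chunks and "all" suggest the index order
# carries information (e.g. time ordering or sorted labels)
for name, part in [("train", y[:i_val]), ("val", y[i_val:i_test]),
                   ("test", y[i_test:]), ("all", y)]:
    labels, counts = np.unique(part, return_counts=True)
    print(name, dict(zip(labels, (counts / len(part)).round(3))))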

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange