Question

Keras gives users the option, while fitting a model, to split the data into train/test samples using the parameter "validation_split".

Example:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(3, activation='relu'))

# compile the model (loss/optimizer chosen for illustration)
model.compile(optimizer='adam', loss='mse')

model.fit(X_train, y_train, validation_split=0.2)

However, my intuition suggests that using validation_split (as opposed to creating train/test samples before fitting the model) will cause overfitting, since even though validation_split splits the data into train and test batches at each epoch, the overall effect is that the entire dataset is 'seen' by the model.

I was wondering if:

  1. my intuition is correct

  2. assuming that (1) is true, whether there are any circumstances where using the EarlyStopping() callback together with validation_split would be better than splitting the data into train/test sets before fitting the model


Solution

The validation_split parameter splits the data fed into .fit() into a train set and a validation set, and the split is fixed: there is no re-mixing of the two sets between epochs. So in terms of splitting, it behaves much like sklearn's train_test_split(), the main difference being that Keras splits by index without shuffling (with validation_split = 0.2, the first 80% of samples are used for training and the last 20% for validation), whereas train_test_split() shuffles by default.
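
As a minimal sketch of that index-based behaviour (the toy arrays and the 0.2 fraction are purely for illustration), the same split can be reproduced by slicing the data yourself:

import numpy as np

X = np.arange(10).reshape(10, 1).astype(float)
y = np.arange(10).astype(float)

# Keras with validation_split=0.2 trains on the first 80% of indices
# and validates on the last 20% -- equivalent to slicing manually:
split = int(len(X) * 0.8)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

print(X_val.ravel())  # [8. 9.] -- always the tail of the data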

So, in principle, it should not cause overfitting.

However, apparent overfitting can often arise from how the model is evaluated. What I've seen many people do is the following:

# use previously created model
model.fit(X, y, validation_split=0.2)

# evaluates on ALL of X, including the 80% the model was trained on
model.evaluate(X, y)

Here, the evaluation score is misleadingly optimistic because Keras doesn't 'remember' the split it used during training: .evaluate() is run on the full dataset, including the 80% of samples the model was fitted on.
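
If you do want to score only the held-out portion after training this way, one option (a sketch, assuming the data was not shuffled before .fit() and validation_split = 0.2) is to slice off the same tail that Keras used:

# recover the slice Keras held out (the last 20% of the samples)
split = int(len(X) * (1 - 0.2))
model.evaluate(X[split:], y[split:])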

If you are using validation_split for visualisation purposes, an alternative would be to do the following:

from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
callback = EarlyStopping(monitor='val_loss', patience=3)  # stop when val loss stalls
model.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=[callback])
model.evaluate(X_test, y_test)

This way, you still get the validation curves while the model is training, but at the end, when using evaluate (or predict), you are predicting on data that is previously unseen from a training point of view.

As for EarlyStopping(), it can be used here the same way it would be used with validation_split.
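
For completeness, a minimal sketch of the validation_split variant (the monitor, patience, and restore_best_weights values are illustrative choices, not from the original answer):

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])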

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange