Why can't I overfit this dataset with my neural network?

https://datascience.stackexchange.com/questions/77662

12-12-2020

Question

I have read that if a model is complex enough and is trained for enough epochs, it should at some point overfit the dataset. However, I implemented a simple neural network in Keras, and its validation loss never seems to go up:

import tensorflow as tf
from tensorflow import keras
import numpy as np
import random
from sklearn import datasets, preprocessing
import matplotlib.pyplot as plt


# import and scale
dataset = datasets.load_boston()
X = dataset.data
y = dataset.target
X = preprocessing.scale(X)
y = y.reshape((y.shape[0], 1))

# shuffle
shuffle_indices = list(range(X.shape[0]))
random.shuffle(shuffle_indices)
X = X[shuffle_indices]
y = y[shuffle_indices]

# train-validation split (70% train, 30% validation)
split = int(X.shape[0] * 0.7)
X_train, y_train = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

# define and fit model
model = keras.Sequential([keras.layers.Dense(X.shape[1], use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(128, use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(128, use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(128, use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(128, use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(128, use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(128, use_bias=True, activation="sigmoid"),
                              keras.layers.Dense(y.shape[1])
                         ])
model.compile(optimizer=tf.keras.optimizers.SGD(
    learning_rate=0.0001
), loss='MeanSquaredError')

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=500, verbose=1)

# plot data
plt.plot(range(1, len(model.history.history['loss']) + 1), model.history.history['loss'], label='Train Set Cost')
plt.plot(range(1, len(model.history.history['val_loss']) + 1), model.history.history['val_loss'], label='Validation Set Cost')
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()

The model is a simple dense neural network with mean squared error as its loss function and gradient descent as its optimizer. I tried making the network deeper, but the validation loss just keeps decreasing until it plateaus.

Solution

It is usually good to start with a small model, because you can then evaluate the contribution of each layer you add. Also, the Boston dataset is popular, so there are several tutorials showing good neural network architectures, like this one. Concerning your model, here are some notes.

The sigmoid activations are likely to worsen results, since the sigmoid function compresses values into the range (0, 1), while you are trying to predict outputs between 5 and 50.
Instead of sigmoid, you can use the ReLU activation, which has better convergence properties for inner layers.
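You can standardize your target data to reduce its variance and control its mean; this usually improves regression models a lot. Standardized targets can be negative, so keep the final layer linear in that case; a sigmoid activation on the final layer is only a good choice if you rescale the targets to the range [0, 1] instead.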
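As a rough illustration of the target-standardization note, here is a minimal NumPy sketch. The helper names (`fit_target_scaler`, `standardize`, `unstandardize`) and the sample prices are made up for this example; the only assumption carried over from the question is that `y` is a column vector split into training and validation parts. The statistics are computed on the training targets only, so nothing from the validation set leaks into training.

```python
import numpy as np


def fit_target_scaler(y_train):
    """Return the mean and standard deviation of the training targets."""
    mean = float(np.mean(y_train))
    std = float(np.std(y_train))
    if std == 0.0:
        std = 1.0  # avoid division by zero for constant targets
    return mean, std


def standardize(y, mean, std):
    """Map targets to zero mean and unit variance (w.r.t. the training set)."""
    return (np.asarray(y, dtype=float) - mean) / std


def unstandardize(y_scaled, mean, std):
    """Map scaled values (e.g. model predictions) back to original units."""
    return np.asarray(y_scaled, dtype=float) * std + mean


# Made-up target values, shaped like the column vector in the question.
y_train = np.array([[24.0], [21.6], [34.7], [33.4], [36.2]])
y_val = np.array([[28.7], [22.9]])

mean, std = fit_target_scaler(y_train)
y_train_scaled = standardize(y_train, mean, std)
y_val_scaled = standardize(y_val, mean, std)
```

The model would then be fitted on `y_train_scaled`, and its predictions passed through `unstandardize` before being compared with the original targets.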

OTHER TIPS

This is an interesting question: why is a model as complex as the one above not overfitting? (It would also be interesting to hear why you want to achieve this.)

Firstly, to make sure we are on the same page: overfitting is typically seen when the training loss decreases (accuracy increases) while the validation loss stays the same or increases. So it is important to note that overfitting becomes visible by comparing the trajectories of the training and validation losses (accuracies, etc.).
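One way to make that comparison concrete is sketched below. It takes two per-epoch loss lists, such as the `loss` and `val_loss` entries of the Keras history in the question, and reports the epoch after which the validation loss stops improving while the training loss keeps falling. The function name `overfitting_onset` and the `patience` parameter are hypothetical choices for this example, and the loss values are invented.

```python
def overfitting_onset(train_loss, val_loss, patience=10):
    """Return the 1-based epoch of the best validation loss if the
    validation loss did not improve for at least `patience` later epochs
    while the training loss kept decreasing; otherwise return None."""
    if not val_loss or len(train_loss) != len(val_loss):
        raise ValueError("loss lists must be non-empty and of equal length")

    # Index of the (first) lowest validation loss.
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_after_best = len(val_loss) - 1 - best

    # Too few epochs after the minimum: the curve may still be improving.
    if epochs_after_best < patience:
        return None

    # Training loss went further down after the validation minimum.
    if train_loss[-1] < train_loss[best]:
        return best + 1
    return None


# Invented curves: training loss keeps falling, validation loss turns up.
train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_overfit = [1.1, 0.9, 0.8, 0.85, 0.9, 1.0]
val_still_improving = [1.1, 0.9, 0.8, 0.7, 0.6, 0.5]

onset = overfitting_onset(train, val_overfit, patience=2)
no_onset = overfitting_onset(train, val_still_improving, patience=2)
```

With these invented curves, `onset` is epoch 3 and `no_onset` is `None`, which matches the situation in the question, where the validation loss never turns upward.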

In response to your question: overfitting happens when the model adjusts its parameters so that it fits the training examples closely; examples that are similar to the training examples would then, in theory, still be predicted well. Overfitting really becomes a problem for inputs whose values go beyond the range covered by the training examples.

Therefore, one possible reason why there is no explicit sign of overfitting could be that the validation data is very similar to the training data, such that, for each dimension, the validation values fall within the range seen in the training set. It might be worth checking this by comparing the feature distributions of the training and validation sets.
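A simple version of that check could look like the sketch below: for each feature, it computes the fraction of validation values that lie inside the minimum-maximum range of the training set. The function name `fraction_within_train_range` is invented for this example, and random data stands in for the real features; with the arrays from the question you would pass `X_train` and `X_val` directly.

```python
import numpy as np


def fraction_within_train_range(X_train, X_val):
    """For each feature (column), return the fraction of validation values
    that fall inside [min, max] of that feature in the training set."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    inside = (X_val >= lo) & (X_val <= hi)
    return inside.mean(axis=0)


# Random stand-in data, split 70/30 like the arrays in the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
split = int(X.shape[0] * 0.7)
X_train, X_val = X[:split], X[split:]

coverage = fraction_within_train_range(X_train, X_val)
for feature_index, frac in enumerate(coverage):
    print(f"feature {feature_index}: {frac:.0%} of validation values in training range")
```

Values close to 100% for every feature would support the explanation above. This only compares ranges, so comparing per-feature histograms of the two sets gives a fuller picture.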

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange