Question

As background, I am using a deep neural network built with Keras to classify inputs into 5 categories.

The current structure of the network is:

  • Input layer (~450 nodes)
  • Dense layer (750 nodes)
  • Dropout layer (750 nodes, dropout rate = 0.5)
  • Dense layer (5 nodes)
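
In Keras terms, the model looks roughly like this (the activations, optimizer and loss shown are representative assumptions, not my exact configuration):

    # Minimal sketch of the architecture described above; the input width,
    # activations, optimizer and loss are assumptions.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Input(shape=(450,)),             # ~450 input features
        layers.Dense(750, activation="relu"),
        layers.Dropout(0.5),                    # dropout on the 750-node layer
        layers.Dense(5, activation="softmax"),  # 5 output categories
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])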

The issue I'm having is one of overfitting. My model performs well on the held-out test set (a proportion of my training set), with accuracy sitting right around 99%. However, when I apply the model to unlabelled data, it is only able to classify ~67% of observations into any category, before even considering the correctness of those classifications!

I think the issue may be around my feature and training set generation process. I generated the training set using a rules-based string matching method. This generated a training set of around 3.6 million observations (10% of population).

However, one of the largest features in my input layer is an embedding of the same text used to generate the training set. Therefore, the words matched to generate the training set are also embedded and used as features. It is worth noting that the text is around 140 characters per observation and that I matched bigrams from it (so there is other information in the text that would be useful as a feature).

I would remove this feature altogether; however, it is the richest information associated with each observation.

Is there a way to solve this without removing that feature altogether?

Hope this makes sense and happy to provide more clarification.


Simplified explanation:

  • My model performs well on my training and test sets.
  • Performs badly on unlabelled data.
  • Each observation is associated with a block of text.
  • To label my training set I used string matching on that text.
  • The text is also a feature (embedding).
  • Is this causing my poor performance on unlabelled data (i.e. is the model learning those string matches)?
  • If so what can I do?

EDIT: Also happy to hear if you think the issue is something else.


Solution

I would say that there are no "problems" in the sense that what is happening is to be expected.

First of all, here is a key ML reminder which somehow often gets lost:

  • Performing well on the test set is pointless if that test set is not representative (i.e. is not similar to unlabeled instances)
  • Adding more training data does not help if that training data isn't covering new situations

You say you created your training set using string-matching rules; from your comments, I assume these are along the lines of: if the text contains "mortgage repayment", label the observation "housing".

Since your model takes bigrams as input, it is not surprising that it found a way to reverse-engineer which string-matching rules you used.
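
As a quick sanity check, you can compare how often your rules fire on the labeled text versus the unlabeled text; if the model has effectively memorized the rules, the fraction of unlabeled observations it can classify will track this coverage. A rough sketch (the rules and example texts below are placeholders, not your actual data):

    import re

    RULES = {  # placeholder rules -- substitute your own
        "housing": re.compile(r"mortgage repayment", re.IGNORECASE),
        "food":    re.compile(r"grocery store", re.IGNORECASE),
    }

    def rule_coverage(texts):
        """Fraction of texts matched by at least one labeling rule."""
        hits = sum(any(p.search(t) for p in RULES.values()) for t in texts)
        return hits / len(texts)

    labeled   = ["MORTGAGE REPAYMENT 123 MAIN ST", "GROCERY STORE #42"]   # toy data
    unlabeled = ["coffee shop downtown", "grocery store #7", "gym fee"]   # toy data

    print(f"rule coverage, labeled:   {rule_coverage(labeled):.0%}")   # 100% by construction
    print(f"rule coverage, unlabeled: {rule_coverage(unlabeled):.0%}")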

To improve your model, I would look at the mistakes it currently makes on unlabeled data; this should provide you with training instances that do not fit your string-matching rules.
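
For the error analysis itself, one simple starting point is to score the unlabeled set and inspect the observations the model is least confident about; these are good candidates for manual labeling and for extending the training set beyond the rules. A sketch (the random array below stands in for your model's softmax outputs so it runs on its own):

    import numpy as np

    # In practice `probs` would come from model.predict(X_unlabeled);
    # random values stand in here so the sketch is self-contained.
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(5), size=1000)  # fake softmax outputs, shape (1000, 5)

    confidence = probs.max(axis=1)                # top-class probability per observation
    uncertain_idx = np.argsort(confidence)[:100]  # the 100 least-confident observations
    print(uncertain_idx[:10])                     # review, label, and fold back into training

Observations that none of your string-matching rules matched are especially worth sampling here, since by construction they are absent from your training set.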

Licensed under: CC-BY-SA with attribution