How to deal with a binary classification problem, where the instances in the negative class are very similar? [duplicate]

datascience.stackexchange https://datascience.stackexchange.com/questions/86155

Question

Let's say one wants to detect whether a picture of a fixed size contains a cat or not. But as a dataset, you have 10,000 pictures of cats and 30,000 pictures which don't contain a cat but are very similar to each other. For example, let's assume the 30,000 pictures in the "not cat" class contain only pictures of one or two kinds of spiders.

When training a CNN, you will find that it achieves a high score on the test set (here, a high score means an almost fully diagonal confusion matrix), but when you use the CNN in the real world, you find that almost everything gets classified as a cat.

Why does the network generalize badly in this case? Even if the dataset doesn't represent the kind of data the CNN would see in the real world, shouldn't it be easy for the CNN to say, "I have seen 10,000 examples of cats, therefore anything which doesn't look like a cat is not a cat"?

How would one deal with this problem (besides gathering more data)?


Solution

The CNN in this case does not learn what a cat is, but rather what differentiates an image with a cat from one without a cat.
If all of your "no-cat" images contain spiders, the CNN can converge simply by detecting the spiders: images with spiders belong to "no-cat" and all others belong to "cat". That explains why you get such a good confusion matrix on the test data and such poor performance in the real world.
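A quick way to test this hypothesis is to measure the false-positive rate on negatives that contain no spiders. Below is a minimal diagnostic sketch in PyTorch; `model` and the spider-free negative loader are hypothetical stand-ins for your own trained network and data.

```python
# Minimal diagnostic sketch (hypothetical names): run the trained classifier
# on negative images that contain no spiders. If the shortcut hypothesis is
# right, the false-positive ("cat") rate will be high.
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def false_positive_rate(model: torch.nn.Module,
                        ood_negatives: DataLoader,
                        device: str = "cpu") -> float:
    """Fraction of true negatives (no cat, no spider) classified as 'cat'.

    Assumes the model outputs one logit per image, with logit > 0 meaning 'cat'.
    """
    model.eval().to(device)
    false_positives, total = 0, 0
    for images, _ in ood_negatives:  # every label in this loader is 'no cat'
        logits = model(images.to(device)).squeeze(1)
        false_positives += (logits > 0).sum().item()
        total += images.size(0)
    return false_positives / max(total, 1)

# Usage (hypothetical loader of spider-free negatives):
# fpr = false_positive_rate(model, spider_free_negative_loader)
# print(f"False-positive rate on spider-free negatives: {fpr:.1%}")
```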

You have some options to avoid such a situation:

- Diversify the "no-cat" class: collect negatives covering many different subjects, not just one or two kinds of spiders, so that the only consistent difference between the two classes is the cat itself.
- Mine hard negatives: once the model is deployed, feed its real-world false positives back into the training set as "no-cat" examples and retrain.
- Reframe the task as one-class classification / novelty detection: model what cats look like and reject everything that falls outside that distribution (see the sketch below).
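As an illustration of the last option, here is a hedged sketch using scikit-learn's `OneClassSVM`, which is fit on the positive class only. The random feature arrays are placeholders for embeddings you would extract from your images with a pretrained CNN.

```python
# One-class sketch: fit a novelty detector on cat features only, so anything
# far from the cat distribution is rejected, regardless of what the training
# negatives happened to contain. Feature extraction (e.g. embeddings from a
# pretrained CNN) is assumed to happen elsewhere; random arrays stand in here.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-ins for image embeddings (replace with real features).
cat_features = rng.normal(loc=0.0, scale=1.0, size=(1000, 128))  # training cats
new_features = rng.normal(loc=3.0, scale=1.0, size=(10, 128))    # unseen, non-cat-like

# Fit only on the positive class; 'nu' bounds the fraction of training cats
# allowed to fall outside the learned boundary.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
detector.fit(cat_features)

# +1 = looks like a cat, -1 = does not look like a cat.
print(detector.predict(new_features))
```

This matches the intuition in the question: instead of asking "cat or spider?", the detector asks "does this look like the cats I have seen?", so a homogeneous negative set can no longer act as a shortcut.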

Hope it helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange