How to deal with a binary classification problem, where the instances in the negative class are very similar? [duplicate]

datascience.stackexchange https://datascience.stackexchange.com/questions/86155

Question

Let's say one wants to detect whether a picture of a fixed size contains a cat or not. But as a dataset, you have 10,000 pictures of cats and 30,000 pictures which don't contain a cat but are very similar to each other. For example, let's assume the 30,000 pictures in the "not cat" class contain only pictures of one or two kinds of spiders.

When training a CNN, you will find that it achieves a high score on the test set (here, a high score means an almost fully diagonal confusion matrix), but when you use the CNN in the real world, you find that almost everything gets classified as a cat.

Why does the network generalize badly in this case? Even if the dataset doesn't represent the kind of data the CNN would see in the real world, shouldn't it be easy for the CNN to say, "I have seen 10,000 examples of cats, therefore anything which doesn't look like a cat is not a cat"?

How would one deal with this problem (besides gathering more data)?


Solution

The CNN in this case does not learn what a cat is, but rather what differentiates an image with a cat from one without a cat.
If all of your "no-cat" images contain spiders, the CNN can converge simply by detecting the spiders: images with spiders belong to "no-cat" and all others belong to "cat". That explains why you get such a good confusion matrix on the test data and such poor performance in the real world.
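A quick way to test this hypothesis is to measure the false-positive rate on negatives that contain no spiders. Below is a minimal diagnostic sketch in PyTorch; `model` and the spider-free negative loader are hypothetical stand-ins for your own trained network and data.

```python
# Minimal diagnostic sketch (hypothetical names): run the trained classifier
# on negative images that contain no spiders. If the shortcut hypothesis is
# right, the false-positive ("cat") rate will be high.
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def false_positive_rate(model: torch.nn.Module,
                        ood_negatives: DataLoader,
                        device: str = "cpu") -> float:
    """Fraction of true negatives (no cat, no spider) classified as 'cat'.

    Assumes the model outputs one logit per image, with logit > 0 meaning 'cat'.
    """
    model.eval().to(device)
    false_positives, total = 0, 0
    for images, _ in ood_negatives:  # every label in this loader is 'no cat'
        logits = model(images.to(device)).squeeze(1)
        false_positives += (logits > 0).sum().item()
        total += images.size(0)
    return false_positives / max(total, 1)

# Usage (hypothetical loader of spider-free negatives):
# fpr = false_positive_rate(model, spider_free_negative_loader)
# print(f"False-positive rate on spider-free negatives: {fpr:.1%}")
```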

You have some options to avoid such a situation:

- Diversify the "no-cat" class: collect negatives covering many different subjects, not just one or two kinds of spiders, so that the only consistent difference between the two classes is the cat itself.
- Mine hard negatives: once the model is deployed, feed its real-world false positives back into the training set as "no-cat" examples and retrain.
- Reframe the task as one-class classification / novelty detection: model what cats look like and reject everything that falls outside that distribution (see the sketch below).
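As an illustration of the last option, here is a hedged sketch using scikit-learn's `OneClassSVM`, which is fit on the positive class only. The random feature arrays are placeholders for embeddings you would extract from your images with a pretrained CNN.

```python
# One-class sketch: fit a novelty detector on cat features only, so anything
# far from the cat distribution is rejected, regardless of what the training
# negatives happened to contain. Feature extraction (e.g. embeddings from a
# pretrained CNN) is assumed to happen elsewhere; random arrays stand in here.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Stand-ins for image embeddings (replace with real features).
cat_features = rng.normal(loc=0.0, scale=1.0, size=(1000, 128))  # training cats
new_features = rng.normal(loc=3.0, scale=1.0, size=(10, 128))    # unseen, non-cat-like

# Fit only on the positive class; 'nu' bounds the fraction of training cats
# allowed to fall outside the learned boundary.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
detector.fit(cat_features)

# +1 = looks like a cat, -1 = does not look like a cat.
print(detector.predict(new_features))
```

This matches the intuition in the question: instead of asking "cat or spider?", the detector asks "does this look like the cats I have seen?", so a homogeneous negative set can no longer act as a shortcut.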

Hope it helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange