How do I deal with images that are not consistent with the class they belong to in an image classification problem with a CNN?

https://datascience.stackexchange.com/questions/86543

17-12-2020

Question

I am really new to neural networks and to machine learning in general, and I have been given a dataset composed of images for performing multi-class image classification with a CNN.

The images were already divided into classes, and looking at them I have noticed that some are completely different from the class they belong to. For example, if I have a class Fruits with images of fruits, the folder for this class also contains some pictures of cars, people, and so on, which of course are not fruits and do not belong to any other class in the classification problem either.

This causes problems when I train my CNN and results in low accuracy; in fact, I cannot get above 0.5.

How do I deal with images that are not consistent with the class they belong to?

Solution

There are different ways of doing this, but the underlying idea is the same: you need to clean your dataset.

You could go through it manually and separate the images, but this is extremely slow if you're dealing with a large dataset.
A faster but less robust way is to use principal component analysis (PCA):
- Step 1: Apply PCA to reduce the dimensionality to a 2D or 3D space.
- Step 2: Plot the result and see if there are any clusters. Usually, images which are different fall into different clusters.
- Step 3: Cluster the projected points with an algorithm that produces convex clusters, like K-means.
- Step 4: Store the images belonging to each cluster in a separate folder.
- Step 5: Go through the folders and do the necessary cleaning.
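
The steps above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions, not the answerer's code: it assumes NumPy, scikit-learn, and Pillow are installed, and the function names, the 32x32 resize, and the default of two clusters are illustrative choices that would need tuning for a real dataset.

```python
import shutil
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def load_image_vectors(paths, size=(32, 32)):
    """Load each image, shrink it, and flatten it into one row of pixel values."""
    # Pillow is imported here so the clustering helpers work without it.
    from PIL import Image

    rows = []
    for path in paths:
        with Image.open(path) as img:
            pixels = np.asarray(img.convert("RGB").resize(size), dtype=np.float32)
        rows.append(pixels.ravel() / 255.0)
    return np.stack(rows)


def cluster_vectors(vectors, n_components=2, n_clusters=2, seed=0):
    """Steps 1 and 3: project with PCA, then cluster the projection with K-means."""
    projected = PCA(n_components=n_components, random_state=seed).fit_transform(vectors)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(projected)
    return projected, labels


def copy_into_cluster_folders(paths, labels, out_dir):
    """Step 4: copy every image into a folder named after its cluster."""
    for path, label in zip(paths, labels):
        target = Path(out_dir) / f"cluster_{label}"
        target.mkdir(parents=True, exist_ok=True)
        # Files with the same name from different source folders would overwrite each other.
        shutil.copy2(path, target / Path(path).name)
```

A possible way to use it for one class folder (the folder name `Fruits` and the `.jpg` extension are assumptions): collect `paths = sorted(Path("Fruits").glob("*.jpg"))`, call `projected, labels = cluster_vectors(load_image_vectors(paths))`, scatter-plot `projected` for step 2, and then call `copy_into_cluster_folders(paths, labels, "Fruits_clusters")` before reviewing the folders by hand. Raw pixels are a crude feature, so the clusters are only a starting point for manual cleaning.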

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange