Removing noise by clustering: at which pre-processing step should it happen?

https://datascience.stackexchange.com//questions/64232

06-12-2019
Question

I am working on a classification task. The dataset is a UCI data set about machine learning with 200 observations and 2 classes.

Part of my model includes the following preprocessing steps:

1. remove missing values
2. normalize between 0 and 1
3. remove outliers
4. smoothing
5. remove trend from data
6. SMOTE

I would like to use a clustering method to remove noisy data points. The question is, at which step should this happen?

Solution

Looking at your steps, the key is to check which of them are affected by outliers.

Removing missing values is not affected, because that step does not depend on which other data points are present in the dataset.
Normalizing your data, however, is affected. If your outliers contain extreme values, they set the minimum or maximum used for scaling, which changes the normalized values of all the non-outlier data points.
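
To make the effect on normalization concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration and `min_max` is just a helper name I chose, not something from your pipeline.

```python
def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature values (not from the UCI data set in the question).
clean = [10.0, 12.0, 14.0, 16.0, 18.0]
with_outlier = clean + [1000.0]  # the same values plus one extreme point

scaled_clean = min_max(clean)
scaled_with_outlier = min_max(with_outlier)

print(scaled_clean)             # spread evenly over [0, 1]
print(scaled_with_outlier[:5])  # the same five points, now squeezed near 0
```

Without the extreme value the five points cover the whole range from 0 to 1. With it, they all end up below 0.01, so most of the normalized range is spent on a single point.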

Therefore, intuitively, I would perform the noise removal at the very start or right after step 1 (removing missing values), so that it happens before normalization.
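
As a rough sketch of that ordering, the snippet below drops isolated points first and only then normalizes. It uses a simplified density filter (keep a point only if it has enough neighbours within a radius), which borrows the idea behind density-based clustering but is not a full DBSCAN. The data, the function names, and the `eps` and `min_neighbors` values are all assumptions for illustration. Note that `eps` is expressed in raw, unscaled units here, so it would need tuning if your features have very different scales.

```python
import math

def drop_sparse_points(points, eps, min_neighbors):
    """Keep points that have at least `min_neighbors` other points within `eps`."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= eps
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept

def min_max_columns(points):
    """Scale each feature (column) independently to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical rows with missing values already removed (step 1).
rows = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (1.1, 2.2), (1.3, 2.0), (25.0, 40.0)]

filtered = drop_sparse_points(rows, eps=0.5, min_neighbors=2)  # noise removal first
normalized = min_max_columns(filtered)                         # then normalization
```

Here the isolated point `(25.0, 40.0)` is dropped before scaling, so the remaining five rows use the full range from 0 to 1 in both features.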

Ultimately, you should check what works best for your task. Removing outliers may not help as much as you'd expect, and the same goes for the other pre-processing steps. Feel free to experiment!

Licensed under: CC-BY-SA with attribution
