Question

I am working on a classification task. The dataset comes from the UCI Machine Learning Repository and has 200 observations and 2 classes.

Part of my model includes the following preprocessing steps:

  1. remove missing values
  2. normalize between 0 and 1
  3. remove outliers
  4. smoothing
  5. remove trend from data
  6. SMOTE

I would like to use a clustering method to remove noisy data points. The question is, at which step should this happen?

Solution

The important thing is to check which of your steps are affected by outliers.

  1. Removing missing values is not affected, because whether a row has missing values does not depend on the other data points in the dataset.
  2. Normalizing is affected. Scaling to the range 0 to 1 typically uses each feature's minimum and maximum, so an outlier with an extreme value changes the normalized values of all the non-outlier data points.
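To make the second point concrete, here is a minimal sketch in plain Python. It assumes that "normalize between 0 and 1" means min-max scaling; the values and the helper name `min_max_normalize` are made up for illustration.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature values, not taken from the question's dataset.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = clean + [100.0]  # one extreme value added

normalized_clean = min_max_normalize(clean)
normalized_with_outlier = min_max_normalize(with_outlier)

# Without the outlier, the five points spread evenly over [0, 1].
print(normalized_clean)

# With the outlier, the same five points are squeezed close to 0.
print([round(v, 3) for v in normalized_with_outlier[:5]])
```

The single extreme value takes the top of the range for itself, and the remaining points end up compressed into a narrow band near 0.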

Therefore, intuitively, I would perform your noise removal at the very start or right after step 1, before normalizing.
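As a rough sketch of that ordering, the example below uses scikit-learn's `DBSCAN` as the clustering method, which is only one possible choice and not something the question specifies. The data is synthetic, and the `eps` and `min_samples` values are assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a 200-row dataset with no missing values:
# two dense groups plus five isolated, far-away points.
group_a = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(95, 2))
isolated = np.array(
    [[20.0, 20.0], [-15.0, 10.0], [30.0, -5.0], [12.0, -20.0], [-25.0, -25.0]]
)
X = np.vstack([group_a, group_b, isolated])
y = np.array([0] * 100 + [1] * 95 + [0, 1, 0, 1, 0])

# DBSCAN labels points that belong to no dense region as -1 (noise).
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
keep = labels != -1

# Drop the noisy rows from features and targets with the same mask.
X_clean, y_clean = X[keep], y[keep]

# Normalize only after the noisy points are gone.
X_scaled = MinMaxScaler().fit_transform(X_clean)

print(f"kept {X_clean.shape[0]} of {X.shape[0]} rows")
```

Because the scaler is fitted after filtering, the minimum and maximum it uses come from the retained points only.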

Ultimately, you should check what works best for your task. Removing outliers may not help as much as you expect, and the same goes for the other preprocessing steps. Feel free to experiment!

Licensed under: CC-BY-SA with attribution