Question

In the SMOTE paper here, the authors present the logic for creating synthetic examples when some of the features are nominal and some are continuous (section 6.1, SMOTE-NC).

This example is provided:

$F_1$ = 1 2 3 A B C [Let this be the sample for which we are computing nearest neighbors]

$F_2$ = 4 6 5 A D E

$F_3$ = 3 5 6 A B K

So, Euclidean Distance between $F_2$ and $F_1$ would be:

$Eucl$ = $\sqrt{(4-1)^2 + (6-2)^2 + (5-3)^2 + Med^2 + Med^2}$

Med is the median of the standard deviations of the continuous features of the minority class. The median term is included twice, once for each nominal feature that differs between $F_1$ and $F_2$: feature 5 ($B \to D$) and feature 6 ($C \to E$).
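To make the question concrete, here is a minimal sketch of the distance as I understand it from the example. The function name `smote_nc_distance`, the parameter `n_continuous`, and the value `med = 1.0` are my own assumptions for illustration; in the paper, Med would be computed from the full minority class, which this three-row example does not provide.

```python
import math

def smote_nc_distance(a, b, n_continuous, med):
    """Distance between two mixed samples, following the example above.

    The first `n_continuous` entries are treated as continuous and the
    rest as nominal. Each nominal mismatch contributes med**2.
    """
    # Squared differences over the continuous features.
    continuous_part = sum(
        (x - y) ** 2 for x, y in zip(a[:n_continuous], b[:n_continuous])
    )
    # Count the nominal features whose values differ.
    mismatches = sum(
        x != y for x, y in zip(a[n_continuous:], b[n_continuous:])
    )
    return math.sqrt(continuous_part + mismatches * med ** 2)

f1 = [1, 2, 3, "A", "B", "C"]
f2 = [4, 6, 5, "A", "D", "E"]
f3 = [3, 5, 6, "A", "B", "K"]

med = 1.0  # placeholder value, NOT derived from real minority-class data

d12 = smote_nc_distance(f1, f2, n_continuous=3, med=med)  # two nominal mismatches
d13 = smote_nc_distance(f1, f3, n_continuous=3, med=med)  # one nominal mismatch
```

Written this way, it is clear that the penalty for a nominal mismatch scales with the spread of the continuous features, which is exactly the coupling I am asking about.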

The paper does not explain why the distance contribution of the nominal features should depend on the continuous ones.

Can anyone provide such an explanation? Did I miss it in the paper?

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange