Pre-processing mixed data prior to clustering

https://datascience.stackexchange.com/questions/69585

09-12-2020

Question

I am new to hierarchical clustering and wish to perform clustering on mixed data. I am slightly confused about the necessary pre-processing steps. I understand how to pre-process purely continuous data; what I haven't been able to identify is which pre-processing steps are necessary for mixed data. Do I just scale my continuous variables, impute missing data, and leave the categorical variables alone? Or do I need to perform transformations across all of my variable types?

Solution

This depends on many factors, including the data and data types, the distance metric, and the clustering method. You also need to bear in mind that different software packages may or may not handle these steps and transformations for you, and may handle them differently.

Numerical data:

Normalise or scale numerical features so that they are on the same scale or have unit variance. For instance, apply min-max scaling so that all values fall in the 0-1 range.
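As an illustration of the two options mentioned above, here is a minimal sketch in plain Python. The column names and values are made up for the example, and the helper functions are hypothetical; libraries such as scikit-learn provide equivalents (MinMaxScaler, StandardScaler).

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: there is no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Centre on zero mean and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical columns on very different scales.
ages = [20, 30, 40, 60]
incomes = [20_000, 50_000, 80_000, 120_000]

scaled_ages = min_max_scale(ages)        # [0.0, 0.25, 0.5, 1.0]
scaled_incomes = min_max_scale(incomes)  # also in the 0-1 range
z_ages = standardize(ages)               # zero mean, unit variance
```

Without this step, a distance such as Euclidean would be dominated by the income column simply because its numbers are larger.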

Categorical data:

For nominal data such as gender or country, one can apply dummy / one-hot encoding to effectively treat each value as a binary feature. Where there is high cardinality (more than about 15 distinct values), for instance US states, it can be necessary to reduce the number of categories by applying feature engineering or other techniques.
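A rough sketch of one-hot encoding, plus one possible way to reduce cardinality, could look like this. Merging rare categories into a single "Other" bucket is an assumption chosen for illustration, not the only option, and the helper names and sample data are hypothetical.

```python
from collections import Counter

def one_hot(values, categories=None):
    """Return (categories, rows); each row holds one 0/1 flag per category."""
    if categories is None:
        categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def group_rare(values, top_k, other="Other"):
    """Keep the top_k most frequent categories and merge the rest into one bucket."""
    keep = {c for c, _ in Counter(values).most_common(top_k)}
    return [v if v in keep else other for v in values]

# Hypothetical nominal column.
countries = ["UK", "US", "FR", "US", "DE", "US", "UK"]

cats, encoded = one_hot(countries)
# cats is ['DE', 'FR', 'UK', 'US']; the first row is [0, 0, 1, 0]

# Reduce cardinality before encoding: keep the 2 most common values.
reduced = group_rare(countries, top_k=2)
reduced_cats, reduced_encoded = one_hot(reduced)
# reduced_cats is ['Other', 'UK', 'US']
```

Each encoded row contains exactly one 1, so every category is treated as its own binary feature with no implied order.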

Ordinal data is perhaps the hardest to handle. One needs to understand and account for the ordering and the relative difference between values. Take Olympic medals: we can assign Bronze (1), Silver (2), and Gold (3), and then apply min-max 0-1 scaling to treat them effectively as a numerical feature. The key point is that this encoding imposes a spacing: on the raw codes silver is double the value of bronze and gold is three times the value of bronze, and after scaling the step from bronze to silver counts exactly as much as the step from silver to gold. This may hold true, but it becomes challenging when the order or spacing in the data is less clear. One case I frequently have to deal with is company revenue bins of unequal size. Another approach, in the case of classification, is to encode each value by the fraction of its rows that belong to the target class.
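To make the spacing issue concrete, the sketch below encodes the medal example with equal steps and then shows one possible alternative for unequal bins. Placing each revenue bin at its midpoint is an assumption made purely for illustration, and the bin labels, midpoints, and helper names are hypothetical.

```python
def encode_ordinal(values, order):
    """Map ordered categories to equally spaced values in the 0-1 range."""
    rank = {level: i for i, level in enumerate(order)}
    top = len(order) - 1
    return [rank[v] / top for v in values]

def encode_with_positions(values, positions):
    """Map categories to hand-chosen positions, then min-max scale them."""
    lo, hi = min(positions.values()), max(positions.values())
    return [(positions[v] - lo) / (hi - lo) for v in values]

medals = ["Bronze", "Gold", "Silver", "Bronze"]
equal = encode_ordinal(medals, order=["Bronze", "Silver", "Gold"])
# equal is [0.0, 1.0, 0.5, 0.0]: both steps between medals count the same

# Hypothetical revenue bins of unequal width, placed at their midpoints (in $M).
revenue_midpoints = {"<1M": 0.5, "1-10M": 5.5, "10-100M": 55.0}
revenue = ["<1M", "10-100M", "1-10M"]
spaced = encode_with_positions(revenue, revenue_midpoints)
# the middle bin now sits much closer to the lowest bin than to the highest
```

Which spacing is appropriate depends on what the distance between two levels should mean for the clustering, so the positions are a modelling choice rather than something the data dictates.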

I am writing this Python notebook and blog post on clustering mixed-datatype data here. It's a work in progress, but the key concepts are there.

Other answers

If it's categorical you can use the presence/absence of each category as a feature.

If it's ordinal (ordered), you can normalize its values to be between 0 and 1 (or whatever range you normalized the rest of your data to).

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange