Question

I have a dataset that contains a few variables whose values do not change. Some of these variables are categorical (for example, every value of the variable is 5) and a few are real-valued but hold the same value in every row. When I standardize the variables so that each has zero mean and unit variance, these variables produce NaN values. Is it OK to exclude such constant variables (whether categorical or real-valued) from the normalization/standardization step? These variables are important as features, so I cannot delete them. Is there any other way to handle such variables?

Solution

By definition, if a column or feature holds a constant value while the output variable changes, it is not influencing the output and can likely be ignored.
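As for the NaN values: a constant column has zero standard deviation, so the z-score works out to 0/0, which libraries such as NumPy report as NaN. Below is a minimal sketch of skipping such columns during standardization, assuming the data is a plain list of numeric rows. The function name `standardize_columns` and the sample data are hypothetical, chosen only for illustration.

```python
import math


def standardize_columns(rows):
    """Z-score every non-constant column; leave constant columns untouched.

    `rows` is assumed to be a list of equal-length lists of numbers.
    Returns the scaled copy and the indices of the constant columns.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    scaled = [list(row) for row in rows]  # copy so the input is not modified
    constant_cols = []

    for j in range(n_cols):
        column = [row[j] for row in rows]
        # A column is constant when its smallest and largest values coincide.
        if min(column) == max(column):
            constant_cols.append(j)
            continue
        mean = sum(column) / n_rows
        # Population standard deviation (divide by n, not n - 1).
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n_rows)
        for i in range(n_rows):
            scaled[i][j] = (rows[i][j] - mean) / std

    return scaled, constant_cols


# Hypothetical data: column 0 varies, column 1 is always 5.
data = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaled, constant_cols = standardize_columns(data)
```

Here the constant column is passed through unchanged and its index is reported, so it stays in the dataset without ever being divided by a zero standard deviation.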

A more formal test is to determine how much of the variance in a model that uses that feature is attributable to that feature.

A simple way to see this principle in action is to look up examples of PCA. In those examples, the technique tries to identify the features that drive the most variance.
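The variance idea can be checked directly on a small scale. The sketch below is not PCA itself, only a per-feature variance calculation, and it assumes the data is a list of numeric rows; `column_variances` and the sample values are hypothetical.

```python
def column_variances(rows):
    """Return the population variance of each column in `rows`."""
    n_rows = len(rows)
    variances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n_rows
        variances.append(sum((x - mean) ** 2 for x in column) / n_rows)
    return variances


# Hypothetical data: the middle column is constant.
data = [[1.0, 5.0, 10.0], [2.0, 5.0, 30.0], [3.0, 5.0, 20.0]]
variances = column_variances(data)

# Share of the total variance carried by each column.
total = sum(variances)
shares = [v / total for v in variances]
```

The constant column has a variance of exactly zero, so its share of the total is zero as well. Note that shares computed on unscaled data depend on each feature's units, which is why the varying features are normally standardized first.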

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange