Weka normalizing columns

https://stackoverflow.com/questions/2271203

20-09-2019
|

Question

I have an ARFF file containing 14 numerical columns. I want to perform a normalization on each column separately, that is modifying the values from each colum to (actual_value - min(this_column)) / (max(this_column) - min(this_column)). Hence, all values from a column will be in the range [0, 1]. The min and max values from a column might differ from those of another column.

How can I do this with Weka filters?

Thanks

Solution

This can be done using

weka.filters.unsupervised.attribute.Normalize

After applying this filter all values in each column will be in the range [0, 1]

OTHER TIPS

That's right. Just wanted to remind about the difference of "normalization" and "standardization". What mentioned in the question is "standardization", while "normalization" assumes Gaussian distribution and normalizes by mean, and standard variation of each attribute. If you have an outlier in your data, the standardize filter might hurt your data distribution as the min, or max might be much farther than the other instances.

Here is the working normalization example with K-Means in JAVA.

final SimpleKMeans kmeans = new SimpleKMeans();

final String[] options = weka.core.Utils
        .splitOptions("-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 10 -A \"weka.core.EuclideanDistance -R first-last\" -I 500 -num-slots 1 -S 50");
kmeans.setOptions(options);

kmeans.setSeed(10);
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(25);
kmeans.setMaxIterations(1000);

final BufferedReader datafile = new BufferedReader(new FileReader("/Users/data.arff");
Instances data = new Instances(datafile);

//normalize
final Normalize normalizeFilter = new Normalize();
normalizeFilter.setInputFormat(data);
data = Filter.useFilter(data, normalizeFilter);

//remove class column[0] from cluster
data.setClassIndex(0);
final Remove removeFilter = new Remove();
removeFilter.setAttributeIndices("" + (data.classIndex() + 1));
removeFilter.setInputFormat(data);
data = Filter.useFilter(data, removeFilter);

kmeans.buildClusterer(data);

System.out.println(kmeans.toString());

// evaluate clusterer
final ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(kmeans);
eval.evaluateClusterer(data);
System.out.println(eval.clusterResultsToString());

If you have CSV file then replace BufferedReader line above with below mentioned Datasource:

final DataSource source = new DataSource("/Users/data.csv");
final Instances data = source.getDataSet();

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow