Question

I'm using the SMOTE filter in WEKA to balance my data.
I have doubts about two of its parameters, nearestNeighbors and percentage.

nearestNeighbors -- The number of nearest neighbors to use.
percentage -- The percentage of SMOTE instances to create.

How should I set them?

I thought the number of neighbors was the number of synthetic samples it is going to create.
So what's the meaning of percentage? It should be less than or equal to the number of neighbors, right? Is it the percentage of synthetic samples that is considered?

For example:
If I set 10 neighbors and 200%, what will happen?
Can anyone give me some examples of correct use?


Solution

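The nearestNeighbors parameter sets how many nearest-neighbor instances (those surrounding the instance currently being considered) are candidates for building an in-between synthetic instance. The default value is 5. Thus, for each real instance, its 5 nearest neighbors in the same class are found, and a new synthetic instance is computed by interpolating between the real instance and one of those neighbors chosen at random. The parameter does not control how many synthetic instances are created.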
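To make the role of nearestNeighbors concrete, here is a minimal Python sketch of how one synthetic instance could be built. It is not Weka's code: the function name smote_one and the toy data are made up for illustration, only numeric attributes are handled, and a single random gap is used for all attributes (implementations differ on that detail).

```python
import math
import random


def smote_one(instance, minority, k=5, rng=random):
    """Create one synthetic point from `instance` (simplified, numeric attributes only)."""
    # Rank the other minority instances by Euclidean distance to `instance`.
    others = [p for p in minority if p is not instance]
    neighbors = sorted(others, key=lambda p: math.dist(instance, p))[:k]
    # Pick ONE of the k nearest neighbors at random.
    neighbor = rng.choice(neighbors)
    # Interpolate at a random position on the segment between the two points.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(instance, neighbor)]


# Hypothetical minority class; the last point is an outlier that is not
# among the 3 nearest neighbors of the first point.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [5.0, 5.0]]
rng = random.Random(42)
synthetic = smote_one(minority[0], minority, k=3, rng=rng)
print(synthetic)
```

A larger k only widens the pool of neighbors to draw from, so synthetic points spread over a larger region. It does not change how many points are generated.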

The percentage parameter sets how many synthetic instances are created, relative to the size of the class being oversampled. By default this is the class with the fewest instances; you can select a different class, such as the majority class, with the -C option. The default value is 100. This means that if you have 25 instances in your minority class, another 25 instances are created synthetically from them (using their nearest neighbors' values). With 200%, 50 synthetic instances are created, and so on.
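The arithmetic can be sketched in a few lines of Python. The helper name synthetic_count is hypothetical, and rounding down for percentages that do not divide evenly is an assumption here, not something taken from the Weka source. Note that nearestNeighbors does not appear in the calculation: with 10 neighbors and 200%, each minority instance still yields two synthetic instances, drawn from a pool of 10 neighbors.

```python
def synthetic_count(minority_size, percentage):
    """Synthetic instances created for a given SMOTE percentage (rounded down)."""
    return int(minority_size * percentage / 100)


minority_size = 25
for percentage in (100, 200, 50):
    created = synthetic_count(minority_size, percentage)
    print(f"{percentage}% -> {created} synthetic, "
          f"{minority_size + created} minority instances in total")
```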

For further information, refer to the Weka documentation for SMOTE and to the original paper by Chawla et al. (2002), which explains the whole method in depth.

As far as I can tell, the Weka SMOTE filter on its own only oversamples. So you can additionally apply the supervised SpreadSubsample filter afterwards to undersample the majority class.

OTHER TIPS

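If you have two classes and want to end up with an equal number in each class, divide the number of samples in the big class by the number of samples in the smaller class, subtract 1, and multiply the result by 100. That's your P parameter. (When the ratio is below 2, this is the same as taking its fractional part and multiplying by 100.) For example, with 100 majority and 25 minority instances the ratio is 4, so P = 300.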
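As a quick check of that rule, here is a small Python sketch. The function name balancing_percentage is made up for illustration. It uses the equivalent form 100 * (majority - minority) / minority, which avoids a floating-point rounding step.

```python
def balancing_percentage(majority_size, minority_size):
    """SMOTE percentage that brings the minority class up to the majority size."""
    return 100.0 * (majority_size - minority_size) / minority_size


for majority_size, minority_size in ((100, 25), (40, 25)):
    p = balancing_percentage(majority_size, minority_size)
    total = minority_size + minority_size * p / 100
    print(f"{majority_size} vs {minority_size}: P = {p}, minority after SMOTE = {total}")
```

If the resulting count is not a whole number, the classes will end up approximately rather than exactly balanced.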

Licensed under: CC-BY-SA with attribution