How to set parameters in WEKA to balance data with SMOTE filter?

Question 1

The nearestNeighbors parameter sets how many nearest neighbors of the currently considered instance are candidates for building a synthetic instance. The default value is 5. For each synthetic instance, SMOTE picks one of the 5 nearest neighbors of a real minority instance at random and places the new instance at a random point on the line between the two.
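To make the role of nearestNeighbors concrete, here is a minimal sketch of the interpolation step as described by Chawla et al. It is not Weka's code: the function names are made up for illustration, it handles numeric attributes only, and it uses plain Euclidean distance.

```python
import math
import random

def k_nearest(instance, others, k):
    """Return the k points in `others` closest to `instance` (Euclidean)."""
    return sorted(others, key=lambda p: math.dist(instance, p))[:k]

def synthesize(instance, minority, k=5, rng=random):
    """Create one synthetic point between `instance` and a random neighbor."""
    # Exclude the instance itself from its own neighbor list.
    others = [p for p in minority if p is not instance]
    neighbor = rng.choice(k_nearest(instance, others, k))
    gap = rng.random()  # position along the segment, in [0, 1)
    return [x + gap * (n - x) for x, n in zip(instance, neighbor)]

# Hypothetical minority-class points; the last one is a far-away outlier.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
            [2.5, 2.5], [3.0, 1.5], [9.0, 9.0]]

# With k=3 the outlier is never among the neighbors of the first point.
synthetic = synthesize(minority[0], minority, k=3, rng=random.Random(42))
print(synthetic)
```

A smaller nearestNeighbors value keeps synthetic instances close to the original one; a larger value spreads them over a wider region of the minority class.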

The percentage parameter sets how many synthetic instances are created, relative to the size of the class being oversampled. By default this is the class with the fewest instances; the -C option lets you pick a different class value by index. The default percentage is 100. This means that if you have 25 instances in your minority class, another 25 instances are created synthetically from them (using their nearest neighbors' values). With 200%, 50 synthetic instances are created, and so on.
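The arithmetic behind the percentage can be sketched as follows. The function name is hypothetical, and how Weka rounds when the result is not a whole number is not covered here; the sketch simply truncates.

```python
def synthetic_count(class_size, percentage=100.0):
    """Approximate number of synthetic instances for a given percentage."""
    # Assumption: truncate fractional results; Weka's own rounding may differ.
    return int(class_size * percentage / 100.0)

# 25 minority instances at 100%, 200% and 300%.
for p in (100, 200, 300):
    print(p, synthetic_count(25, p))
```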

For further information, see the Weka documentation for SMOTE and the original paper by Chawla et al. (2002), where the whole method is explained in depth.

In my experience, the Weka SMOTE filter on its own only oversamples. If you also want to undersample the majority class, you can apply the supervised SpreadSubsample filter afterwards.

Question 2

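If you have two classes and want to end up with an equal number of instances in each, divide the number of samples in the bigger class by the number of samples in the smaller class, subtract 1, and multiply by 100. That's your P parameter. For example, with 100 instances in the bigger class and 25 in the smaller one, the ratio is 4, so P = (4 - 1) * 100 = 300, which adds 75 synthetic instances. (If the ratio is below 2, this is the same as taking its fractional part.)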
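As a rough sketch of that calculation, assuming P is taken as (ratio - 1) * 100 so that it also works when the bigger class is more than twice the size of the smaller one (for ratios below 2 this equals the fractional part times 100). The function name is made up for illustration.

```python
def smote_percentage(majority_size, minority_size):
    """Percentage (-P) that brings the minority class up to the majority size."""
    if minority_size <= 0 or majority_size < minority_size:
        raise ValueError("expected 0 < minority_size <= majority_size")
    ratio = majority_size / minority_size
    return (ratio - 1.0) * 100.0

print(smote_percentage(100, 25))   # ratio 4.0
print(smote_percentage(150, 100))  # ratio 1.5
```

If the result is not a whole multiple that yields an integer number of instances, the classes may end up off by a few instances rather than exactly equal.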