Question

I have a multi-class classification problem and I am primarily using macro-average F1 measure to evaluate the performance of models.

I want to verify if the results are statistically significant.

I have the results of two classifiers on the same train/test set (paired observations).

Some sources suggest using McNemar’s test for binary classification tasks. Is there a generalization of McNemar’s test for multi-class classification problems? If so, what would be the appropriate procedure for carrying out these tests?


Solution

The generalisation of McNemar's test is called the Cochran–Mantel–Haenszel test.

There is an implementation in R, but I suppose porting it to Python should not be too hard. You can find the R version here
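As a starting point in Python, here is a minimal sketch of a simpler, related approach rather than a port of the Cochran–Mantel–Haenszel test: reduce each multi-class prediction to correct/incorrect and run an exact McNemar test on the discordant pairs. The function name `mcnemar_exact` and the toy labels are made up for illustration, and only the standard library is assumed. Note that this compares error rates of the two classifiers, not their macro-F1 scores.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on correct/incorrect outcomes.

    Hypothetical helper: works for any number of classes because each
    prediction is only scored as right or wrong.
    Returns (b, c, p_value).
    """
    rows = list(zip(y_true, pred_a, pred_b))
    # b: classifier A correct, classifier B wrong
    b = sum(1 for t, a, p in rows if a == t and p != t)
    # c: classifier A wrong, classifier B correct
    c = sum(1 for t, a, p in rows if a != t and p == t)
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy labels, for illustration only.
    y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
    pred_a = [0, 1, 2, 2, 1, 0, 2, 1, 0, 1]
    pred_b = [0, 2, 2, 1, 1, 0, 0, 1, 2, 1]
    b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
    print(f"A right / B wrong: {b}, A wrong / B right: {c}, p-value: {p:.4f}")
```

Only the test samples on which the two classifiers disagree in correctness carry information here, so a small test set with few disagreements will rarely give a small p-value.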

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange