Question

I want to test the accuracy of a methodology. I ran it ~400 times, and each run produced a different classification. I also have the ground truth, i.e., the real classification to test against.

For each classification I computed a confusion matrix. Now I want to aggregate these results in order to get the overall confusion matrix. How can I achieve this?

Can I simply sum all the confusion matrices to obtain the overall one?


Solution

I do not know of a standard answer to this, but I thought about it some time ago and have some ideas to share.

A single confusion matrix gives you a more or less complete picture of how your classification model confuses (mis-classifies) classes. When you repeat the classification test, you end up with multiple confusion matrices, and the question is how to get a meaningful aggregate confusion matrix from them. The answer depends on what "meaningful" means (pun intended); I do not think there is a single version of meaningful.

One way is to follow the rough idea of repeated testing. In general, you test something multiple times in order to get more accurate results. As a general principle, averaging over the results of the multiple tests reduces the variance of the estimates and, as a consequence, increases their precision. You can proceed in this way by summing the matrices cell by cell and then dividing by the number of tests. You can go further and, instead of estimating only a single value for each cell of the confusion matrix, also compute confidence intervals, t-values and so on. This is fine from my point of view, but it tells only one side of the story.
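A minimal sketch of that averaging idea, assuming your ~400 per-run confusion matrices are stacked into a NumPy array of shape (n_runs, n_classes, n_classes); the random data below is just a stand-in for your real matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_classes = 400, 3

# Hypothetical stand-in for the real per-run confusion matrices.
conf_matrices = rng.poisson(lam=20, size=(n_runs, n_classes, n_classes))

mean_cm = conf_matrices.mean(axis=0)           # cell-by-cell average
std_cm = conf_matrices.std(axis=0, ddof=1)     # cell-by-cell sample std
sem_cm = std_cm / np.sqrt(n_runs)              # standard error of the mean

# Rough 95% confidence interval per cell (normal approximation).
ci_low = mean_cm - 1.96 * sem_cm
ci_high = mean_cm + 1.96 * sem_cm

print("average confusion matrix:\n", mean_cm)
print("95% CI half-width per cell:\n", 1.96 * sem_cm)
```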

The other side of the story, which might be investigated, is how stable the results are for the same instances. To illustrate, I will take an extreme example. Suppose you have a classification model for 3 classes, and that these classes appear in equal proportion. If your model predicts one class perfectly and mixes up the other 2 classes with random-like performance (say, picking randomly between the two), you will end up with a correct classification ratio of roughly 0.33 + 0.166 + 0.166 ≈ 0.66. This might seem decent, but even if you look at a single confusion matrix you will not know that your performance on the last 2 classes varies wildly from run to run. Multiple tests can help here. But would averaging the confusion matrices reveal this? I believe not. The averaging will give more or less the same result, and doing multiple tests will only decrease the variance of the estimate; it says nothing about the wild instability of the predictions.

So another way to compose the confusion matrices would be to involve a prediction density for each instance. One can build this density by counting, for each instance, the number of times each class was predicted for it. After normalization, you have for each instance a prediction density rather than a single prediction label. You can see that a single prediction label is equivalent to a degenerate density that puts probability 1 on the predicted class and 0 on the other classes for each separate instance. Having these densities, one can build a confusion matrix by adding, for each instance and each predicted class, the corresponding probability to the appropriate cell of the aggregated confusion matrix.
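A minimal sketch of how such a density-based aggregation could look, assuming `y_true` holds the ground-truth class of every instance and `predictions[r, i]` holds the class predicted for instance `i` in run `r`; all names and the random data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs, n_instances, n_classes = 400, 1000, 3

# Hypothetical inputs standing in for the real experiment.
y_true = rng.integers(0, n_classes, size=n_instances)
predictions = rng.integers(0, n_classes, size=(n_runs, n_instances))

# Prediction density: for each instance, the fraction of runs in which
# each class was predicted for it.
density = np.zeros((n_instances, n_classes))
for r in range(n_runs):
    density[np.arange(n_instances), predictions[r]] += 1
density /= n_runs

# "Soft" aggregated confusion matrix: each instance contributes its whole
# density to the row of its true class.
soft_cm = np.zeros((n_classes, n_classes))
for i in range(n_instances):
    soft_cm[y_true[i]] += density[i]

print(soft_cm)
```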

One can argue that this would give similar results to the previous method. I think that may sometimes be the case, typically when the model has low variance, but the second method is less affected by how the test samples are drawn, and is therefore more stable and closer to reality.

The second method can also be altered to obtain a third method, where one assigns as the prediction for a given instance the label with the highest density.
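A short sketch of that third method, reusing the same hypothetical `density`, `predictions` and `y_true` layout as the previous snippet (regenerated here so the example stands alone):

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs, n_instances, n_classes = 400, 1000, 3
y_true = rng.integers(0, n_classes, size=n_instances)
predictions = rng.integers(0, n_classes, size=(n_runs, n_instances))

# Fraction of runs predicting each class, per instance.
density = np.stack(
    [(predictions == c).mean(axis=0) for c in range(n_classes)], axis=1
)

# Third method: pick the class with the highest density for each instance
# (a majority vote over the runs), then build an ordinary confusion matrix.
majority_labels = density.argmax(axis=1)
vote_cm = np.zeros((n_classes, n_classes), dtype=int)
for true_label, pred_label in zip(y_true, majority_labels):
    vote_cm[true_label, pred_label] += 1

print(vote_cm)
```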

I have not implemented these things yet, but I plan to study them further because I believe they might be worth spending some time on.

OTHER TIPS

There are a few ways to achieve your "master confusion matrix".

  1. Sum all the confusion matrices together: as you suggested, summing them gives a confusion matrix of totals. The problem with this is that the raw totals are hard to interpret on their own.

  2. Average the entries. This method is the same as number one, but you divide each entry by the number of trials (~400 in your case). This would be my preferred method, because then you can translate each cell into a (mean) ± (an error measurement) and actually see which categories are the most volatile or stable (see the sketch after this list). Be careful with how you interpret this 'error measurement', though.

  3. Report a problem-specific measurement of the confusion numbers. For example, if your numbers have outliers, medians would be preferred over means.
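A rough illustration of options 1–3, assuming the per-run matrices are stacked in a NumPy array `conf_matrices` of shape (n_runs, n_classes, n_classes); the random data is a placeholder for your real matrices:

```python
import numpy as np

# Hypothetical stack of the ~400 per-run confusion matrices.
rng = np.random.default_rng(2)
conf_matrices = rng.poisson(lam=20, size=(400, 3, 3))

summed_cm = conf_matrices.sum(axis=0)         # option 1: raw totals
mean_cm = conf_matrices.mean(axis=0)          # option 2: per-run average
std_cm = conf_matrices.std(axis=0, ddof=1)    # spread of each cell across runs
median_cm = np.median(conf_matrices, axis=0)  # option 3: robust to outlier runs

print("mean ± std for cell (0, 0):", mean_cm[0, 0], "±", std_cm[0, 0])
```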

There are other statistics that could be reported as well. You could redo the procedure so that it keeps track of the individual classifications; then you can report other useful statistics such as the '% of classifications that stay the same and are also accurate', etc.
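A small sketch of one such statistic, assuming the same hypothetical `predictions` / `y_true` layout as in the earlier snippets: the fraction of instances that receive the same label in every run, and the fraction for which that label is also correct.

```python
import numpy as np

rng = np.random.default_rng(3)
n_runs, n_instances, n_classes = 400, 1000, 3
y_true = rng.integers(0, n_classes, size=n_instances)
predictions = rng.integers(0, n_classes, size=(n_runs, n_instances))

# An instance is "stable" if every run predicted the same class for it...
stable = (predictions == predictions[0]).all(axis=0)
# ...and "stable and accurate" if that class is also the true one.
stable_and_accurate = stable & (predictions[0] == y_true)

print("% stable:", 100 * stable.mean())
print("% stable and accurate:", 100 * stable_and_accurate.mean())
```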

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange