Question

What is the exact accuracy of a random classifier which has n labels (say 1000) where k labels (say 50) are true? Can I say the accuracy of a random classifier has an upper bound of k/n?

-Edit-

I am interested in a numerical figure or an upper bound for a random classifier (for example, the accuracy of a classifier randomly picking 50 classes out of 1000 where some 50 classes are correct). Sorry if my previous question was unclear.


Solution

I'll assume you want accuracy as @AlexanderRoiatchki defined it (see also CV.SE), and want exactly $k$ distinct labels, both true and predicted.

Letting $X$ denote the random variable of the number of correct labels for an instance, we have that $$ \mathbb{P}(X=x) = \binom{k}{x} \binom{n-k}{k-x} / \binom{n}{k} $$ (choose the correct labels and the incorrect ones separately). The accuracy score from a given instance is $$ \frac{|T_i \cap P_i| }{ |T_i \cup P_i| } = \frac{ X}{2k-X}. $$

So the expected value of the accuracy is $$ \mathbb{E}\left[\frac{X}{2k-X}\right] = \frac{1}{\binom{n}{k}} \sum_{x=0}^k \frac {x \binom{k}{x} \binom{n-k}{k-x}} {2k-x}.$$ That doesn't seem to have a nice closed form. When $k\ll \sqrt{n}$, this appears to be roughly $k/(2n)$, though that relies on a big assumption that the non-dominating terms of the summation are negligible.

In the specific case you mentioned, $k=50, n=1000$, I get 0.0259.
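The sum above is straightforward to evaluate exactly with integer binomial coefficients. Here is a short sketch (the function name is my own) that reproduces the figure for $k=50$, $n=1000$:

```python
from math import comb

def expected_jaccard_accuracy(n, k):
    """Exact expected accuracy E[X / (2k - X)] of a random classifier
    that picks k of n labels when exactly k labels are true.
    X (the number of correct labels) is hypergeometric."""
    total = comb(n, k)
    return sum(
        x * comb(k, x) * comb(n - k, k - x) / (2 * k - x)
        for x in range(k + 1)
    ) / total

print(round(expected_jaccard_accuracy(1000, 50), 4))  # → 0.0259
```

Note that this agrees with the $k/(2n)$ heuristic only to within a factor close to one ($50/2000 = 0.025$), since $k = 50$ is not much smaller than $\sqrt{1000}$.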


Edit, in response to comment: if you want to define accuracy as just $X/k$, then the expected accuracy of the random classifier simplifies nicely to $k/n$:

$$ \begin{align*} \frac{1}{\binom{n}{k}} \sum_{x=0}^k \frac{x}{k} \binom{k}{x} \binom{n-k}{k-x} &= \frac{1}{k\binom{n}{k}} \left( k \binom{n-1}{k-1} \right) & \text{combinatorial identity*} \\ &= \frac{ (n-1)! / [(k-1)! (n-k)!] }{ n! / [k! (n-k)!] } & \text{just expand coefficients}\\ &= \frac{k}{n} & \text{simplify factorials} \end{align*} $$

(* I can't help but elaborate. In the current context, this is the number of choices with $X=x$ with the additional decoration of a special correct label (one of the $x$). To get the RHS then, pick the decorated label first ($k$), then choose the remaining labels without regard to how many are correct ($\binom{n-1}{k-1}$). )

OTHER TIPS

In the context of a multi-label classifier (that is, having potentially more than one label per data point) we can have a prediction which is either fully correct, partially correct or fully incorrect. To account for this, the following accuracy metric has been proposed: $$\text{Accuracy} = \frac{1}{n}\sum_{i=1}^n\frac{|T_i \cap P_i|}{|T_i\cup P_i|},$$ where $n$ is the sample size, $|\cdot|$ denotes the number of elements of a set, $T_i$ is the set of true labels and $P_i$ is the set of predicted labels for an item of the sample.

For instance, let's say the $i$-th item has actual labels $A, B, C$ and $D$ (that is, $T_i = \{A, B, C, D\}$), but your classifier predicts labels $B, D$ and $E$ (that is, $P_i =\{B, D, E\}$); then you'd get $$\frac{|T_i\cap P_i|}{|T_i\cup P_i|}=\frac{|\{B,D\}|}{|\{A,B,C,D,E\}|}=\frac{2}{5}.$$ To get the accuracy for the entire training set, you calculate this for each instance $i=1,\dots, n$ and you average it out.
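This per-instance score maps directly onto set operations. A minimal sketch of the worked example, using Python's built-in set intersection and union:

```python
T_i = {"A", "B", "C", "D"}  # true labels for the i-th item
P_i = {"B", "D", "E"}       # predicted labels for the i-th item

# Jaccard-style per-instance accuracy: |T ∩ P| / |T ∪ P|
score = len(T_i & P_i) / len(T_i | P_i)
print(score)  # → 0.4, i.e. 2/5
```

The overall metric is then just the mean of `score` over all $n$ instances.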

You can also have a look at the evaluation metrics section of the article in Wikipedia about multi-label classifiers: https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange