Question

What is the exact accuracy of a random classifier which has n labels (say 1000) where k labels (say 50) are true? Can I say the accuracy of a random classifier has an upper bound of k/n?

-Edit-

I am interested in a numerical figure or an upper bound for a random classifier (for example, the accuracy of a classifier randomly picking 50 classes out of 1000 where some 50 classes are correct). Sorry if my previous question was unclear.


Solution

I'll assume you want accuracy as @AlexanderRoiatchki defined it (see also CV.SE), and want exactly $k$ distinct labels, both true and predicted.

Letting $X$ denote the random variable of the number of correct labels for an instance, we have that $$ \mathbb{P}(X=x) = \binom{k}{x} \binom{n-k}{k-x} / \binom{n}{k} $$ (choose the correct labels and the incorrect ones separately). The accuracy score from a given instance is $$ \frac{|T_i \cap P_i| }{ |T_i \cup P_i| } = \frac{ X}{2k-X}. $$

So the expected value of the accuracy is $$ \mathbb{E}\left[\frac{X}{2k-X}\right] = \frac{1}{\binom{n}{k}} \sum_{x=0}^k \frac {x \binom{k}{x} \binom{n-k}{k-x}} {2k-x}.$$ That doesn't seem to have a nice closed form. When $k\ll \sqrt{n}$, this appears to be roughly $k/(2n)$, though that relies on a big assumption that the non-dominating terms of the summation are negligible.

In the specific case you mentioned, $k=50, n=1000$, I get 0.0259.
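The sum above is straightforward to evaluate exactly with integer binomial coefficients. Here is a short sketch (the function name is my own) that reproduces the figure for $k=50$, $n=1000$:

```python
from math import comb

def expected_jaccard_accuracy(n, k):
    """Exact expected accuracy E[X / (2k - X)] of a random classifier
    that picks k of n labels when exactly k labels are true.
    X (the number of correct labels) is hypergeometric."""
    total = comb(n, k)
    return sum(
        x * comb(k, x) * comb(n - k, k - x) / (2 * k - x)
        for x in range(k + 1)
    ) / total

print(round(expected_jaccard_accuracy(1000, 50), 4))  # → 0.0259
```

Note that this agrees with the $k/(2n)$ heuristic only to within a factor close to one ($50/2000 = 0.025$), since $k = 50$ is not much smaller than $\sqrt{1000}$.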


Edit, in response to comment: if you want to define accuracy as just $X/k$, then the expected accuracy of the random classifier simplifies nicely to $k/n$:

$$ \begin{align*} \frac{1}{\binom{n}{k}} \sum_{x=0}^k \frac{x}{k} \binom{k}{x} \binom{n-k}{k-x} &= \frac{1}{k\binom{n}{k}} \left( k \binom{n-1}{k-1} \right) & \text{combinatorial identity*} \\ &= \frac{ (n-1)! / [(k-1)! (n-k)!] }{ n! / [k! (n-k)!] } & \text{just expand coefficients}\\ &= \frac{k}{n} & \text{simplify factorials} \end{align*} $$

(* I can't help but elaborate. In the current context, this is the number of choices with $X=x$ with the additional decoration of a special correct label (one of the $x$). To get the RHS then, pick the decorated label first ($k$), then choose the remaining labels without regard to how many are correct ($\binom{n-1}{k-1}$). )

OTHER TIPS

In the context of a multi-label classifier (that is, having potentially more than one label per data point) we can have a prediction which is either fully correct, partially correct or fully incorrect. To account for this, the following accuracy metric has been proposed: $$\text{Accuracy} = \frac{1}{n}\sum_{i=1}^n\frac{|T_i \cap P_i|}{|T_i\cup P_i|},$$ where $n$ is the sample size, $|\cdot|$ denotes the number of elements of a set, $T_i$ is the set of true labels and $P_i$ is the set of predicted labels for an item of the sample.

For instance, let's say the $i$-th item has actual labels $A, B, C$ and $D$ (that is, $T_i = \{A, B, C, D\}$), but your classifier predicts labels $B, D$ and $E$ (that is, $P_i =\{B, D, E\}$); then you'd get $$\frac{|T_i\cap P_i|}{|T_i\cup P_i|}=\frac{|\{B,D\}|}{|\{A,B,C,D,E\}|}=\frac{2}{5}.$$ To get the accuracy for the entire training set, you calculate this for each instance $i=1,\dots, n$ and you average it out.
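This per-instance score maps directly onto set operations. A minimal sketch of the worked example, using Python's built-in set intersection and union:

```python
T_i = {"A", "B", "C", "D"}  # true labels for the i-th item
P_i = {"B", "D", "E"}       # predicted labels for the i-th item

# Jaccard-style per-instance accuracy: |T ∩ P| / |T ∪ P|
score = len(T_i & P_i) / len(T_i | P_i)
print(score)  # → 0.4, i.e. 2/5
```

The overall metric is then just the mean of `score` over all $n$ instances.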

You can also have a look at the evaluation metrics section of the article in Wikipedia about multi-label classifiers: https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange