Question

I have recently used xgboost to conduct binary classification in an NLP problem. The idea was to identify whether a particular article belongs to an author or not, a pretty standard exercise.

The results are output as probabilities between 0 and 1, and there is the occasional article that is completely misclassified.

I would like to know if there is a statistical approach that gives me a confidence interval for the probability outputs (for example, if I consider all articles with a prediction of 0.4, I will capture 95% of the articles that belong to the author), or anything else that helps me make decisions about the cut-offs.


Solution

What you're looking for is something along the lines of an ROC (Receiver Operating Characteristic) curve.

Using the classification threshold as a decision parameter, you can observe the trade-off between the FPR (False Positive Rate: how many of the articles not belonging to the author will be misclassified as the author's) and the TPR (True Positive Rate, a.k.a. recall: how many of the articles that really are by the author will be classified as such).

When the threshold is at one end, you'll classify all documents as belonging to the author (100% recall, but pretty bad precision), and at the other end you'll have 100% precision but pretty bad recall.

The plot will allow you to decide on a value that satisfies your requirements (i.e. how much your precision will suffer when you want 95% recall). You can select it based on your desired value of one metric (e.g. 95% recall), but really I'd just plot it and have a look. You can do it in SKLearn with plot_roc_curve.
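Here is a minimal sketch of how that could look, assuming a fitted classifier `model` (e.g. the XGBoost model from the question) and a held-out set `X_test` / `y_test` where label 1 means "written by the author" (these names are illustrative, not from the original post). It uses scikit-learn's `roc_curve` to get the FPR/TPR pairs, plots them, and then picks the highest threshold that still reaches 95% recall:

```python
# Minimal sketch: compute and plot an ROC curve, then pick a threshold.
# Assumed names: `model` is a fitted classifier (e.g. xgboost.XGBClassifier),
# X_test / y_test is a held-out set with label 1 = "written by the author".
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability that each article belongs to the author.
y_scores = model.predict_proba(X_test)[:, 1]

# FPR/TPR for every candidate decision threshold (thresholds are decreasing).
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
print("AUC:", roc_auc_score(y_test, y_scores))

# Plot the trade-off so you can eyeball a suitable cut-off.
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (recall)")
plt.title("ROC curve")
plt.show()

# Example rule: the highest threshold that still gives at least 95% recall.
mask = tpr >= 0.95
print("Threshold for >= 95% recall:", thresholds[mask][0])
print("FPR at that threshold:", fpr[mask][0])
```

Note that in recent scikit-learn releases the plot_roc_curve helper has been replaced by RocCurveDisplay.from_estimator, which produces the same plot directly from the fitted model.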

OTHER TIPS

I agree with Itama's answer. I just want to add that the ROC curve shows the trade-off between the TPR (recall) and the FPR, not precision. Using the ROC curve, you can find an optimal threshold by locating the point on the curve closest to (0, 1), i.e. making the FPR as close to 0 and the TPR as close to 1 as possible.
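As a sketch of that rule (reusing the illustrative `y_test` / `y_scores` names from the code in the accepted answer), the point closest to (0, 1) can be found directly from the output of `roc_curve`:

```python
# Minimal sketch of the "closest to (0, 1)" rule for picking a threshold.
# Assumes the same illustrative y_test / y_scores arrays as in the answer above.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_scores)

# Euclidean distance from each ROC point to the ideal corner (FPR=0, TPR=1).
distances = np.sqrt(fpr**2 + (1.0 - tpr)**2)
best = np.argmin(distances)
print("Optimal threshold:", thresholds[best])
print("FPR:", fpr[best], "TPR:", tpr[best])
```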

Licensed under: CC-BY-SA with attribution