Question

I've come across a metric for model evaluation which I had never seen before, and I don't know how to research it further (since I don't know its proper name).

I'm using someone else's code, whose goal is to perform cross-validation to choose the best tree-based algorithm for a binary classification problem. It is probably worth mentioning that the classes are highly skewed (93% / 7%). The metric is computed as follows: the classifier is trained, and then the probability associated with each test element is computed.

probas = probas[:,list(clf.classes_).index(1)]
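Here probas is the matrix of per-class probabilities for the test set, and the line above keeps only the column for the positive class, 1. For reference, a minimal sketch of the preceding step, assuming a fitted scikit-learn-style classifier clf (X_test is my placeholder name for the held-out features, not from the original code):

probas = clf.predict_proba(X_test)  # shape: (n_samples, n_classes)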

Then these probabilities are sorted from highest to lowest and placed on the x-axis. The y-axis is the cumulative sum of the true labels encountered at each probability point. Finally, they compute the area under the resulting curve, as in:

import numpy as np

# Sort the (probability, label) pairs by descending probability.
joint = zip(probas, truth)
joint = sorted(list(joint), key=lambda x: x[0], reverse=True)
probas = [x[0] for x in joint]
truth = [x[1] for x in joint]

# Calculate the accumulated fraction of true labels at each probability point.
# Also calculate the Area Under the Curve (AUC) (higher means a better model).
truth_cumulative = np.cumsum(truth) / np.sum(truth)
area = np.trapz(truth_cumulative, dx=1) / len(truth)
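To make the computation concrete, here is a self-contained toy run of the snippet above (the labels and probabilities are made up purely for illustration):

import numpy as np

# Toy data: 10 samples, 3 positives, scores that rank every positive first.
truth = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probas = [0.10, 0.90, 0.20, 0.15, 0.80, 0.05, 0.30, 0.25, 0.70, 0.12]

# Sort labels by descending probability, then accumulate them.
joint = sorted(zip(probas, truth), key=lambda x: x[0], reverse=True)
truth_sorted = [label for _, label in joint]

truth_cumulative = np.cumsum(truth_sorted) / np.sum(truth_sorted)
area = np.trapz(truth_cumulative, dx=1) / len(truth_sorted)
print(round(area, 3))  # 0.833 -- the maximum for this label mix, since all positives outrank all negatives

If I understand the construction, a random ranking would keep the curve near the diagonal and the area near 0.5, while here 0.833 is the ceiling because the curve can only reach 1.0 after the first three (of ten) samples.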

Can anyone give me an intuition for what this metric measures, and a pointer to some resources to understand it better?

Thanks.

No correct solution

Licensed under: CC-BY-SA with attribution