Question

I hope I can make this clear with a few lines of code and explanation.

I have a list of 16K texts, labelled across 30 different classes, which I ran through different classifiers; my predictions and the ground truth match on average 94% of the time.

I now want to measure something extra (I'm not sure what I should measure beyond an F1 score at minimum, as I'm still learning), and I came across log_loss from sklearn, whose result I understood to range between 0 and 1. When run against my predictions, however, the result is 1.48xxx, which is in fact higher.

I tried to understand what was wrong.

I have explored the result of ComplementNB.predict_proba, which is required for log_loss, and the values match those of my prediction array.

Below some code:

import numpy
from sklearn.metrics import log_loss

y = ...  # my array of ground-truth labels

# Map each ground-truth label to its integer index among the sorted unique labels
labels = numpy.unique(y)
label_ary = [idx for gt in y for idx, lbl in enumerate(labels) if gt == lbl]

print(f'The log loss is {log_loss(label_ary, clf.predict_proba(X.toarray()))}')

Whether I use label_ary or y, I obtain the same value in both cases, which suggests that some conversion is already happening inside log_loss.

I'm not sure whether I am misinterpreting the result or the specifics of the function.

What am I doing wrong? Thanks


Solution

Interpretability of log loss

Log loss isn't necessarily in the range [0, 1] - it only expects its input probabilities to be in that range. Take a look at this example: $$ y_{pred} = 0.1 \\ y_{true} = 1.0 \\ \text{log\_loss} = -\left(y_{true} \log(y_{pred}) + (1 - y_{true}) \log(1 - y_{pred})\right) = -\log(0.1) \approx 2.302 $$ In an extreme case log loss can even be infinite. So there is nothing wrong with your code, and there isn't much you can conclude from the fact that the log loss is lower or greater than 1. What you can do with it is the same as with any loss function: compare it across similar models with different hyperparameters and choose the one with the lowest average loss as your best model (a process called hyperparameter optimization).
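As a quick sanity check (assuming scikit-learn is available; the numbers mirror the example above), a single confident-but-wrong prediction already pushes the log loss above 1:

from sklearn.metrics import log_loss

# One confident-but-wrong prediction: true class 1 predicted with probability 0.1
y_true = [1]
y_pred = [[0.9, 0.1]]  # predicted probabilities for classes [0, 1]
print(log_loss(y_true, y_pred, labels=[0, 1]))  # about 2.302, i.e. -log(0.1)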

When to use loss and when to use F1 score?

Let's say you have a dataset and a classification problem which you want to solve. You know that you can create a statistical model which returns probabilities for each class. You also know that there is (hypothetically) an algorithm which classifies based on some heuristics and requires no training. You would like to know which of these is best for your problem. What you do, simplifying a little, is:

  1. Split your dataset into train, validation and test sets.
  2. Use your train set to train the model.
  3. While training the model, calculate the loss on the train and validation sets in each epoch (if you're not using deep neural networks, you can and should use cross-validation).
  4. Plot the loss for the train and validation sets and see whether your model is biased/underfitted (high train loss and high validation loss) or overfitted (low train loss and high validation loss). The lower the validation loss, the better.
  5. Do 3. and 4. multiple times for different hyperparameters and select the one with the lowest validation loss. You now have a trained statistical model.
  6. Now use the F1 score to compare your model to the other algorithm. The higher the score, the better. Notice that, since the algorithm returns classes and not probabilities, its log loss would be infinite as soon as it is wrong on even one example - this is why we can't use log loss as a metric to compare the two methods. (A minimal sketch of this workflow is shown after this list.)
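As an illustration only, here is a minimal sketch of that workflow, assuming scikit-learn is available and using a tiny made-up corpus (texts, labels) in place of the 16K documents from the question; ComplementNB and the alpha grid are just example choices:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import f1_score

# Hypothetical stand-in data for the real corpus from the question
texts = ["good movie", "bad movie", "great film", "awful film"] * 50
labels = ["pos", "neg", "pos", "neg"] * 50
X = CountVectorizer().fit_transform(texts)

# 1. Hold out a test set; GridSearchCV below handles train/validation via cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

# 2.-5. Pick the hyperparameter (here the smoothing strength alpha) with the lowest
# average validation log loss ('neg_log_loss' is negated, so higher is better)
search = GridSearchCV(ComplementNB(), param_grid={"alpha": [0.1, 0.5, 1.0]},
                      scoring="neg_log_loss", cv=5)
search.fit(X_train, y_train)

# 6. Compare the tuned model to any other method on the held-out test set with F1
print(f1_score(y_test, search.predict(X_test), average="macro"))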

In short, you should use loss as the metric during the training/validation process to optimize parameters and hyperparameters, and use the F1 score (and possibly more metrics, for example area under the ROC curve) during the test process to select the best method for solving your problem. This way it's possible to compare different methods of solving the problem - even ones which don't use machine learning at all.
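To make the last point concrete, here is a small hypothetical illustration (the labels and predictions below are invented): the F1 score can rank both a probabilistic model and a rule-based heuristic, while log loss only applies to the method that outputs probabilities.

from sklearn.metrics import f1_score, log_loss

y_test = ["pos", "neg", "pos", "neg"]

# Probabilistic model: hard labels for F1, probabilities for log loss
# (probability columns follow the sorted label order: [neg, pos])
model_proba = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
model_pred = ["pos", "neg", "pos", "pos"]

# Rule-based heuristic: only hard labels, so only F1 applies to it
heuristic_pred = ["pos", "pos", "pos", "neg"]

print(f1_score(y_test, model_pred, average="macro"))
print(f1_score(y_test, heuristic_pred, average="macro"))
print(log_loss(y_test, model_proba, labels=["neg", "pos"]))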

Licensed under: CC-BY-SA with attribution