Question

I'm using sklearn's Random Forest to train my model. With the same input features, I tried two ways of encoding the target labels: first with label_binarize, to create one-hot encodings, and second with LabelEncoder, to create integer codes. The two cases give different accuracy scores. Is there a specific reason why this is happening, given that I'm only changing how the labels are encoded and not changing any input features?


Solution

Yes. With y being a 1d array of integers (as after LabelEncoder), sklearn treats it as a multiclass classification problem. With y being a 2d binary array (as after LabelBinarizer), sklearn treats it as a multilabel problem.
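One way to check how sklearn interprets each encoding is type_of_target from sklearn.utils.multiclass. The snippet below is only a minimal sketch: the labels array is made up for illustration and is not the asker's data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.utils.multiclass import type_of_target

# Hypothetical labels, invented for this illustration.
labels = np.array(["cat", "dog", "bird", "dog", "cat"])

# 1d array of integer codes, one per sample.
y_int = LabelEncoder().fit_transform(labels)

# 2d binary indicator array, one column per class.
y_bin = label_binarize(labels, classes=["bird", "cat", "dog"])

kind_int = type_of_target(y_int)  # "multiclass"
kind_bin = type_of_target(y_bin)  # "multilabel-indicator"

print(y_int.shape, kind_int)
print(y_bin.shape, kind_bin)
```

Note that with only two classes label_binarize returns a single column, so this distinction applies to three or more classes.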

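Presumably, the multilabel model is predicting no labels for some of the rows: it fits one binary output per class, and when no class reaches a probability above 0.5, every output predicts 0. Because accuracy_score on multilabel targets requires the whole row to match, each such row counts as an error, whereas the multiclass model always predicts its single most probable class. (With your actual data not being multilabel, the sum of probabilities across all classes from the model will probably still be 1, so the model will never predict more than one class. And if always exactly one class gets predicted, the accuracy score for the multiclass and multilabel models should be the same.)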
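The effect can be reproduced on synthetic data. This is a rough sketch under stated assumptions, not the asker's setup: the dataset comes from make_classification, and the sizes, n_estimators and random_state values are arbitrary. It counts the rows where the multilabel forest predicts no label at all, and also shows one possible workaround, taking the argmax over the per-class positive probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 3-class data (an assumption; stands in for the real features).
X, y_int = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
y_bin = label_binarize(y_int, classes=[0, 1, 2])  # shape (600, 3)

X_tr, X_te, yi_tr, yi_te, yb_tr, yb_te = train_test_split(
    X, y_int, y_bin, random_state=0
)

# Same model settings, only the target encoding differs.
multiclass = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, yi_tr)
multilabel = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, yb_tr)

pred_int = multiclass.predict(X_te)  # shape (n_test,)
pred_bin = multilabel.predict(X_te)  # shape (n_test, 3)

# Rows where the multilabel model predicted no class at all.
labels_per_row = pred_bin.sum(axis=1)
empty_rows = int((labels_per_row == 0).sum())

acc_multiclass = accuracy_score(yi_te, pred_int)
acc_multilabel = accuracy_score(yb_te, pred_bin)  # whole row must match

# Workaround: pick the class with the highest positive probability.
# For multi-output targets, predict_proba returns one (n_test, 2) array per class.
proba = multilabel.predict_proba(X_te)
pos_proba = np.column_stack([p[:, 1] for p in proba])
acc_argmax = accuracy_score(yi_te, pos_proba.argmax(axis=1))

print("rows with no predicted label:", empty_rows)
print("multiclass accuracy:", acc_multiclass)
print("multilabel accuracy:", acc_multilabel)
print("multilabel + argmax accuracy:", acc_argmax)
```

Every row counted in empty_rows is scored as wrong by the multilabel accuracy, which is where the gap between the two scores would come from.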

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange