Question

In order to run an NB classifier on about 400 MB of text data I need to use a vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(X_data)

But it is giving an out-of-memory error. I am using 64-bit Linux and a 64-bit Python build. How do people work through the vectorization process in scikit-learn for large text datasets?

Traceback (most recent call last):
  File "ParseData.py", line 234, in <module>
    main()
  File "ParseData.py", line 211, in main
    classifier = MultinomialNB().fit(X_train, y_train)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 313, in fit
    Y = labelbin.fit_transform(y)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
    neg_label=self.neg_label)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)), dtype=np.int)

Edited (ogrisel): I changed the title from "Out of Memory Error in Scikit Vectorizer" to "Out of Memory Error in Scikit-learn MultinomialNB" to make it more descriptive of the actual problem.

Solution

Let me summarize the outcome of the discussion in the comments:

  • The label preprocessing machinery used internally in many scikit-learn classifiers does not scale well memory-wise with respect to the number of classes. This is a known issue and there is ongoing work to tackle it.

  • The MultinomialNB class itself will probably not be suitable for classification in a label space with a cardinality of 43K, even if the label preprocessing limitation is fixed.

To address the large-cardinality classification problem, you could try:

  • fit binary SGDClassifier(loss='log', penalty='elasticnet') instances independently on the columns of y_train converted to dense numpy arrays, then call clf.sparsify(), and finally wrap those sparse models in a final one-vs-rest classifier (or rank the predictions of the binary classifiers by their predicted probabilities); see the first sketch after this list. Depending on the value of the regularization parameter alpha you might get sparse models that are small enough to fit in memory. You can also try the same with LogisticRegression, that is, something like:

    clf_label_i = LogisticRegression(penalty='l1').fit(X_train, y_train[:, label_i].toarray().ravel()).sparsify()

  • alternatively, try a PCA of the target labels y_train, cast your classification problem as a multi-output regression problem in the reduced label PCA space, and then decode the regressor's output by looking for the nearest class encoding in the label PCA space; see the second sketch below.
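
As a concrete illustration of the first option, here is a minimal sketch. It assumes y_train is a sparse label indicator matrix of shape (n_samples, n_labels), as implied by the y_train[:, label_i].toarray() call above; the alpha value, the top-k ranking helper, and the variable names are illustrative and not part of the original answer:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    sparse_models = []
    for label_i in range(y_train.shape[1]):
        # One binary target per label column (columns whose label never occurs would need to be skipped).
        y_col = y_train[:, label_i].toarray().ravel()
        # Newer scikit-learn versions spell this loss 'log_loss'.
        clf = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-4)
        clf.fit(X_train, y_col)
        clf.sparsify()  # store coef_ as a sparse matrix to save memory
        sparse_models.append(clf)

    # Poor man's one-vs-rest: rank labels by the probability of the positive class.
    def predict_top_labels(X, models, top_k=5):
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        return np.argsort(-scores, axis=1)[:, :top_k]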
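
For the label-PCA option, a sketch along the following lines could work. TruncatedSVD stands in for PCA because it accepts a sparse label matrix, and Ridge is an arbitrary choice of multi-output regressor; neither choice, nor the number of components, comes from the original answer:

    from scipy import sparse
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge
    from sklearn.metrics import pairwise_distances_argmin

    n_labels = y_train.shape[1]
    svd = TruncatedSVD(n_components=100)            # reduced label space
    Y_reduced = svd.fit_transform(y_train)          # shape (n_samples, 100)

    reg = Ridge(alpha=1.0).fit(X_train, Y_reduced)  # multi-output regression

    # Each class is encoded as the projection of its one-hot indicator vector.
    class_codes = svd.transform(sparse.identity(n_labels, format='csr'))

    def decode(X):
        Y_pred = reg.predict(X)                     # points in the reduced label space
        # Nearest class encoding (Euclidean distance) for each predicted point.
        return pairwise_distances_argmin(Y_pred, class_codes)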

You can also have a look at the Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification implemented in lightning, but I am not sure it is suitable for a label cardinality of 43K either.

Licensed under: CC-BY-SA with attribution