Question

In order to run an NB classifier on about 400 MB of text data I need to use a vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(X_data)

But it is giving an out-of-memory error. I am using 64-bit Linux and a 64-bit Python build. How do people work through the vectorization process in scikit-learn for large text datasets?

Traceback (most recent call last):
  File "ParseData.py", line 234, in <module>
    main()
  File "ParseData.py", line 211, in main
    classifier = MultinomialNB().fit(X_train, y_train)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 313, in fit
    Y = labelbin.fit_transform(y)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 408, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 272, in transform
    neg_label=self.neg_label)
  File "/home/pratibha/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 394, in label_binarize
    Y = np.zeros((len(y), len(classes)), dtype=np.int)

Edited (ogrisel): I changed the title from "Out of Memory Error in Scikit Vectorizer" to "Out of Memory Error in Scikit-learn MultinomialNB" to make it more descriptive of the actual problem.

Solution

Let me summarize the outcome of the discussion in the comments:

  • The label preprocessing machinery used internally in many scikit-learn classifiers does not scale well memory-wise with respect to the number of classes. This is a known issue and there is ongoing work to tackle it.

  • The MultinomialNB class itself will probably not be suitable for classification in a label space with a cardinality of 43K, even if the label preprocessing limitation is fixed.

To address the large-cardinality classification problem, you could try:

  • fit binary SGDClassifier(loss='log', penalty='elasticnet') instances independently on the columns of y_train converted to dense numpy arrays, then call clf.sparsify(), and finally wrap those sparse models in a final one-vs-rest classifier (or rank the predictions of the binary classifiers by their predicted probabilities); see the first sketch after this list. Depending on the value of the regularization parameter alpha you might get sparse models that are small enough to fit in memory. You can also try the same with LogisticRegression, that is, something like:

    clf_label_i = LogisticRegression(penalty='l1').fit(X_train, y_train[:, label_i].toarray().ravel()).sparsify()

  • alternatively, try a PCA of the target labels y_train, cast your classification problem as a multi-output regression problem in the reduced label PCA space, and then decode the regressor's output by looking for the nearest class encoding in the label PCA space; see the second sketch below.
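
As a concrete illustration of the first option, here is a minimal sketch. It assumes y_train is a sparse label indicator matrix of shape (n_samples, n_labels), as implied by the y_train[:, label_i].toarray() call above; the alpha value, the top-k ranking helper, and the variable names are illustrative and not part of the original answer:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    sparse_models = []
    for label_i in range(y_train.shape[1]):
        # One binary target per label column (columns whose label never occurs would need to be skipped).
        y_col = y_train[:, label_i].toarray().ravel()
        # Newer scikit-learn versions spell this loss 'log_loss'.
        clf = SGDClassifier(loss='log', penalty='elasticnet', alpha=1e-4)
        clf.fit(X_train, y_col)
        clf.sparsify()  # store coef_ as a sparse matrix to save memory
        sparse_models.append(clf)

    # Poor man's one-vs-rest: rank labels by the probability of the positive class.
    def predict_top_labels(X, models, top_k=5):
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        return np.argsort(-scores, axis=1)[:, :top_k]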
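
For the label-PCA option, a sketch along the following lines could work. TruncatedSVD stands in for PCA because it accepts a sparse label matrix, and Ridge is an arbitrary choice of multi-output regressor; neither choice, nor the number of components, comes from the original answer:

    from scipy import sparse
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge
    from sklearn.metrics import pairwise_distances_argmin

    n_labels = y_train.shape[1]
    svd = TruncatedSVD(n_components=100)            # reduced label space
    Y_reduced = svd.fit_transform(y_train)          # shape (n_samples, 100)

    reg = Ridge(alpha=1.0).fit(X_train, Y_reduced)  # multi-output regression

    # Each class is encoded as the projection of its one-hot indicator vector.
    class_codes = svd.transform(sparse.identity(n_labels, format='csr'))

    def decode(X):
        Y_pred = reg.predict(X)                     # points in the reduced label space
        # Nearest class encoding (Euclidean distance) for each predicted point.
        return pairwise_distances_argmin(Y_pred, class_codes)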

You can also have a look at the Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification implemented in lightning, but I am not sure it is suitable for a label cardinality of 43K either.

Licensed under: CC-BY-SA with attribution