Question

Problem Statement

I am using a document of 1,600,000 lines and ~66k features. I am using the bag-of-words approach to build a decision tree. The following code works fine for a 1,000-line document, but it throws a MemoryError for the actual 1,600,000-line document. My server has 64 GB of RAM.

Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or is there an option to reduce the default dtype of float64? Kindly help me with this.

Code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree

# corpus: list of raw documents; corpus2: corresponding class labels
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(corpus)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(), corpus2)

Error:

Traceback (most recent call last):
  File "test123.py", line 103, in <module>
    clf = clf.fit(X_train.todense(),corpus2)
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
    return np.asmatrix(self.toarray())
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
    return self.tocoo(copy=False).toarray()
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
    B = np.zeros(self.shape, dtype=self.dtype)
MemoryError

In short, is there any method to use a classification tree on a large data set with 66k features?


Solution

Add dtype=np.float32, e.g. vec = TfidfVectorizer(..., dtype=np.float32), so the TF-IDF matrix is built with 32-bit floats instead of the default float64.
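A minimal sketch of that suggestion, reusing the corpus variable from the question:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the TF-IDF matrix with 32-bit floats instead of the default float64,
# roughly halving the memory required if the matrix is later densified.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english', dtype=np.float32)
X_train = vectorizer.fit_transform(corpus)  # sparse CSR matrix of float32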

As for sparse vs. dense, I have a similar problem: GradientBoostingClassifier, RandomForestClassifier, and DecisionTreeClassifier need dense data, so for that reason I use SVC, which accepts sparse matrices directly.
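A minimal sketch of that approach, reusing X_train and the labels corpus2 from the question (the linear kernel here is an assumption; LinearSVC is usually faster for large, high-dimensional text data):

from sklearn.svm import SVC

# SVC accepts scipy sparse matrices, so no .todense()/.toarray() call is needed
# and the full dense matrix never has to fit in memory.
clf = SVC(kernel='linear')
clf.fit(X_train, corpus2)  # X_train stays sparse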

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow