Problem

Just out of curiosity, is it generally a good idea to reduce the dimensionality of a training set before using it to train an SVM classifier?

I have a collection of documents, each of which is represented by a vector of tf-idf weights calculated by scikit-learn's tfidf_transformer. The number of terms (features?) is close to 60k, and with my training set of about 2.5 million documents, training goes on forever.

Besides taking forever to train, the classification also wasn't accurate, most probably due to a wrong model. Just to get an idea of what I am dealing with, I tried to find a way to visualize the data, and decomposed the document matrix into an (m, 2) matrix using SVD with scikit-learn (I wanted to try other methods, but they all crashed halfway).
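
For reference, a rough sketch of that kind of 2-D SVD projection in scikit-learn; here X (the sparse tf-idf matrix) and y (the class labels) are hypothetical names standing in for my data:

    # Sketch: project the tf-idf matrix to 2 dimensions with truncated SVD and plot it.
    # `X` (tf-idf matrix) and `y` (labels) are hypothetical names.
    from sklearn.decomposition import TruncatedSVD
    import matplotlib.pyplot as plt

    svd = TruncatedSVD(n_components=2, random_state=42)
    X_2d = svd.fit_transform(X)  # shape (m, 2)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=1, alpha=0.3)
    plt.xlabel("SVD component 1")
    plt.ylabel("SVD component 2")
    plt.show()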

So this is what the visualization looks like:

[Figure: result of SVD]

So is it generally good practice to reduce the dimensionality first and only then proceed with the SVM? Also, in this case, what can I do to improve the accuracy of the classifier? I am trying to use sklearn.svm.SVC with kernel='poly' and degree=3, and it is taking a very long time to complete.

Solution

I'd recommend spending more time thinking about feature selection and representation for your SVM than worrying about the number of dimensions in your model. Generally speaking, SVM tends to be very robust to uninformative features (e.g., see Joachims, 1997, or Joachims, 1999 for a nice overview). In my experience, SVM doesn't often benefit as much from spending time on feature selection as do other algorithms, such as Naïve Bayes. The best gains I've seen with SVM tend to come from trying to encode your own expert knowledge about the classification domain in a way that is computationally accessible.

Say for example that you're classifying publications on whether they contain information on protein-protein interactions. Something that is lost in the bag-of-words and tf-idf vectorization approaches is the concept of proximity: two protein-related words occurring close to each other in a document are more likely to be found in documents dealing with protein-protein interaction. This can sometimes be achieved using $n$-gram modeling, but there are better alternatives that you'll only be able to use if you think about the characteristics of the types of documents you're trying to identify.
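
As an illustration of the $n$-gram idea, here is a small sketch in scikit-learn; docs (a list of raw document strings) is a hypothetical name:

    # Sketch: capture some word proximity with unigram + bigram tf-idf features.
    # `docs` (list of raw document strings) is a hypothetical name.
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
    X_ngrams = vectorizer.fit_transform(docs)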

If you still want to try doing feature selection, I'd recommend $\chi^{2}$ (chi-squared) feature selection. To do this, you rank your features with respect to the objective

\begin{equation} \chi^{2}(\textbf{D},t,c) = \sum_{e_{t}\in\{0,1\}}\sum_{e_{c}\in\{0,1\}}\frac{(N_{e_{t}e_{c}}-E_{e_{t}e_{c}})^{2}}{E_{e_{t}e_{c}}}, \end{equation} where $N$ is the observed frequency of a term in $\textbf{D}$, $E$ is its expected frequency, and $t$ and $c$ denote term and class, respectively. You can easily compute this in sklearn, unless you want the educational experience of coding it yourself $\ddot\smile$
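
For example, a minimal sketch with scikit-learn's chi-squared scorer, assuming X is the (non-negative) tf-idf matrix and y the class labels (hypothetical names, and k is arbitrary):

    # Sketch: rank tf-idf features by chi-squared score and keep the top k.
    # `X` (tf-idf matrix) and `y` (labels) are hypothetical names.
    from sklearn.feature_selection import SelectKBest, chi2

    selector = SelectKBest(score_func=chi2, k=5000)
    X_selected = selector.fit_transform(X, y)

    # Per-term chi-squared scores, useful for inspecting the top-ranked terms
    scores = selector.scores_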

Other tips

While performing PCA on your tf-idf vectors, stemming, or eliminating infrequent words might reduce the dimensionality significantly, you might also want to try topic modeling.

While your initial problem of creating topics and assigning topics to documents will remain in high-dimensional space, the supervised portion will be in topic space.

There are plenty of fast implementations of LDA and similar topic models out there.
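
For example, a rough sketch with scikit-learn's LatentDirichletAllocation, assuming X_counts is a raw term-count matrix (LDA expects counts rather than tf-idf) and y the labels (hypothetical names):

    # Sketch: map documents from ~60k terms to a 100-dimensional topic space,
    # then run the supervised step there. `X_counts` and `y` are hypothetical names.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import SVC

    lda = LatentDirichletAllocation(n_components=100, learning_method="online",
                                    random_state=42)
    X_topics = lda.fit_transform(X_counts)  # shape (m, 100)

    clf = SVC(kernel="poly", degree=3)  # the SVM from the question, now in topic space
    clf.fit(X_topics, y)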

Training an SVM on so many vectors will take a very long time and a lot of memory ($O(n^3)$ time and $O(n^2)$ space, where $n$ is the number of training vectors). You could use an SVM library with GPU speed-up; that might help a little. As mentioned in earlier answers, the number of features doesn't matter much for SVMs.

What is your C value? For some values of C, training takes much longer. You could try tuning it with grid search on a small subset of the vectors, as sketched below.
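
A hedged sketch of that idea, again assuming X and y are the tf-idf matrix and labels (hypothetical names) and an arbitrary subsample size:

    # Sketch: tune C by grid search on a small stratified subsample so each fit stays cheap.
    # `X` (tf-idf matrix) and `y` (labels) are hypothetical names.
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X_small, _, y_small, _ = train_test_split(X, y, train_size=10_000,
                                              stratify=y, random_state=42)

    param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
    search = GridSearchCV(SVC(kernel="poly", degree=3), param_grid, cv=3, n_jobs=-1)
    search.fit(X_small, y_small)
    print(search.best_params_)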

Neural networks learn well from such huge datasets; coupled with an RBM they should give good results. Or you could try Random Forests, decision trees with bagging, XGBoost, or other ensemble methods. They are surprisingly good and can be very fast (relative to SVM).
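
For instance, a quick sketch of the random-forest option, again with X and y as hypothetical names for the tf-idf matrix and labels:

    # Sketch: a random forest on the tf-idf matrix; it handles sparse input and
    # scales much better with the number of documents than a kernel SVM.
    # `X` and `y` are hypothetical names.
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
    rf.fit(X, y)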

Also, this might be unimportant, but I don't have enough reputation to comment: what sort of visualisation is this, PCA? If this doesn't give you a good idea, you could try MDS, maybe?

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange