Question

Just out of curiosity, is it generally a good idea to reduce the dimensionality of the training set before using it to train an SVM classifier?

I have a collection of documents, each of which is represented by a vector of tf-idf weights computed with scikit-learn's tfidf_transformer. The number of terms (features?) is close to 60k, and with my training set of roughly 2.5 million documents, the training process takes forever.
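Roughly, the vectorization step looks like this (a simplified sketch with placeholder documents, not my real pipeline):

    # Simplified sketch of the vectorization step (placeholder documents).
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    docs = ["first example document", "second example document"]   # ~2.5M documents in reality

    counts = CountVectorizer().fit_transform(docs)    # sparse term-count matrix
    X = TfidfTransformer().fit_transform(counts)      # tf-idf weights; ~60k columns for my corpus
    print(X.shape)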

Besides taking forever to train, the classification was also inaccurate, most likely because of a poor model. Just to get an idea of what I am dealing with, I tried to find a way to visualize the data somehow, and I decomposed the document matrix into an (m, 2) matrix using SVD in scikit-learn (I wanted to try other methods, but they all crashed halfway through).
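The 2-D projection was produced with something along these lines (X being the sparse tf-idf matrix from the snippet above):

    # Project the sparse tf-idf matrix down to two dimensions for plotting.
    from sklearn.decomposition import TruncatedSVD
    import matplotlib.pyplot as plt

    svd = TruncatedSVD(n_components=2)    # works directly on sparse matrices, unlike plain PCA
    X_2d = svd.fit_transform(X)           # shape (m, 2)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=1)
    plt.show()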

So this is what the visualization looks like:

So, is it generally good practice to reduce the dimensionality first and only then move on to the SVM? Even then, what can I do to improve the classifier's accuracy? I am currently using sklearn.svm.SVC with kernel='poly' and degree=3, and it takes a very long time to finish.


Solution

I'd recommend spending more time thinking about feature selection and representation for your SVM than worrying about the number of dimensions in your model. Generally speaking, SVM tends to be very robust to uninformative features (e.g., see Joachims, 1997, or Joachims, 1999 for a nice overview). In my experience, SVM doesn't often benefit as much from spending time on feature selection as do other algorithms, such as Naïve Bayes. The best gains I've seen with SVM tend to come from trying to encode your own expert knowledge about the classification domain in a way that is computationally accessible. Say for example that you're classifying publications on whether they contain information on protein-protein interactions. Something that is lost in the bag of words and tfidf vectorization approaches is the concept of proximity—two protein-related words occurring close to each other in a document are more likely to be found in documents dealing with protein-protein interaction. This can sometimes be achieved using $n$-gram modeling, but there are better alternatives that you'll only be able to use if you think about the characteristics of the types of documents you're trying to identify.
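As a minimal illustration of the $n$-gram route (toy documents; TfidfVectorizer is just one convenient way to do it):

    # Bigram features recover a little of the word-proximity information
    # that a plain bag-of-words representation throws away.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "protein kinase binds the receptor protein",
        "the market report mentions protein shakes",
    ]

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams plus adjacent word pairs
    X = vectorizer.fit_transform(docs)
    print(X.shape)                                     # far more columns than with unigrams alone
    print(sorted(vectorizer.vocabulary_)[:5])          # includes bigrams such as "binds the"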

If you still want to try doing feature selection, I'd recommend $\chi^{2}$ (chi-squared) feature selection. To do this, you rank your features with respect to the objective

\begin{equation} \chi^{2}(\textbf{D},t,c) = \sum_{e_{t}\in\{0,1\}}\sum_{e_{c}\in\{0,1\}}\frac{(N_{e_{t}e_{c}}-E_{e_{t}e_{c}})^{2}}{E_{e_{t}e_{c}}}, \end{equation} where $N_{e_{t}e_{c}}$ is the observed frequency of a term in $\textbf{D}$, $E_{e_{t}e_{c}}$ is its expected frequency, $t$ and $c$ denote term and class, and $e_{t}$ and $e_{c}$ indicate whether a document contains the term and belongs to the class, respectively. You can easily compute this in sklearn, unless you want the educational experience of coding it yourself $\ddot\smile$
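In sklearn that boils down to something like the following (toy data; the value of k is an illustrative choice):

    # Rank terms by the chi-squared statistic and keep only the top k.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["protein binds protein", "stock prices fell",
            "protein interaction assay", "quarterly market report"]
    y = [1, 0, 1, 0]                              # toy labels: protein-related or not

    X = TfidfVectorizer().fit_transform(docs)     # chi2 requires non-negative features; tf-idf is fine

    selector = SelectKBest(chi2, k=3)             # illustrative k; tune it for your corpus
    X_reduced = selector.fit_transform(X, y)
    print(selector.scores_)                       # the per-term chi-squared values used for ranking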

Other tips

While performing PCA on your tf-idf vectors, stemming, or eliminating infrequent words might reduce the dimensionality significantly, you might want to try topic modeling instead.

While your initial problem of creating topics and assigning them to documents will remain in high-dimensional space, the supervised portion will be in topic space.

There are plenty of fast implementations of LDA and similar topic models out there.
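A rough sketch of this two-stage idea with scikit-learn's LatentDirichletAllocation (toy data; the topic count and hyperparameters are illustrative):

    # Unsupervised topic step in the original high-dimensional term space,
    # then a supervised SVM in the much smaller topic space.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import SVC

    docs = ["protein binds protein receptor", "stock market prices fell",
            "protein interaction assay results", "quarterly market report"]
    y = [1, 0, 1, 0]                                        # toy labels

    counts = CountVectorizer().fit_transform(docs)          # LDA expects raw counts, not tf-idf

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topics = lda.fit_transform(counts)                      # shape (n_docs, n_topics)

    clf = SVC(kernel="linear").fit(topics, y)               # the SVM now sees only n_topics features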

Training an SVM on so many vectors will take a very long time and a lot of memory (roughly $O(n^{3})$ time and $O(n^{2})$ space, where $n$ is the number of training vectors). You could use an SVM library with GPU speed-up, which might help a little. As mentioned in the earlier posts, feature selection doesn't matter much for SVMs.

What is your C value? For some values of C, training takes much longer. You could try tuning it with grid search on a small subsample of your vectors, as sketched below.
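Something like this, with your real X and y swapped in for the random stand-ins (the C grid and subsample size are arbitrary):

    # Tune C on a small random subsample before committing to the full 2.5M documents.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.rand(500, 50)                            # stand-in for the tf-idf matrix
    y = rng.randint(0, 2, size=500)                  # stand-in labels

    subset = rng.choice(len(y), size=200, replace=False)

    grid = GridSearchCV(
        SVC(kernel="poly", degree=3),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},   # widen or refine as needed
        cv=3,
        n_jobs=-1,
    )
    grid.fit(X[subset], y[subset])
    print(grid.best_params_)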

Neural networks learn well from such huge datasets, and coupled with RBM pre-training they should give good results. Or you could try Random Forests/Decision Trees with Bagging, XGBoost, or other ensemble methods; they are surprisingly good and can be very fast (relative to an SVM).
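For instance, a quick random-forest baseline (random stand-in data; hyperparameters are illustrative):

    # Ensemble baseline that trains quickly and parallelises across cores.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 50)                        # stand-in for the tf-idf features
    y = rng.randint(0, 2, size=1000)              # stand-in labels

    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))                        # training accuracy, just as a sanity check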

Also, this might be unimportant, but I don't have enough reputation to comment: what sort of visualisation is this, PCA? If it doesn't give you a good picture, you could try MDS, maybe?

Licensed under: CC-BY-SA with attribution