Question

I used 4 classifiers for my text data: NB, kNN, DT and SVM. For NB and kNN I fully understand how they work with text: how we can compute probabilities for all the words in NB, and how to use similarity metrics with TF-IDF vectors in kNN. But I don't understand at all how a decision tree or a support vector machine works with text data. I implemented all the algorithms in Python, so all I need is some resource or explanation of how the other two classifiers work with text...

I understand DT with non-text data: it seems logical, for example, nodes that check whether some value is greater or less than some number. But with text I get confused. Does it operate on the text itself or on numerical vectors? The same applies to SVM...


Solution

Just like NB and kNN, the DT and SVM algorithms work with whatever features are provided as input. So whenever ML is applied to text, it's important to understand how the unstructured text is transformed into structured data, i.e. how text instances are represented with features.
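To make this concrete, here is a minimal sketch of that transformation, assuming simple whitespace tokenization and lowercasing (the document names and vocabulary are illustrative, not from the question):

```python
# Turn raw documents into boolean feature vectors over the vocabulary.
docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
]

# Vocabulary: one feature (dimension) per distinct word, in a fixed order.
vocab = sorted({word for doc in docs for word in doc.lower().split()})

def boolean_vector(doc, vocab):
    """Cell i is 1 if word i of the vocabulary occurs in the document, else 0."""
    words = set(doc.lower().split())
    return [1 if w in words else 0 for w in vocab]

# Every document becomes a numeric vector of the same length as the vocabulary;
# this is what any classifier (NB, kNN, DT, SVM) actually receives as input.
vectors = [boolean_vector(d, vocab) for d in docs]
```

From the classifier's point of view there is no text anymore, only fixed-length numeric vectors.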

There are many options, but traditionally a document is represented as a vector over the full vocabulary. A very simple version of this is a boolean vector: a cell $v_i$ contains 1 if the word $w_i$ occurs in the document and 0 otherwise. The DT training will generate the tree the usual way, so in this case the conditions at the nodes will be $v_i = 1$, representing whether the word $w_i$ is present or not. If the values in the vector are, say, TF-IDF weights, the conditions might look like $v_i > 3.5$ for instance. Similarly for SVM: the algorithm will find the optimal way to separate the instances in a multi-dimensional space; each dimension actually represents a single word, but the algorithm itself doesn't know (and doesn't care) about that.
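Since everything was implemented in Python, this can be sketched with scikit-learn (assuming it is installed; the toy documents and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

docs = [
    "free prize money now",
    "meeting agenda for monday",
    "win free money today",
    "project meeting on monday",
]
labels = ["spam", "ham", "spam", "ham"]

# Documents -> TF-IDF vectors over the vocabulary; each column is one word.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# The DT learns node conditions of the form v_i > threshold on these columns.
tree = DecisionTreeClassifier().fit(X, labels)

# The SVM finds a separating hyperplane in the same word-dimensional space.
svm = LinearSVC().fit(X, labels)
```

Neither classifier ever sees the text itself; both operate purely on the TF-IDF matrix `X`, exactly as they would on any other numeric feature matrix.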

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange