Question

Scikit-learn has fairly user-friendly python modules for machine learning.

I am trying to train an SVM tagger for Natural Language Processing (NLP), where my labels and input data are words and annotations, e.g. Part-of-Speech (POS) tagging. Rather than using double/integer data as input tuples, like [[1,2], [2,0]], my tuples will look like this: [['word','NOUN'], ['young','adjective']].

Can anyone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html


Solution

Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between a pair of samples means that the two samples are similar in a way that is relevant to the problem at hand.

It is the responsibility of the machine learning practitioner to find a good set of float features to encode. This encoding is domain specific, hence there is no general way to build that representation out of the raw data that would work across all application domains (various NLP tasks, computer vision, transaction log analysis...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often referred to as feature engineering.

Now for your specific problem, POS tags of a window of words around a word of interest in a sentence (e.g. for sequence tagging such as named entity detection) can be encoded appropriately by using the DictVectorizer feature extraction helper class of scikit-learn.

OTHER TIPS

This is not so much a scikit or python question, but more of a general issue with SVMs.

Data instances in SVMs must be represented as vectors of scalars, typically real numbers. Categorical attributes must therefore first be mapped to numeric values before they can be included in SVMs.

Some categorical attributes lend themselves more naturally/logically to being mapped onto some scale (some loose "metric"). For example, a (1, 2, 3, 5) mapping for a Priority field with values of ('no rush', 'standard delivery', 'Urgent' and 'Most Urgent') may make sense. Another example is colors, which can be mapped to three dimensions, one each for their Red, Green, and Blue components.

Other attributes don't have semantics that allow even an approximate logical mapping onto a scale; the various values for these attributes must then be assigned an arbitrary numeric value on one (or possibly several) dimension(s) of the SVM. Understandably, if an SVM has many of these arbitrary "non-metric" dimensions, it can be less effective at properly classifying items, because the distance computations and clustering logic implicit in the working of SVMs are less semantically related.

This observation doesn't mean that SVMs cannot be used at all when the items include non-numeric or non-"metric" dimensions, but it is certainly a reminder that feature selection and feature mapping are very sensitive parameters of classifiers in general and SVMs in particular.

In the particular case of POS-tagging... I'm afraid I'm stumped at the moment as to which attributes of the labelled corpus to use and how to map them to numeric values. I know that SVMTool can produce very efficient POS-taggers using SVMs, and several scholarly papers also describe taggers based on SVMs. However, I'm more familiar with other approaches to tagging (e.g. with HMMs or Maximum Entropy).
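Putting the pieces together, one plausible end-to-end shape for such a tagger (in the spirit of SVMTool, but not its actual feature set) is a DictVectorizer feeding a linear SVM, trained on per-word window features. The feature template, sentinel tokens, and the tiny corpus below are all assumptions for illustration only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def word_features(sentence, i):
    """Illustrative window features for the word at position i."""
    word = sentence[i]
    return {
        'word': word,
        'suffix2': word[-2:],  # crude morphological hint
        'prev_word': sentence[i - 1] if i > 0 else '<START>',
        'next_word': sentence[i + 1] if i < len(sentence) - 1 else '<END>',
    }

# Tiny made-up training corpus of (words, tags) pairs -- demo data only.
corpus = [
    (['the', 'young', 'dog', 'barks'], ['DET', 'ADJ', 'NOUN', 'VERB']),
    (['a', 'loud', 'cat', 'sleeps'], ['DET', 'ADJ', 'NOUN', 'VERB']),
]

X = [word_features(sent, i) for sent, _ in corpus for i in range(len(sent))]
y = [tag for _, tags in corpus for tag in tags]

# DictVectorizer turns the string features into floats; LinearSVC classifies.
tagger = make_pipeline(DictVectorizer(), LinearSVC())
tagger.fit(X, y)

test_sent = ['the', 'loud', 'dog', 'barks']
pred = tagger.predict([word_features(test_sent, i)
                       for i in range(len(test_sent))])
```

Note this tags each word independently; real SVM taggers typically also feed previously predicted tags back in as features and decode left to right.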

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow