Question

I am looking for algorithms to classify words in a paragraph of text. I am particularly interested in classifying whether a given word is a noun, verb, etc., but I am also interested in word-classification algorithms more generally. My input is small (e.g. 200 words), so there are not many options for machine learning, but I would still appreciate algorithms that involve machine learning.


Solution

As D.W. said, the task of classifying words as noun, verb, etc. is called part-of-speech tagging.


A machine learning model typically used for this purpose is the log-linear model (a.k.a. maximum-entropy, MaxEnt for short). The idea is actually pretty straightforward:

  • Assume you read the sentence sequentially (left to right, or right to left)
  • Let $h$ be the history: the 2 tags you have assigned to the 2 previous words in the sentence, together with the list of all the words in your sentence.
  • Let $t$ be the tag you assign to the current word.
  • You define many features $f_i(h, t)$, typically indicator functions such as $f_i(h, t)=1$ if the current word ends with 'ing', $f_i(h, t)=0$ otherwise.
  • $v$ is a parameter vector with one element per feature $f_i(h, t)$; it is used to weight the features.
  • You define $p(t_i|t_{i-1},t_{i-2},w_1,\dots,w_n) = \frac{\exp\left(\sum_{j} v_j f_j(h, t_i)\right)}{\sum_{t'} \exp\left(\sum_{j} v_j f_j(h, t')\right)}$ (this is multinomial logistic regression, a.k.a. the maximum-entropy (MaxEnt) classifier, multiclass logistic regression, polytomous logistic regression, or softmax regression).
  • Use a Markov assumption: $p(t_1,\dots,t_n|w_1,\dots,w_n) = \prod_{i=1}^{n} p(t_i|t_{i-1},t_{i-2},w_1,\dots,w_n)$ (this is the probability that your sentence $w_1,\dots,w_n$ has tags $t_1,\dots,t_n$).
  • You find the $t_1,\dots,t_n$ that maximize $\prod_{i=1}^{n} p(t_i|t_{i-1},t_{i-2},w_1,\dots,w_n)$ using the Viterbi algorithm: that is the final tagging of your sentence.
  • You train the model ($v$) so that it maximizes the likelihood of your training set (adding some regularization to improve generalization).
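The steps above can be sketched in code. In this minimal illustration (not code from the course) the tagset, features, and words are all invented, the weights $v$ are hand-picked rather than learned, and the history is simplified to the single previous tag (a bigram model) so the Viterbi recursion stays short:

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

def features(prev_tag, word, tag):
    """Indicator features f(h, t). The history h is simplified here to
    the previous tag plus the current word (bigram rather than the
    trigram history described above). All features are invented."""
    feats = []
    if word.endswith("s") and tag == "NOUN":
        feats.append("suffix_s+NOUN")
    if word in ("the", "a") and tag == "DET":
        feats.append("article+DET")
    if prev_tag == "DET" and tag == "NOUN":
        feats.append("DET->NOUN")
    if prev_tag == "NOUN" and tag == "VERB":
        feats.append("NOUN->VERB")
    return feats

# Hand-picked weights v; in practice these are learned by maximizing
# the regularized log-likelihood of a tagged training corpus.
V = {"suffix_s+NOUN": 1.5, "article+DET": 2.0,
     "DET->NOUN": 1.0, "NOUN->VERB": 1.0}

def p_tag(prev_tag, word):
    """Softmax p(t | h) = exp(v.f(h,t)) / sum over t' of exp(v.f(h,t'))."""
    scores = {t: math.exp(sum(V.get(f, 0.0)
                              for f in features(prev_tag, word, t)))
              for t in TAGS}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def viterbi(words):
    """Find the tag sequence maximizing prod_i p(t_i | t_{i-1}, words)."""
    # pi[t] = (best log-probability of a sequence ending in tag t, that sequence)
    pi = {t: (math.log(p), [t]) for t, p in p_tag("<s>", words[0]).items()}
    for word in words[1:]:
        pi = {t: max((lp + math.log(p_tag(prev, word)[t]), path + [t])
                     for prev, (lp, path) in pi.items())
              for t in TAGS}
    return max(pi.values())[1]

print(viterbi(["the", "dogs", "bark"]))  # → ['DET', 'NOUN', 'VERB']
```

With learned weights and the full trigram history, the dynamic program runs over pairs of previous tags instead of single tags, but the structure is the same.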

Most of the work improving on this main idea is about:

  • Finding better features
  • Doing some reranking (e.g. with Global Linear Models, which let you define global features that take the entire tagging history as input, rather than just the last 2 tags as above, and whose feature weights are trained with a gradient-descent-like algorithm)

As usual, the more data, the better.

For more details see Coursera's Natural Language Processing by Michael Collins (Coursera's Natural Language Processing by Dan Jurafsky and Christopher Manning is great too).

Note that the task of locating and classifying elements in text into pre-defined categories (names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.) is called named-entity recognition (NER).
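Purely to illustrate those categories (real NER systems use trained sequence models, not rules; everything below is an invented heuristic), a naive sketch might look like:

```python
import re

def naive_ner(text):
    """Very rough sketch: runs of capitalized words away from the start
    of the text are guessed to be entities, and $-amounts / %-figures
    are tagged MONEY / PERCENT. Real NER systems learn these decisions
    from annotated data instead of hard-coding them."""
    entities = []
    for m in re.finditer(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+", text):
        if m.start() != 0:  # skip the sentence-initial capital
            entities.append((m.group(), "ENTITY"))
    for m in re.finditer(r"\$\d+(?:\.\d+)?", text):
        entities.append((m.group(), "MONEY"))
    for m in re.finditer(r"\d+(?:\.\d+)?%", text):
        entities.append((m.group(), "PERCENT"))
    return entities

print(naive_ner("She paid $20 to Acme Corp, a 5% fee."))
# → [('Acme Corp', 'ENTITY'), ('$20', 'MONEY'), ('5%', 'PERCENT')]
```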

OTHER TIPS

This is a standard problem in Natural Language Processing (NLP).

Classifying whether each word is a noun, verb, etc., is known as "part-of-speech tagging". Searching on that term should turn up a lot of information about how to do that.

If you want to tag the words in some other way, you might be able to use other tagging algorithms. They generally use some sort of machine learning approach and will probably require a large training set (200 words won't be enough).
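To make the point about training-set size concrete, here is a hypothetical most-frequent-tag baseline (the corpus, words, and tags are all invented): it can only tag words it has seen in training, so with ~200 training words almost every test word falls back to the default guess.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Count (word, tag) pairs and keep each word's majority tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag known words by lookup; unknown words get the default tag.
    Coverage of this lookup is exactly what a ~200-word training set
    cannot provide."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Toy hand-tagged data, invented for illustration.
corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
model = train_baseline(corpus)
print(tag(model, ["the", "cat", "barks"]))
# → [('the', 'DET'), ('cat', 'NOUN'), ('barks', 'VERB')]
```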

Sadly, the English language has denied even the most brilliant scientists a way to answer this question. I have been working on geometric representations of the English language for memory modelling, and I have (perhaps somewhat prematurely) come to the conclusion that the unbalanced rules of the English language allow it to mutate continuously, often into contradictory and logically infeasible states.

Still, the game is great to play, and I can see no reason why, given the right direction, a computer with an imaginable amount of muscle could not apply learning algorithms to properly understand the nature of the English language.

...Quantum mechanics does provide some interesting leads, though :-)

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange