Question

I have over 15,000 text documents on a single topic. I would like to build a language model from them so that, when I present new random text documents of various topics to the model, it tells me whether each new document is on the same topic.

I tried out sklearn.naive_bayes.MultinomialNB, sklearn.svm.LinearSVC and others, but I have the following problem:

These algorithms require training data with more than one label or class, and I only have web pages covering a single topic. The other documents are unlabeled and span many different topics.

I would appreciate any guidance on how to train a model with only one label, or on how to proceed in general. What I have so far is:

from sklearn.naive_bayes import MultinomialNB

c = MultinomialNB()
c.fit(X_train, y_train)
c.predict(X_test)

Thank you very much.


Solution

What you're looking for is the one-class SVM, available in scikit-learn as sklearn.svm.OneClassSVM. For more information, check the corresponding scikit-learn documentation.
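For example, here is a minimal sketch of how this could look for your text documents. The TF-IDF features, the toy documents, and the nu/gamma settings are illustrative choices of mine, not part of the original answer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-ins for the ~15,000 on-topic documents.
topic_docs = [
    "The team dominated the game",
    "The goalkeeper caught the ball",
    "The match went to extra time",
]
new_docs = [
    "They had the ball for the whole game",
    "The President did not comment",
]

# Vectorize using only the on-topic corpus.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(topic_docs)

# nu roughly bounds the fraction of training docs treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
clf.fit(X_train)

# predict() returns +1 for "same topic" (inlier) and -1 for anything else.
print(clf.predict(vectorizer.transform(new_docs)))

In practice you would tune nu (and gamma) on held-out on-topic documents, since only inlier data is available for validation.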

OTHER TIPS

There is another classifier available in the TextBlob library called PositiveNaiveBayesClassifier. To quote from its documentation:

A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets, i.e. when only one class is labeled and the other is not. Assuming a prior distribution on the two labels, uses the unlabeled set to estimate the frequencies of the features.

Code Usage:

>>> from textblob.classifiers import PositiveNaiveBayesClassifier
>>> sports_sentences = ['The team dominated the game',
                        'They lost the ball',
                        'The game was intense',
                        'The goalkeeper caught the ball',
                        'The other team controlled the ball']
>>> various_sentences = ['The President did not comment',
                         'I lost the keys',
                         'The team won the game',
                         'Sara has two kids',
                         'The ball went off the court',
                         'They had the ball for the whole game',
                         'The show is over']
>>> classifier = PositiveNaiveBayesClassifier(positive_set=sports_sentences,
                                unlabeled_set=various_sentences)
>>> classifier.classify("My team lost the game")
True
>>> classifier.classify("And now for something completely different.")
False

One-class classification (OCC) problems are closely related to anomaly/novelty detection. In these problems, only the positive class is available, and its distribution is generally non-Gaussian.

The main motivation for OCC is the lack of data with which to define a second class; when such data is available, training an ordinary discriminative model one-vs-rest generally gives better results on these tasks.

Popular approaches are SVM-based, such as the one-class SVM, which generally has a rigid boundary geometry (a circumscribing hypersphere); for a more flexible boundary (with a kernel that is not translation-invariant) there is support vector data description (SVDD).

So the one-class SVM is a special case of SVDD in which K(x, x) = const, as holds for translation-invariant kernels such as the RBF kernel.
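To make the connection concrete, here is a sketch of the two standard primal problems (SVDD following Tax & Duin, the one-class SVM following Schölkopf et al.; the notation is mine, not the original answer's):

\begin{aligned}
\text{SVDD:} \quad & \min_{R,\,a,\,\xi}\; R^2 + C \sum_i \xi_i
  \quad \text{s.t.}\;\; \|\phi(x_i) - a\|^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0 \\
\text{One-class SVM:} \quad & \min_{w,\,\rho,\,\xi}\; \tfrac{1}{2}\|w\|^2 + \tfrac{1}{\nu n} \sum_i \xi_i - \rho
  \quad \text{s.t.}\;\; \langle w, \phi(x_i) \rangle \ge \rho - \xi_i,\;\; \xi_i \ge 0
\end{aligned}

Expanding the SVDD constraint gives \|\phi(x_i)\|^2 - 2\langle a, \phi(x_i)\rangle + \|a\|^2 \le R^2 + \xi_i; when K(x, x) = \|\phi(x)\|^2 is constant, that first term is the same for every point, so the constraint reduces to a hyperplane condition of the one-class-SVM form and the two problems yield equivalent solutions.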


Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow