Best library for automatic document classification [closed]

https://stackoverflow.com/questions/16605931

29-05-2022

Question

The problem: we have a bunch of documents (magazine articles) that need to be put into "categories". Some categories reflect the subject of the article (what the article is about) and some other categories reflect the "nature" of the article (where it would be likely to appear if the magazine were printed on paper).

We're currently addressing the problem manually by sending the articles offshore and having people read and tag them.

We'd like to automate the process more. I've looked at various libraries but they don't seem designed to solve this problem.

Carrot² does clustering of search results but it's not clear, without diving in further, if it can work with existing (fixed) categories or if it infers categories directly from each input.

NLTK is a general-purpose toolkit that does many things, but it doesn't have a reputation for speed or accuracy. It may be my best bet, though?

Ideally I would like to find a solution that, given a list of categories and a training set of categorized documents, can suggest a category for each new document along with its confidence in that suggestion.

If this doesn't exist ready-made, I can try to write something based on NLTK's NaiveBayesClassifier, but what are the other options?

Solution

For this supervised classification task I would use the Stanford Classifier. It bundles everything from feature extraction (considerably more sophisticated than bag of words) to the machine learning itself (a maximum entropy model). It works pretty well if you have enough training data (i.e. manually labelled articles).

The only limitation is that it assigns just one class per article. But since your two "dimensions" (the topic of the article and the kind of article) seem to be reasonably orthogonal, nothing prevents you from treating them as two separate classification problems, each with its own model trained on the same articles.
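To make the "two separate problems" idea concrete, here is a minimal sketch in plain Python. It is not the Stanford Classifier and does not use its API: it swaps in a tiny bag-of-words naive Bayes model purely as a stand-in, and the training sentences, the label names and the `classify` helper are all made up for illustration. The point is only the shape of the setup: one model per dimension, each returning a label plus a confidence.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase bag-of-words tokens; a deliberately crude feature extractor."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        """Return (best_label, posterior probability of that label)."""
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.label_counts.values())
        log_scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            log_scores[label] = score
        # Normalise log scores into probabilities (log-sum-exp for stability).
        top = max(log_scores.values())
        norm = sum(math.exp(s - top) for s in log_scores.values())
        best = max(log_scores, key=log_scores.get)
        return best, math.exp(log_scores[best] - top) / norm


# Hypothetical toy training set: (text, topic, kind).
TRAINING = [
    ("The striker scored twice in the cup final", "sports", "news"),
    ("Why the league should change its transfer rules", "sports", "opinion"),
    ("Parliament passed the budget bill on Tuesday", "politics", "news"),
    ("The minister's tax plan is a mistake, in my view", "politics", "opinion"),
]

texts = [text for text, _, _ in TRAINING]
# Same articles, two independent label sets, two independent models.
topic_model = NaiveBayes().fit(texts, [topic for _, topic, _ in TRAINING])
kind_model = NaiveBayes().fit(texts, [kind for _, _, kind in TRAINING])


def classify(text):
    """Suggest one label per dimension, each with its own confidence."""
    return {"topic": topic_model.predict(text), "kind": kind_model.predict(text)}
```

Calling `classify("The tax plan is a mistake")` returns a dict with a `(label, confidence)` pair for each dimension. With a real corpus you would plug in whichever classifier you settle on; naive Bayes confidences in particular tend to be overconfident, so treat the number as a ranking signal (for example, to route low-confidence articles to a human) rather than a calibrated probability.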

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow