Document Categorization Problem

https://datascience.stackexchange.com/questions/10880

16-10-2019

Question

I'm very new to data science in general, and have been tasked with a big challenge.

My organization has a lot of documents that are all sorted by document type (not binary format, but a subjectively assigned type based on content, e.g. "Contract", "Receipt", "Statement", etc.).

Generally speaking assignment of these types is done upon receipt of the documents, and is not a challenge, though we would like to remove the human element of this categorization. Similarly, there are times when there are special attributes that we would like to identify, like "Statement showing use." Thus far, this is entirely done by human intervention.

I am a Python programmer, and have been looking at tools to extract the text from these docs (all PDFs, all OCR'ed and searchable) and run analysis. Research has led me to look at standard libraries like NLTK, scikit-learn and gensim. But I'm struggling to identify the best methodology for categorizing newly received documents.

My research is leading me down a few paths. One is to build a TF-IDF vector model from a sample of the current corpora, then build the same representation for an incoming document and run a naive Bayes analysis against the existing models to determine, by highest probability, which category the incoming document belongs to. Question 1: is this right? If so, question 2 becomes: what is the right programmatic methodology for accomplishing this?

The reason I bring this up at all is that most tutorials I find seem to lean toward a binary discernment of text corpora (positive vs negative, spam vs ham). I did see that scikit-learn has information on multi-label classification, but I'm not sure I'm going down the right road with it. The word "classification" seems to have a different meaning in document analysis than what I would want it to mean.

If this question is too vague let me know and I can edit it to be more specific.

Solution

Except for the OCR part, the right bundle would be pandas and sklearn.

You can check this IPython notebook, which uses TfidfVectorizer and the SVC classifier.

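This classifier supports multiclass prediction (it trains one-vs-one classifiers internally and can expose a one-vs-the-rest style decision function), and if you create it with probability=True you can call the predict_proba method instead of predict to get a confidence level for each category.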
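As a rough illustration of the TfidfVectorizer plus SVC approach, here is a minimal sketch. The toy corpus, the three labels and the variable names are made up for the example and stand in for your own extracted PDF text; the linear kernel is just one reasonable choice for text. Note that scikit-learn's SVC only offers predict_proba when it is constructed with probability=True.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: in practice this would be the extracted text of
# documents that humans have already categorized.
train_texts = [
    "this agreement is entered into by the parties and sets out the terms",
    "the contractor agrees to the terms and obligations of this agreement",
    "either party may terminate this agreement with written notice",
    "the parties agree that this contract is governed by the following terms",
    "receipt for payment received, total amount paid in cash, thank you",
    "thank you for your purchase, amount paid by card, keep this receipt",
    "payment received, receipt number and total paid are shown",
    "store receipt, items purchased, total amount paid, change due",
    "account statement showing opening balance, transactions and closing balance",
    "monthly statement of account activity and current balance",
    "statement period, previous balance, new charges and balance due",
    "your account statement lists each transaction and the ending balance",
]
train_labels = ["Contract"] * 4 + ["Receipt"] * 4 + ["Statement"] * 4

# TF-IDF features feed an SVC; probability=True enables predict_proba.
model = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="linear", probability=True, random_state=0),
)
model.fit(train_texts, train_labels)

# Classify a newly received document (also a made-up example).
new_doc = ["the parties agree to the terms of this agreement"]
predicted = model.predict(new_doc)[0]

# One probability estimate per category, keyed by label.
probabilities = dict(zip(model.classes_, model.predict_proba(new_doc)[0]))

print(predicted)
for label, p in sorted(probabilities.items()):
    print(f"{label}: {p:.2f}")
```

With so few training documents the probabilities are not meaningful; they are only there to show the shape of the output. On a real corpus you would hold out some labeled documents to check how well the model does.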

If you're looking for performance and you don't need prediction confidence levels, you should use LinearSVC, which is much faster.
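The LinearSVC variant could look something like the sketch below. Again, the corpus, labels and names are invented for illustration. LinearSVC has no predict_proba, but decision_function returns one uncalibrated score per category, which can still be used to rank the categories.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled examples standing in for real extracted document text.
train_texts = [
    "this agreement sets out the terms agreed by the parties",
    "either party may terminate this contract with written notice",
    "receipt for payment received, total amount paid, thank you",
    "thank you for your purchase, keep this receipt as proof of payment",
    "account statement showing transactions and closing balance",
    "monthly statement of account activity and balance due",
]
train_labels = [
    "Contract", "Contract",
    "Receipt", "Receipt",
    "Statement", "Statement",
]

# Same TF-IDF features, but with the faster linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)

new_docs = [
    "payment received, total paid, thank you for your purchase",
    "statement of account with opening and closing balance",
]
predictions = model.predict(new_docs)

# Signed distances to each class boundary, shape (n_docs, n_classes).
# These are scores, not probabilities.
scores = model.decision_function(new_docs)

for doc, label, row in zip(new_docs, predictions, scores):
    print(label, dict(zip(model.classes_, row.round(2))))
```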

Sklearn is very well documented and you will find everything you need for text classification.

Licensed under: CC-BY-SA with attribution
