Question

I am building a classifier to categorize documents.

So the first step is to represent each document as a feature vector for training purposes.

After some research, I found that I can use either the bag-of-words approach or the N-gram approach to represent a document as a vector.

The text in each document (scanned PDFs and images) is extracted using OCR, so some words contain errors. I also have no prior knowledge of the language used in these documents (so I can't use stemming).

So, as far as I understand, I have to use the N-gram approach. Or are there other approaches to represent a document?

I would also appreciate it if someone could link me to an N-gram guide so I can get a clearer picture of how it works.

Thanks in advance.


Solution

  1. Use language detection to detect each document's language (my favorite tool is LanguageIdentifier from the Tika project, but many others are available; see the sketch after this list).
  2. Use spell correction (see this question for some details).
  3. Stem words (if you work in a Java environment, Lucene is a good choice; the sketch after this list covers this step too).
  4. Collect all N-grams (see below).
  5. Make instances for classification by extracting N-grams from each document.
  6. Build the classifier.
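
As a minimal sketch of steps 1 and 3 in Java: this assumes Tika 1.x (where LanguageIdentifier lives in org.apache.tika.language; newer Tika versions replace it with LanguageDetector) and a recent Lucene. The class and method names here are illustrative, and EnglishAnalyzer is hard-coded only for the example; in practice you would pick an analyzer matching the language detected in step 1.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.tika.language.LanguageIdentifier;

    public class Preprocessor {

        // Step 1: detect the document's language with Tika.
        // Returns an ISO 639 code such as "en" or "fr".
        static String detectLanguage(String text) {
            return new LanguageIdentifier(text).getLanguage();
        }

        // Step 3: tokenize and stem the text with a Lucene analyzer.
        // EnglishAnalyzer is used here for illustration only.
        static List<String> stemmedTokens(String text) throws IOException {
            List<String> terms = new ArrayList<>();
            try (Analyzer analyzer = new EnglishAnalyzer();
                 TokenStream stream = analyzer.tokenStream("body", text)) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    terms.add(term.toString());
                }
                stream.end();
            }
            return terms;
        }
    }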

N-gram models

N-grams are just sequences of N items. In topic classification you normally use N-grams of words or their roots (though there are also models based on character N-grams). The most popular N-grams are unigrams (single words), bigrams (2 consecutive words) and trigrams (3 consecutive words). So, from the sentence

Hello, my name is Frank

you should get the following unigrams:

[hello, my, name, is, frank] (or [hello, I, name, be, frank], if you use roots)

the following bigrams:

[hello_my, my_name, name_is, is_frank]

and so on.
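
Collecting the N-grams themselves needs no library at all. Here is a minimal Java sketch (the method name ngrams is just illustrative) that joins tokens with '_' as in the examples above:

    import java.util.ArrayList;
    import java.util.List;

    // Slide a window of size n over the token list and join each window
    // with '_' to form one N-gram, e.g. n = 2 gives bigrams like "my_name".
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join("_", tokens.subList(i, i + n)));
        }
        return result;
    }

For example, ngrams(List.of("hello", "my", "name", "is", "frank"), 2) produces exactly the bigram list shown above.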

In the end, your feature vector should have as many positions (dimensions) as there are distinct words in all your texts, plus 1 for unknown words. Every position in an instance vector should somehow reflect the count of the corresponding word in the instance's text. This may be the raw number of occurrences, a binary feature (1 if the word occurs, 0 otherwise), a normalized frequency, or tf-idf (very popular in topic classification).
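
As a rough sketch of how such a vector can be built (the helper names and the particular tf-idf formulation, tf * log(N / df), are illustrative assumptions, not the only option):

    import java.util.List;
    import java.util.Map;

    // Build a raw count vector for one document. 'vocabulary' maps each known
    // N-gram to its position; the extra last position collects unknown words.
    static double[] countVector(List<String> docNgrams, Map<String, Integer> vocabulary) {
        double[] vector = new double[vocabulary.size() + 1];
        for (String gram : docNgrams) {
            vector[vocabulary.getOrDefault(gram, vocabulary.size())] += 1;
        }
        return vector;
    }

    // One common tf-idf weighting: term frequency times inverse document
    // frequency, where docFreq is the number of training documents
    // containing the term and numDocs is the total number of documents.
    static double tfIdf(double termCount, double docLength, double docFreq, double numDocs) {
        return (termCount / docLength) * Math.log(numDocs / docFreq);
    }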

The classification process itself is the same as in any other domain.
