Question

I have a document classification project where I retrieve site content and then assign one of numerous labels to the website according to its content.

I found out that tf-idf could be very useful for this. However, I was unsure as to when exactly to use it.

Assuming that a website concerned with a specific topic makes repeated mention of it, this was my current process (a rough sketch in code follows the list):

  1. Retrieve site content, parse for plain text
  2. Normalize and stem content
  3. Tokenize into unigrams (maybe bigrams too)
  4. Retrieve a count of each unigram for the given document, filtering out very short and rarely occurring words
  5. Train a classifier such as Naive Bayes on the resulting set
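Concretely, my rough mental model of steps 2 through 5 is something like the sketch below (purely illustrative, not my actual code; I am assuming NLTK for stemming/tokenization and scikit-learn's DictVectorizer plus MultinomialNB):

# Rough sketch of steps 2-5 (illustrative only; assumes NLTK and scikit-learn)
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needs the NLTK 'punkt' models
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

stemmer = PorterStemmer()

def unigram_counts(plain_text, min_len=3, min_count=2):
    # Steps 2-3: normalize, stem, tokenize into unigrams
    tokens = [stemmer.stem(tok) for tok in word_tokenize(plain_text.lower())
              if tok.isalpha()]
    # Step 4: count unigrams, dropping very short and rarely occurring words
    counts = Counter(tok for tok in tokens if len(tok) >= min_len)
    return {tok: c for tok, c in counts.items() if c >= min_count}

# docs/labels stand in for the scraped site content and its categories
docs = ["Cooking pasta and cooking sauce at home every day",
        "Python programming and Python libraries for the web"]
labels = ["cooking", "programming"]

# Step 5: turn the count dicts into a matrix and train Naive Bayes
vectorizer = DictVectorizer()
X = vectorizer.fit_transform([unigram_counts(d) for d in docs])
clf = MultinomialNB().fit(X, labels)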

My question is the following: Where would tf-idf fit in here? Before normalizing/stemming? After normalizing but before tokenizing? After tokenizing?

Any insight would be greatly appreciated.


Edit:

Upon closer inspection, I think I may have run into a misunderstanding as to how TF-IDF operates. At step 4 described above, would I have to feed the entirety of my data into TF-IDF at once? If, for example, my data is as follows:

[({tokenized_content_site1}, category_string_site1), 
 ({tokenized_content_site2}, category_string_site2), 
...
 ({tokenized_content_siten}, category_string_siten)]

Here, the outermost structure is a list, containing tuples, containing a dictionary (or hashmap) and a string.

Would I have to feed the entirety of that data into the TF-IDF calculator at once to achieve the desired effect? Specifically, I have been looking at scikit-learn's TfidfVectorizer to do this, but I am a bit unsure about its use, as examples are pretty sparse.


Solution

As you've described it, Step 4 is where you want to use TF-IDF. Essentially, TF-IDF counts each term in each document and then weights that count by how rare the term is across the whole collection of documents, so terms that appear in nearly every document contribute less to the score.
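As a minimal sketch of that step (assuming scikit-learn and toy data; the documents here are already normalized and stemmed, and the names are illustrative), you fit the vectorizer on the whole training collection in one call and then train the classifier on the resulting matrix:

# Minimal sketch: TF-IDF at step 4 with scikit-learn (toy data, illustrative names)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Already normalized/stemmed documents and their manually annotated labels
documents = ["cook pasta cook sauc home", "python program python librari web"]
labels = ["cooking", "programming"]

# Fit on *all* training documents at once: the IDF part of the score depends
# on how many documents in the collection each term appears in
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_train = vectorizer.fit_transform(documents)

clf = MultinomialNB().fit(X_train, labels)

# New, unseen sites are only transformed with the already-fitted vectorizer
X_new = vectorizer.transform(["pasta sauc recip"])
print(clf.predict(X_new))

That also answers your edit: yes, the whole collection goes into a single fit_transform call, because the IDF weights are not meaningful for one document in isolation. If you would rather keep your existing token-count dictionaries, DictVectorizer followed by TfidfTransformer produces the same kind of weighting from the counts you have already computed.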

There's one big step missing from your process, however: annotating a training set. Before you train your classifier, you'll need to manually annotate a sample of your data with the labels you want to be able to apply automatically using the classifier.

To make all of this easier, you might want to consider using the Stanford Classifier. It will perform the feature extraction and build the classifier model (supporting several different machine learning algorithms), but you'll still need to annotate the training data by hand.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange