Calculating IDF (Inverse Document Frequency) for document categorization

https://stackoverflow.com/questions/11947748

26-06-2021

Question

I have a doubt about calculating IDF (Inverse Document Frequency) for document categorization. I have more than one category, each with multiple documents for training. I am calculating the IDF for each term in a document using the following formula:

IDF(t,D) = log(Total number of documents in corpus / Number of documents matching term)

My questions are:

What does "Total number of documents in corpus" mean? Is it the document count of the current category or of all available categories?
What does "Number of documents matching term" mean? Is it the count of documents containing the term in the current category or in all available categories?

Solution

"Total number of documents in corpus" is simply the number of documents you have in your corpus. So if you have 20 documents, this value is 20.

"Number of documents matching term" is the number of documents in which the term t occurs. So if you have 20 documents in total and the term t occurs in 15 of them, this value is 15.

The value for this example would thus be IDF(t,D) = log(20/15) ≈ 0.1249 (using a base-10 logarithm).
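
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `idf`, the toy corpus of token lists, and the base-10 default are assumptions chosen to reproduce the 0.1249 figure, not part of the original answer.

```python
import math

def idf(term, documents, base=10):
    """IDF of `term` over a corpus given as a list of token lists."""
    n_docs = len(documents)
    # Count the documents that contain the term at least once.
    n_matching = sum(1 for doc in documents if term in doc)
    if n_matching == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / n_matching, base)

# Toy corpus: 20 documents, 15 of which contain the term "cat".
corpus = [["cat", "sat"]] * 15 + [["dog", "ran"]] * 5
value = idf("cat", corpus)
print(round(value, 4))  # 0.1249
```

With a natural logarithm (`base=math.e`) the number changes, but the ordering of terms by IDF stays the same.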

Now if I'm correct, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category. Each category-document should hold all texts which are labelled with that category. You can then perform tf*idf on these category-documents.
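
A rough sketch of this category-document idea follows. The training samples, the whitespace tokenizer, the helper names `build_category_documents` and `tf_idf`, and the use of raw counts for tf with a base-10 idf are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (text, labels) pairs.
training = [
    ("the striker scored a goal", ["sports"]),
    ("the election results are in", ["politics"]),
    ("the minister watched the match", ["sports", "politics"]),
    ("the new phone released today", ["tech"]),
]

def build_category_documents(samples):
    """Merge all texts carrying a label into one token list per category."""
    category_docs = defaultdict(list)
    for text, labels in samples:
        tokens = text.lower().split()
        for label in labels:
            category_docs[label].extend(tokens)
    return dict(category_docs)

def tf_idf(category_docs):
    """Return {category: {term: tf * idf}} over the category-documents."""
    n_docs = len(category_docs)
    # Document frequency: in how many category-documents each term occurs.
    df = Counter()
    for tokens in category_docs.values():
        df.update(set(tokens))
    weights = {}
    for category, tokens in category_docs.items():
        tf = Counter(tokens)
        weights[category] = {
            term: count * math.log10(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights

category_docs = build_category_documents(training)
weights = tf_idf(category_docs)
```

Note that a term occurring in every category-document (here "the") gets an idf of 0 and therefore contributes nothing to any category.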

A simple way of categorizing a new document (the query) is then to sum, for each category, the weights of the query's terms as calculated for that category. The category with the highest sum is ranked first.
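
The summing approach could look roughly like this. The per-category weights are made-up numbers standing in for tf*idf values computed beforehand, and `rank_categories` is a hypothetical helper name.

```python
# Hypothetical, precomputed tf*idf weights per category.
category_weights = {
    "sports": {"goal": 0.95, "match": 0.48, "minister": 0.18},
    "politics": {"election": 0.95, "minister": 0.35, "results": 0.48},
}

def rank_categories(query, category_weights):
    """Score each category by summing the weights of the query's terms."""
    tokens = query.lower().split()
    scores = {
        category: sum(weights.get(token, 0.0) for token in tokens)
        for category, weights in category_weights.items()
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_categories("minister comments on election results", category_weights)
print(ranking[0][0])  # politics
```

Terms the category has never seen simply contribute 0 to its score.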

Another possibility is to create a vector for the query using the idf of each term in the query. All terms which don't occur in the query are given the value 0. The query vector can then be compared for similarity with each category vector using, for example, cosine similarity.
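
A minimal sketch of the vector comparison, assuming sparse vectors stored as dictionaries (a missing key is treated as 0). The idf table, the category vectors, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def query_vector(query, idf):
    """Use the idf of each known query term; all other terms are implicitly 0."""
    return {t: idf[t] for t in set(query.lower().split()) if t in idf}

# Hypothetical idf values and category vectors.
idf = {"goal": 0.48, "election": 0.48, "minister": 0.18}
category_vectors = {
    "sports": {"goal": 0.48, "minister": 0.18},
    "politics": {"election": 0.96, "minister": 0.18},
}

q = query_vector("election minister", idf)
best = max(category_vectors, key=lambda c: cosine_similarity(q, category_vectors[c]))
print(best)  # politics
```

Because cosine similarity normalizes by vector length, a category built from many long texts is not favoured merely for having larger weights.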

Smoothing is also a useful technique for dealing with words in a query which don't occur in your corpus, since an unseen term has a document count of 0 and the plain IDF formula would divide by zero.
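
The answer does not say which smoothing scheme to use. As one possible illustration, the sketch below applies add-one smoothing to both counts, so that a term with a document count of 0 still gets a finite IDF. The function name `smoothed_idf` and the toy corpus are assumptions.

```python
import math

def smoothed_idf(term, documents):
    """Add-one smoothed IDF: log10((1 + N) / (1 + df)).

    Adding 1 to both counts acts as if one extra document contained
    every term, which keeps the denominator above zero.
    """
    n_docs = len(documents)
    n_matching = sum(1 for doc in documents if term in doc)
    return math.log10((1 + n_docs) / (1 + n_matching))

# Toy corpus: 20 documents, 15 of which contain "cat".
corpus = [["cat", "sat"]] * 15 + [["dog", "ran"]] * 5

seen = smoothed_idf("cat", corpus)       # log10(21 / 16)
unseen = smoothed_idf("zebra", corpus)   # log10(21 / 1), finite
```

An unseen term ends up with the highest IDF in the corpus, which matches the intuition that rarer terms are more discriminative.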

I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

OTHER TIPS

I have written a small post describing term frequency-inverse document frequency here: http://bigdata.devcodenote.com/2015/04/tf-idf-term-frequency-inverse-document.html

Here is a snippet from the post:

TF-IDF is the most fundamental metric used extensively in classification of documents. Let us try and define these terms:

Term frequency indicates how often a certain word occurs in a document compared to the other words in that document.

Inverse document frequency, on the other hand, indicates how widely the word occurs across all the documents in a given collection (the documents which we want to classify into different categories).

Licensed under: CC-BY-SA with attribution