Question

I'm implementing a naive "keyword extraction algorithm". I'm self-taught, though, so I lack some of the terminology and maths that are common in the online literature.

I'm finding "most relevant keywords" of a document thus:

  1. I count how often each term is used in the current document. Let's call this tf.
  2. I look up how often each of those terms is used in the entire database of documents. Let's call this df.
  3. I calculate a relevance weight r for each term by r = tf / df.

Every document is part of the corpus, so no document contains a term that the corpus doesn't also contain. This means I don't have to worry about division by zero.

I sort all terms by their r and keep however many of the top terms I need. These are the keywords most closely associated with this document. Terms that are common in this document are more important; terms that are common in the entire database of documents are less important.
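
For concreteness, here is a minimal sketch of that procedure in Python. The names (document_terms, corpus_df) are placeholders of my own, not anything prescribed:

    from collections import Counter

    def naive_keywords(document_terms, corpus_df, top_n=10):
        # tf: how often each term appears in this document
        tf = Counter(document_terms)
        # r = tf / df, where corpus_df[term] is the term's count across the whole database
        scores = {term: count / corpus_df[term] for term, count in tf.items()}
        # highest r first
        return sorted(scores, key=scores.get, reverse=True)[:top_n]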

I believe this is a naive form of tf-idf.

The problem is that when terms are very uncommon in the entire database but occur in the current document they seem to have too high an r value.

This can be thought of as some kind of artefact due to small sample size. What is the best way or the usual ways to compensate for this?

  • Throw away terms that are less common in the overall database than a certain threshold (see the sketch after this list). If so, how is that threshold calculated? It seems it would depend on too many factors to be a hard-coded value.
  • Can it be weighted or smoothed by some kind of mathematical function such as inverse square or cosine?
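
To make the first option concrete, here is a sketch of a minimum-df cut-off applied to the same naive ranking; the min_df value of 3 is an arbitrary illustration, not a recommendation:

    from collections import Counter

    def keywords_with_min_df(document_terms, corpus_df, top_n=10, min_df=3):
        tf = Counter(document_terms)
        # Drop terms that occur fewer than min_df times in the whole database
        scores = {term: count / corpus_df[term]
                  for term, count in tf.items()
                  if corpus_df[term] >= min_df}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]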

I've tried searching the web and reading up on tf-idf, but much of what I find deals with comparing documents, which I'm not interested in. And most of it has a low ratio of explanation to jargon and formulae.

(In fact my project is a generalization of this problem. I'm really working with tags on Stack Exchange sites so the total number of terms is small, stopwords are irrelevant, and low-usage tags might be more common than low-usage words in the standard case.)


Solution

I spent a lot of time trying to do targeted Google searches for particular tf-idf information and dug through many documents.

Finally I found a document with a clear and concise explanation accompanied by formulae even I can grok: Document Processing and the Semantic Web, Week 3 Lecture 1: Ranking for Information Retrieval, by Robert Dale of the Department of Computing at Macquarie University:

Page 20: [slide image not reproduced here; it presents the tf-idf weighting formula]

The two things I was missing were taking into account the number of documents in the collection, and using the logarithm of the inverse df rather than the inverse df directly.
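
Put another way (my own restatement, not the slide's wording), the weight becomes tf × log(N / df), where N is the number of documents in the collection and df is, in the standard formulation, the number of documents containing the term. A minimal sketch:

    import math
    from collections import Counter

    def tfidf_keywords(document_terms, doc_freq, num_docs, top_n=10):
        # doc_freq[term]: number of documents in the collection containing the term
        tf = Counter(document_terms)
        # weight = tf * log(N / df); rare-but-present terms no longer blow up linearly
        scores = {term: count * math.log(num_docs / doc_freq[term])
                  for term, count in tf.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]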
