Python Scikit-learn: Empty Vocabulary in TF-IDF

https://stackoverflow.com/questions/16682263

30-05-2022

Question

I am using the code from the most up-voted answer to this question (Similarity between two text documents) to calculate TF-IDF between documents. However, when I run the code without specifying a custom value of min_df (it is 1 in that code), and the two documents are completely different (they have no word in common), I get the following error instead of a TF-IDF value of 0:

ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).

Can somebody tell me how I can get rid of this error?

Solution

By default (in sklearn <= 0.13), min_df is set to 2, which means that each word must occur in at least 2 different documents from the corpus to be included in the vectorizer's vocabulary. While this is a reasonable choice for large corpora, it is too restrictive for anything to be included from a toy dataset with just a couple of sentences, hence the error message you get, which I find pretty explicit. To get rid of the error, pass min_df=1 explicitly. The min_df=2 default has been changed to min_df=1 in the development branch of scikit-learn to be less confusing to new users who try the library with default parameter values on toy datasets.
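
As a minimal sketch of the behaviour described above (the two documents are made-up example data, and the exact wording of the error message may differ between scikit-learn versions), setting min_df explicitly shows both outcomes on a two-document corpus with no shared words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents that share no words at all (hypothetical example data).
docs = ["the quick brown fox", "lazy dogs sleep soundly"]

# min_df=1 keeps every term that appears in at least one document.
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(docs)

# Rows are L2-normalized by default, so the dot product of two rows
# is their cosine similarity.
similarity = (tfidf @ tfidf.T).toarray()
cross_similarity = similarity[0, 1]
print(cross_similarity)

# min_df=2 requires each term to occur in both documents, so no term
# survives and fitting raises a ValueError.
try:
    TfidfVectorizer(min_df=2).fit_transform(docs)
    strict_error = None
except ValueError as exc:
    strict_error = exc
print(type(strict_error).__name__)
```

With min_df=1 the similarity between the two unrelated documents comes out as 0, which is the result the question expected, while min_df=2 leaves the vocabulary empty and fails.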

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow