By default (in sklearn <= 0.13), min_df is set to min_df=2, which means that each word must occur in at least 2 different documents from the corpus to be included in the vectorizer's vocabulary. While this is a reasonable choice for large corpora, it is too restrictive for anything to be included from a toy dataset with just a couple of sentences, hence the error message you get, which I find pretty explicit. The min_df=2 default has been changed to min_df=1 in the development branch of scikit-learn to be less confusing to new users who try the library with default parameter values on toy datasets.
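As a minimal sketch of the behaviour described above, the snippet below fits a TfidfVectorizer on two toy documents that share no words, first with min_df=2 and then with min_df=1. The documents and variable names are made up for illustration, and the exact wording of the ValueError depends on the installed scikit-learn version, so treat the message as an assumption rather than a fixed string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents with no word in common (hypothetical example data).
docs = ["apples taste sweet", "cars drive fast"]

# With min_df=2 (the old default), no word appears in 2 documents,
# so the vocabulary ends up empty and fitting raises a ValueError.
try:
    TfidfVectorizer(min_df=2).fit_transform(docs)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# With min_df=1, every word is kept in the vocabulary.
tfidf = TfidfVectorizer(min_df=1).fit_transform(docs)

# Rows are L2-normalised by default, so this product is the cosine similarity.
similarity = (tfidf @ tfidf.T).toarray()

print("min_df=2 error:", error_message)
print("min_df=1 similarity matrix:")
print(similarity)
```

Passing min_df=1 explicitly keeps the result the same across scikit-learn versions, and the off-diagonal entries of the similarity matrix come out as 0 for documents with no shared words, which is the value the question expected.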
Python Scikit-learn: Empty Vocabulary in TF-IDF
30-05-2022
Question
I am using the code given in the most up-voted answer to this question (Similarity between two text documents) to calculate TF-IDF between documents. However, I observe that when I run the code WITHOUT specifying a custom value of min_df (1, in the code), then if two documents are completely different (such that they have no word in common), instead of receiving a TF-IDF value of 0, I get the following error:
ValueError: empty vocabulary; training set may have contained only stop words or min_df (resp. max_df) may be too high (resp. too low).
Can somebody tell me how I can get rid of this error?
Solution
Not affiliated with StackOverflow