What is the difference between a hashing vectorizer and a tfidf vectorizer
30-10-2019

Question
I'm converting a corpus of text documents into word vectors, one per document. I've tried this with both a TfidfVectorizer and a HashingVectorizer.

I understand that a HashingVectorizer does not take IDF scores into account the way a TfidfVectorizer does. The reason I'm still working with a HashingVectorizer is the flexibility it offers when dealing with huge datasets, as explained here and here. (My original dataset has 30 million documents.)

Currently I am working with a sample of 45,339 documents, so I am able to use a TfidfVectorizer as well. When I use these two vectorizers on the same 45,339 documents, the matrices I get are different.
from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer()
with LSM('corpus.db') as corpus:
    hashing_matrix = hashing.fit_transform(corpus)
print(hashing_matrix.shape)
hashing matrix shape: (45339, 1048576)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
with LSM('corpus.db') as corpus:
    tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.shape)
tfidf matrix shape: (45339, 663307)
I want to better understand the differences between a HashingVectorizer and a TfidfVectorizer, and the reason why these matrices have different sizes, particularly in the number of words/terms.
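The shape difference can be reproduced on a tiny corpus. A minimal sketch (the three example documents are made up for illustration): HashingVectorizer maps tokens into a fixed number of hash buckets (2**20 = 1,048,576 by default), so its column count never depends on the data, while TfidfVectorizer builds an explicit vocabulary during fit, so its column count equals the number of distinct terms seen.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical mini-corpus standing in for the real 45,339 documents
docs = ["the cat sat", "the dog barked", "the cat barked"]

# Fixed-width output: 2**20 hash buckets regardless of corpus contents
hashing_matrix = HashingVectorizer().fit_transform(docs)
print(hashing_matrix.shape)  # (3, 1048576)

# Data-dependent output: one column per distinct term in the fitted vocabulary
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf_matrix.shape)    # (3, 5) -> the, cat, sat, dog, barked
print(sorted(tfidf.vocabulary_))
```

On the full corpus this is exactly why the second dimension is 1,048,576 for hashing but 663,307 (the distinct-term count) for tf-idf; the hash width is adjustable via the `n_features` parameter of HashingVectorizer.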