Question

I'm converting a corpus of text documents into word vectors, one vector per document. I've tried this with both a TfidfVectorizer and a HashingVectorizer.

I understand that a HashingVectorizer does not take IDF scores into account the way a TfidfVectorizer does. The reason I'm still working with a HashingVectorizer is the flexibility it offers when dealing with huge datasets, as explained here and here. (My original dataset has 30 million documents.)
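For what it's worth, here is a minimal sketch of the batch-wise processing I have in mind. The iter_batches helper is hypothetical; the point is that HashingVectorizer is stateless, so transform() needs no prior fit() and each batch can be vectorized independently:

from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

hashing = HashingVectorizer()  # stateless: no vocabulary to learn

docs = ["first document", "second document", "third document", "one more"]

def iter_batches(items, batch_size=2):
    # Hypothetical helper: yield the corpus in fixed-size chunks.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Each batch can be vectorized on its own (even on separate workers),
# because the hashing trick maps terms to columns without global state.
parts = [hashing.transform(batch) for batch in iter_batches(docs)]
matrix = vstack(parts)
print(matrix.shape)  # (4, 1048576): rows accumulate, columns stay fixed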

Currently, I am working with a sample of 45339 documents, so I can use a TfidfVectorizer as well. When I run the two vectorizers on the same 45339 documents, I get matrices with different shapes.

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

hashing = HashingVectorizer()
with LSM('corpus.db') as corpus:
    hashing_matrix = hashing.fit_transform(corpus)
print(hashing_matrix.shape)

hashing matrix shape: (45339, 1048576)

tfidf = TfidfVectorizer()
with LSM('corpus.db') as corpus:
    tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf_matrix.shape)

tfidf matrix shape: (45339, 663307)

I want to better understand the differences between a HashingVectorizer and a TfidfVectorizer, and the reason why these matrices come out with different sizes, particularly in the number of words/terms.
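To make the size difference concrete, here is a toy reproduction (n_features=8 is just to keep the printout small; my understanding is that the default is 2**20 = 1048576, which matches the width I see above):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

# TfidfVectorizer builds a vocabulary: one column per unique term.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
print(tfidf_matrix.shape)  # (3, 5) -> terms: cat, dog, ran, sat, the

# HashingVectorizer hashes terms into a fixed number of buckets, so the
# column count depends only on n_features, not on the corpus contents.
hashing_matrix = HashingVectorizer(n_features=8).fit_transform(docs)
print(hashing_matrix.shape)  # (3, 8)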

No correct solution

Licensed under: CC-BY-SA with attribution