Question

Why does the tf-idf model in gensim throw away terms and counts after I transform the corpus?

My code:

from gensim import corpora, models, similarities

# Let's say you have a corpus made up of 4 documents.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0,1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0,doc1,doc2,doc3]

# Train a tfidf model using the corpus
tfidf = models.TfidfModel(corpus)

# Printing the corpus now still shows the raw frequency counts.
for d in corpus:
  print(d)
print()

# To convert the corpus into tfidf, apply the trained model to it
# to get the normalized tf-idf weights.
corpus = tfidf[corpus]

for d in corpus:
  print(d)

Outputs:

[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]

[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]

Solution

IDF is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. In your case, every document contains term 0, so the IDF for term 0 is log(4/4) = log(1), which equals 0. So in your doc-term matrix, the column for term 0 is all zeros.

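A term that appears in every document carries no information for telling documents apart, so its weight is zero. gensim leaves zero-weight entries out of its sparse output, which is why term 0 disappears and doc1 comes back empty. The remaining weights are all 1.0 because TfidfModel normalizes each document vector to unit length by default, and each vector here has only one nonzero entry left.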
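
To make the arithmetic concrete, here is a minimal pure-Python sketch of the same calculation. It is not gensim's code: the helper name tfidf_transform is made up for illustration, and it assumes a base-2 logarithm and unit-length (L2) normalization, which as far as I know match TfidfModel's defaults.

```python
import math
from collections import Counter


def tfidf_transform(corpus):
    """Hypothetical helper: weight = tf * log2(N / df), then L2-normalize."""
    n_docs = len(corpus)
    # Document frequency: the number of documents each term id appears in.
    # Assumes each term id occurs at most once per document.
    df = Counter(term_id for doc in corpus for term_id, _ in doc)
    idf = {term_id: math.log2(n_docs / count) for term_id, count in df.items()}

    transformed = []
    for doc in corpus:
        # Weight every term, then drop the terms whose weight is zero.
        weighted = [(term_id, tf * idf[term_id]) for term_id, tf in doc]
        weighted = [(term_id, w) for term_id, w in weighted if w > 0.0]
        # Scale the surviving weights so the vector has unit length.
        norm = math.sqrt(sum(w * w for _, w in weighted))
        transformed.append([(term_id, w / norm) for term_id, w in weighted])
    return idf, transformed


corpus = [[(0, 1), (1, 1)], [(0, 1)], [(0, 1), (1, 1)], [(0, 3), (1, 1)]]
idf, result = tfidf_transform(corpus)
print(idf)
for doc in result:
    print(doc)
```

Term 0 occurs in 4 of 4 documents, so its IDF is log2(4/4) = 0 and every count for it, including the 3 in the last document, is multiplied by zero. Term 1 occurs in 3 of 4 documents, so it keeps a positive weight, and normalization then scales that single surviving weight to about 1.0.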

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow