Question

Why does the tf-idf model in gensim throw away some terms and counts after I transform the corpus?

My code:

from gensim import corpora, models, similarities

# Let's say you have a corpus made up of 4 documents.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0,1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0,doc1,doc2,doc3]

# Train a tfidf model using the corpus
tfidf = models.TfidfModel(corpus)

# Printing the corpus at this point still shows the raw term-frequency counts.
for d in corpus:
    print(d)
print()

# To convert the corpus into tf-idf, wrap it with the model;
# this yields the (normalized) tf-idf weights.
corpus = tfidf[corpus]

for d in corpus:
    print(d)

Outputs:

[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]

[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]

Solution

IDF is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. In your case, every document contains term0, so the IDF for term0 is log(4/4) = log(1) = 0. As a result, the column for term0 in your document-term matrix is all zeros, and gensim drops those zero entries from the sparse output.
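As a sanity check, here is a minimal sketch of that calculation for the corpus above (assuming gensim's default IDF, which uses a base-2 logarithm; the zero weight for term0 holds for any base):

import math

total_docs = 4                 # doc0 .. doc3
doc_freq = {0: 4, 1: 3}        # term0 appears in all 4 docs, term1 in 3 of them

for term, df in doc_freq.items():
    idf = math.log2(total_docs / df)
    print(term, idf)

# term0 -> log2(4/4) = 0.0, so every tf-idf weight for term0 vanishes
# term1 -> log2(4/3) ≈ 0.415, which survives and is normalized to 1.0
#          when it is the only non-zero entry in a document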

A term that appears in all documents gets zero weight: it carries no information for distinguishing one document from another.
