Question

I'm a beginner in the vector space model (VSM), and I tried the code from this site. It's a very good introduction to VSM, but I somehow got different results from the author. It might be a compatibility problem, since scikit-learn seems to have changed a lot since the introduction was written, or I may have misunderstood the explanation.
I used the code below and got the wrong answer. Can someone figure out what is wrong with it? I post the code's output and the correct answer below.

I have done the computation by hand, so I know the website's results are correct. There is another Stack Overflow question that uses the same code, but it doesn't get the same results as the website either.

import numpy, scipy, sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(train_set)

smatrix = vectorizer.transform(test_set)

tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
tfidf.fit(smatrix)
# print(smatrix.todense())
print(tfidf.idf_)

tf_idf_matrix = tfidf.transform(smatrix)
print(tf_idf_matrix.todense())

Resulting idf vector (tfidf.idf_):
# [ 2.09861229  1.          1.40546511  1.        ]

Correct idf vector:
# [ 0.69314718  -0.40546511  -0.40546511  0.        ]

Resulting tf_idf_matrix:
# [[ 0.          0.50154891  0.70490949  0.50154891]
#  [ 0.          0.50854232  0.          0.861037  ]]

Correct answer:
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]

Solution

It's not your fault; the formula used in the current sklearn differs from the one used in the tutorial.

The current version of sklearn uses this formula (source):

idf = log ( n_samples / df ) + 1

where n_samples refers to the total number of documents (|D| in the tutorial) and df refers to the number of documents in which the term appears ({d:t_1 \in D} in the tutorial).

To deal with division by zero, smoothing is enabled by default (the option smooth_idf=True in TfidfVectorizer, see the documentation); it adjusts df and n_samples as follows, so both values are at least 1:

df += 1
n_samples += 1

The tutorial, on the other hand, uses this formula:

idf = log ( n_samples / (1+df) )
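Applied to the same hand-tallied counts from your test_set, the tutorial's formula gives the "correct" idf vector you were expecting:

```python
import numpy as np

# Same test-set document frequencies: blue, sun, bright, sky
df = np.array([0, 2, 2, 1])
n_samples = 2

# The tutorial's formula
idf = np.log(n_samples / (1 + df))
print(idf)  # ≈ [0.69314718, -0.40546511, -0.40546511, 0.0]
```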

So you can't get exactly the same result as the tutorial unless you change the formula in the source code.

Edit:

Strictly speaking, the right formula is log(n_samples/df), but since it causes division by zero in practice, people modify the formula so it can be used in all cases. The most common modification is the one you mentioned, log(n_samples/(1+df)), but it is also not wrong to use log(n_samples/df)+1, given that the counts have already been smoothed beforehand. Reading the code history, it seems the developers did it this way to avoid negative IDF values (as discussed in this pull request and later updated in this fix). Another way to remove negative IDF values is simply to clip them to 0. I have yet to find which method is more commonly used.
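To make the clipping alternative concrete, here is a one-line sketch applied to the tutorial-style idf vector computed from your test set:

```python
import numpy as np

# Tutorial-style idf for the test set (blue, sun, bright, sky)
idf = np.log(2 / (1 + np.array([0, 2, 2, 1])))

# Clip negative IDF values to zero instead of changing the formula
idf_clipped = np.maximum(idf, 0)
print(idf_clipped)  # ≈ [0.69314718, 0.0, 0.0, 0.0]
```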

The developers did agree that their approach is not the standard one, so you can safely say that log(n_samples/(1+df)) is the correct way.

If you want to edit the formula, first a warning: the change will affect every user of that sklearn installation, so make sure you know what you're doing.

You can go to the source code (on Unix it's at /usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py; on Windows, search for the file "text.py") and edit the formula directly. You might need administrator/root access, depending on your platform.
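Alternatively, if you just want the tutorial's numbers without patching the installed package, you can reproduce them directly with numpy: count terms with CountVectorizer over a fixed vocabulary, apply the tutorial's idf formula, and L2-normalize each row. This is a sketch using raw term counts (i.e. without sublinear_tf, which is what the tutorial's result implies):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

# Fix the vocabulary so the column order matches the tutorial
vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
vectorizer = CountVectorizer(vocabulary=vocabulary)
vectorizer.fit_transform(train_set)

# Raw term counts for the test documents
tf = vectorizer.transform(test_set).toarray().astype(float)

# Tutorial's idf: log(n_samples / (1 + df)), df counted on the test set
n_samples = tf.shape[0]
df = (tf > 0).sum(axis=0)
idf = np.log(n_samples / (1 + df))

# Weight by idf and L2-normalize each row
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(tfidf)
# ≈ [[ 0.  -0.70710678  -0.70710678  0. ]
#    [ 0.  -0.89442719  -0.4472136   0. ]]
```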

Additional note:

As an additional note, the order of terms in the vocabulary is also different (at least on my machine), so to get exactly the same result (when the formula is the same), you also need to pass in the same vocabulary as the tutorial. Using your code:

vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
vectorizer = CountVectorizer(vocabulary=vocabulary)  # stop_words isn't needed when you pass a vocabulary
vectorizer.fit_transform(train_set)
print('Vocabulary:', vectorizer.vocabulary_)
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
Licensed under: CC-BY-SA with attribution