Question

I'm a beginner in the vector space model (VSM), and I tried the code from this site. It's a very good introduction to VSM, but I somehow got different results from the author. It might be a compatibility problem, since scikit-learn seems to have changed a lot since the introduction was written, or I may have misunderstood the explanation.
I used the code below and got the wrong answer. Can someone figure out what is wrong with it? I post the code's output and the correct answer below.

I have done the computation by hand, so I know the website's results are correct. There is another Stack Overflow question that uses the same code, but it doesn't get the same results as the website either.

import numpy, scipy, sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(train_set)

smatrix = vectorizer.transform(test_set)

tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
tfidf.fit(smatrix)
# print(smatrix.todense())
print(tfidf.idf_)

tf_idf_matrix = tfidf.transform(smatrix)
print(tf_idf_matrix.todense())

Resulting idf vector (tfidf.idf_):
# [ 2.09861229  1.          1.40546511  1.        ]

Correct idf vector:
# [ 0.69314718  -0.40546511  -0.40546511  0.        ]

Resulting tf_idf_matrix:
# [[ 0.          0.50154891  0.70490949  0.50154891]
#  [ 0.          0.50854232  0.          0.861037  ]]

Correct answer:
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]

Solution

It's not your fault; the formula used in the current sklearn differs from the one used in the tutorial.

The current version of sklearn uses this formula (source):

idf = log ( n_samples / df ) + 1

where n_samples refers to the total number of documents (|D| in the tutorial) and df refers to the number of documents in which the term appears ({d:t_1 \in D} in the tutorial).

To deal with division by zero, smoothing is enabled by default (the option smooth_idf=True in TfidfVectorizer, see the documentation); it adjusts df and n_samples as follows, so both values are at least 1:

df += 1
n_samples += 1

The tutorial, on the other hand, uses this formula:

idf = log ( n_samples / (1+df) )
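Applied to the same hand-tallied counts from your test_set, the tutorial's formula gives the "correct" idf vector you were expecting:

```python
import numpy as np

# Same test-set document frequencies: blue, sun, bright, sky
df = np.array([0, 2, 2, 1])
n_samples = 2

# The tutorial's formula
idf = np.log(n_samples / (1 + df))
print(idf)  # ≈ [0.69314718, -0.40546511, -0.40546511, 0.0]
```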

So you can't get exactly the same result as the tutorial unless you change the formula in the source code.

Edit:

Strictly speaking, the right formula is log(n_samples/df), but since it causes division by zero in practice, people modify the formula so it can be used in all cases. The most common modification is the one you mentioned, log(n_samples/(1+df)), but it is also not wrong to use log(n_samples/df)+1, given that the counts have already been smoothed beforehand. Reading the code history, it seems the developers did it this way to avoid negative IDF values (as discussed in this pull request and later updated in this fix). Another way to remove negative IDF values is simply to clip them to 0. I have yet to find which method is more commonly used.
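To make the clipping alternative concrete, here is a one-line sketch applied to the tutorial-style idf vector computed from your test set:

```python
import numpy as np

# Tutorial-style idf for the test set (blue, sun, bright, sky)
idf = np.log(2 / (1 + np.array([0, 2, 2, 1])))

# Clip negative IDF values to zero instead of changing the formula
idf_clipped = np.maximum(idf, 0)
print(idf_clipped)  # ≈ [0.69314718, 0.0, 0.0, 0.0]
```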

The developers did agree that their approach is not the standard one, so you can safely say that log(n_samples/(1+df)) is the correct way.

If you want to edit the formula, first a warning: the change will affect every user of that sklearn installation, so make sure you know what you're doing.

You can go to the source code (on Unix it's at /usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py; on Windows, search for the file "text.py") and edit the formula directly. You might need administrator/root access, depending on your platform.
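Alternatively, if you just want the tutorial's numbers without patching the installed package, you can reproduce them directly with numpy: count terms with CountVectorizer over a fixed vocabulary, apply the tutorial's idf formula, and L2-normalize each row. This is a sketch using raw term counts (i.e. without sublinear_tf, which is what the tutorial's result implies):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

# Fix the vocabulary so the column order matches the tutorial
vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
vectorizer = CountVectorizer(vocabulary=vocabulary)
vectorizer.fit_transform(train_set)

# Raw term counts for the test documents
tf = vectorizer.transform(test_set).toarray().astype(float)

# Tutorial's idf: log(n_samples / (1 + df)), df counted on the test set
n_samples = tf.shape[0]
df = (tf > 0).sum(axis=0)
idf = np.log(n_samples / (1 + df))

# Weight by idf and L2-normalize each row
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(tfidf)
# ≈ [[ 0.  -0.70710678  -0.70710678  0. ]
#    [ 0.  -0.89442719  -0.4472136   0. ]]
```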

Additional note:

As an additional note, the order of terms in the vocabulary is also different (at least on my machine), so to get exactly the same result (when the formula is the same), you also need to pass in the same vocabulary as the tutorial. Using your code:

vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
vectorizer = CountVectorizer(vocabulary=vocabulary)  # stop_words isn't needed when you pass a vocabulary
vectorizer.fit_transform(train_set)
print('Vocabulary:', vectorizer.vocabulary_)
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
Licensed under: CC-BY-SA with attribution