Question

I'm trying to cluster some documents according to a tf-idf matrix using Python.

First I follow the Wikipedia definition of the formula, using normalised tf: http://en.wikipedia.org/wiki/Tf-idf

feat_vectors starts as a two-dimensional numpy array, with rows representing documents and columns representing terms; the value in each cell is the number of occurrences of that term in that document.

import numpy as np

# normalised tf: divide each row by that document's maximum term count
feat_vectors /= np.max(feat_vectors, axis=1)[:, np.newaxis]
# idf: number of documents / number of documents containing each term
idf = len(feat_vectors) / (feat_vectors != 0).sum(0)
idf = np.log(idf)
# weight each tf by the idf of its term
feat_vectors *= idf

I then cluster these vectors using scipy:

from scipy.cluster import hierarchy

# complete-linkage clustering on pairwise cosine distances
clusters = hierarchy.linkage(feat_vectors, method='complete', metric='cosine')
# cut the dendrogram into flat clusters using the inconsistency criterion
flat_clusters = hierarchy.fcluster(clusters, 0.8, 'inconsistent')

However, on that last line it throws an error:

ValueError: Linkage 'Z' contains negative distances.

Cosine similarity ranges from -1 to 1. However, the Wikipedia page for cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) states:

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative.

So if I am getting a negative similarity, it seems that I am making some error in calculating tf-idf. Any ideas what my mistake is?


Solution

I suspect the error is in the following line:

idf = len(feat_vectors) / (feat_vectors != 0).sum(0)

Since your logical vector is converted to integers in the sum and len() returns an int, the division is done in integers and you lose precision. Replacing it with:

idf = float(len(feat_vectors)) / (feat_vectors != 0).sum(0)

worked for me (i.e. produced what I was expecting with dummy data). Everything else looks correct.
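For illustration, here is a minimal sketch with made-up counts (assuming Python 2 semantics, where / between two integers is floor division); the numbers are just placeholders:

import numpy as np

# dummy term-count matrix: 3 documents x 4 terms
feat_vectors = np.array([[2., 0., 1., 0.],
                         [0., 3., 0., 1.],
                         [1., 1., 0., 0.]])

n_docs = len(feat_vectors)                    # 3
doc_freq = (feat_vectors != 0).sum(0)         # [2 2 1 1] documents per term

# mimics the buggy line under Python 2: integer division floors the ratio,
# so terms appearing in 2 of 3 documents get idf = log(1) = 0
idf_buggy = np.log(n_docs // doc_freq)        # [0.      0.      1.0986  1.0986]

# the fix: force float division before taking the log
idf_fixed = np.log(float(n_docs) / doc_freq)  # [0.4055  0.4055  1.0986  1.0986]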

OTHER TIPS

I know this is an old post, but I stumbled into this problem myself recently. In fact, after my own functions gave this error, I even used TfidfVectorizer (from sklearn.feature_extraction.text) to generate the TF-IDF matrix, but that did not help either.

It appears that the cosine metric being used for the distances can produce negative values. I tried euclidean instead and it worked immediately. Here is a link to a more detailed answer on the same issue which I found: https://stackoverflow.com/a/2590194/3228300
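As a rough sketch of that substitution (reusing the variable names from the question; the 0.8 threshold is simply the question's original value):

from scipy.cluster import hierarchy

# same pipeline as in the question, but with Euclidean distances,
# which are never negative, so fcluster no longer complains
clusters = hierarchy.linkage(feat_vectors, method='complete', metric='euclidean')
flat_clusters = hierarchy.fcluster(clusters, 0.8, 'inconsistent')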

Hope this helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow