Question

I'm working on a project using TF-IDF vectors and agglomerative clustering -- the idea is that the corpus of documents grows over time, and when a new document is added, the mean cosine similarity between it and each existing cluster is calculated to find the best match.

However, I'm under the impression that it is costly and inefficient to re-calculate the TF-IDF vectors of every document each time a new doc is added to the corpus.

I'm trying to discern if it's possible to generate the TF-IDF vectors on an individual basis for each document (see below), and then calculate the cosine similarity by temporarily reshaping/modifying the two vectors.

from sklearn.feature_extraction.text import TfidfVectorizer

doc_1 = ['some', 'text', 'here']
doc_2 = ['other', 'text', 'here']

vectorizer = TfidfVectorizer()

# each document is vectorised on its own, so the two vectors end up with different vocabularies
vector_1 = vectorizer.fit_transform([' '.join(doc_1)])
vector_2 = vectorizer.fit_transform([' '.join(doc_2)])

# somehow add the features from vector_1 to vector_2 that do not exist there, and vice versa...

# this way vectors can be calculated on the fly without having to regenerate vectors with an ever-expanding corpus

It's my understanding that the sklearn TF-IDF vectoriser returns not only the vectors but also a mapping between the unique words in the corpus and the columns they correspond to, so I think this approach should be possible, but I'm not sure.
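The mapping I'm referring to is (I believe) the vocabulary_ attribute of the fitted vectorizer, e.g.:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(['some text here', 'other text here'])

# term -> column index of the TF-IDF matrix, e.g. {'here': 0, 'other': 1, 'some': 2, 'text': 3}
print(vectorizer.vocabulary_)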

Finally, as I'm new to data science: am I approaching this issue horribly? Is there a much cleaner or more efficient way of solving the problem of non-parametric clustering (i.e. no n_clusters) with an ever-expanding corpus?

Solution

Your thinking is correct: it is inefficient to recalculate TF-IDF on the whole collection every time a document is added.

There are two parts in TF-IDF:

  • $TF(w,d)$ is the Term Frequency of a word $w$ in a particular document $d$; it's independent of other documents. It's simply defined as the proportion of $w$ in $d$.
  • $IDF(w)$ is the Inverse Document Frequency: it is defined as the log of the inverse of the document frequency, where the document frequency of a word is the proportion of documents in the collection which contain this word (a small worked example follows the formula):

$$IDF(w)=\log\left(\frac{|D|}{|\{d \in D \text{ such that } w\in d\}|}\right)$$
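As a small illustration of both definitions in plain Python (the three-document collection below is purely hypothetical):

import math

# hypothetical toy collection of 3 tokenised documents
docs = [['some', 'text', 'here'],
        ['other', 'text', 'here'],
        ['more', 'text']]

def tf(word, doc):
    # proportion of the word within the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log of (number of documents / number of documents containing the word)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

# 'text' is in all 3 documents, so IDF('text') = log(3/3) = 0 and its TF-IDF weight vanishes
# 'here' is in 2 of 3 documents, so IDF('here') = log(3/2) ≈ 0.405
print(tf('here', docs[0]) * idf('here', docs))   # ≈ 1/3 * 0.405 ≈ 0.135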

$IDF(w)$ depends on all the documents, so it's costly to recalculate. However, updating the IDF whenever a new document arrives can be implemented efficiently:

  • Maintain a global map of the document frequency (not the IDF) for every word. This map is easy to update: for every distinct word $w$ in the new document, increment $m[w]$.
  • Whenever a TF-IDF weight is needed, look up the required document frequency and apply the IDF formula (see the sketch after this list).
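Here is a minimal sketch of that idea in plain Python (the helper names add_document, tfidf_vector and cosine are just illustrative, not an existing API):

import math
from collections import Counter

doc_freq = Counter()   # global map: word -> number of documents containing it
num_docs = 0

def add_document(tokens):
    # update the document-frequency map with the distinct words of the new document
    global num_docs
    num_docs += 1
    for word in set(tokens):
        doc_freq[word] += 1

def tfidf_vector(tokens):
    # build the TF-IDF weights on the fly from the current document frequencies
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(num_docs / doc_freq[w])
            for w, c in counts.items() if doc_freq[w] > 0}

def cosine(v1, v2):
    # sparse, dict-based cosine similarity; words missing from one vector contribute 0
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# usage: register documents as they arrive, then vectorise/compare on demand
add_document(['some', 'text', 'here'])
add_document(['other', 'text', 'here'])
add_document(['more', 'text'])
print(cosine(tfidf_vector(['some', 'text', 'here']),
             tfidf_vector(['other', 'text', 'here'])))   # ≈ 0.12, driven by the shared word 'here'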

I don't know whether this can be done with the scikit-learn API or not, but it's not too big a task to implement it yourself (and it's a good little exercise to understand TF-IDF ;) )

Licensed under: CC-BY-SA with attribution