TF-IDF for Topic Modeling

https://datascience.stackexchange.com/questions/80821

13-12-2020

Question

Can TF-IDF be used as the sole method for topic modeling? (I know there are better methods, such as LDA and LSA.)

I just want to understand whether TF-IDF alone can help with topic modeling. If so, can someone explain how that simple framework works?

I want to understand the application and capabilities of TF-IDF as a sole method for topic modeling. I could not find this anywhere else on the internet.

Solution

Formally, the problem of topic modelling is a clustering problem: given a collection of text documents, group together the documents which are topically similar.

So technically it can indeed be done with a TF-IDF representation of documents as follows:

1. Collect the global vocabulary across all the documents and calculate the IDF for every word.
2. Represent every document as a TF-IDF vector in the usual way: for every word, take its term frequency in the document (TF) and multiply it by the global IDF for that word. Note that every vector must be defined over the global vocabulary, so all vectors have the same dimensions and a word absent from a document gets a zero.
3. Apply any clustering method to the vector representations of the documents: K-means, hierarchical clustering, etc.
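The steps above can be sketched in plain Python as follows. This is only a minimal illustration, not a reference implementation: the toy documents, the whitespace tokenizer, the IDF variant log(N / df), the L2 normalization, the farthest-first initialization, and the function names `tfidf_vectors` and `kmeans` are all assumptions made for the example. In practice one would more likely rely on a library such as scikit-learn.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Steps 1 and 2: global vocabulary, IDF per word, one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Every vector is laid out over the same global vocabulary.
        vec = [tf[word] / max(len(tokens), 1) * idf[word] for word in vocab]
        # L2-normalize so that document length does not dominate distances.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=20):
    """Step 3: a bare-bones K-means with deterministic farthest-first initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the vector that is farthest from its nearest existing centroid.
        centroids.append(
            max(vectors, key=lambda v: min(squared_distance(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assign each document to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Move each centroid to the mean of its assigned documents.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus, invented for this example.
docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]

vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
for doc, label in zip(docs, labels):
    print(label, doc)
```

On this toy corpus the two animal sentences should land in one cluster and the two market sentences in the other. Note that "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing to the vectors, which is exactly the down-weighting of uninformative words that TF-IDF is meant to provide.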

Note that this method is unlikely to perform as well as state-of-the-art methods for topic modelling.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange