lda.collapsed.gibbs.sampler model and top words ranking

https://stackoverflow.com/questions/21341978

02-10-2022

Question

I have a model generated by the function lda.collapsed.gibbs.sampler, from the lda package, and I need to know the "relevance" of the top words. When I call

    top.topic.words(result$topics, 10, by.score=TRUE)

I get a list of the top 10 words for each topic, but I'd like to see the percentage of the topic that those 10 words represent. I guess the information exists, because there is a "score", but I'm not really familiar with the statistical methods behind the Gibbs sampler.

Thanks in advance!

Solution

I think something like this may be what you want:

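    # For each topic, print the cumulative share of its assignments
    # covered by its 20 most frequent words.
    for (ii in seq_len(nrow(result$topics))) {
      print(
        head(
          cumsum(
            sort(result$topics[ii, ], decreasing = TRUE)
          ),
          n = 20
        ) / result$topic_sums[ii]
      )
    }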

Let's break it down. If you want the fraction of Gibbs assignments, that is easy. The LDA routine returns the number of assignments for each (word, topic) pair, so all you have to do is sort each row of result$topics to get the top words (this is essentially what top.topic.words does if you set by.score=FALSE). Once a row is sorted, you can compare, for each topic, the count for a given word against the count for the entire topic. To do that, I divide by result$topic_sums, which contains the total number of assignments to each topic. Finally, I use cumsum so you can see the running total weight of the top words in that topic. The loop prints the top 20 words; set n = 10 to match your call.
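To make the arithmetic concrete, here is a minimal sketch on a toy count matrix. It does not call the lda package: the `topics` matrix, its word names, and the helper `top_word_coverage` are all made up for illustration, and `topic_sums` is computed with rowSums as a stand-in for result$topic_sums.

```r
# Toy stand-in for result$topics: rows are topics, columns are words,
# entries are assignment counts. The values are invented for illustration.
topics <- matrix(
  c(5, 3, 2, 0,
    1, 1, 4, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("apple", "banana", "cherry", "date"))
)

# Stand-in for result$topic_sums: total assignments per topic.
topic_sums <- rowSums(topics)

# Hypothetical helper: cumulative share of each topic's assignments
# covered by its n most frequent words.
top_word_coverage <- function(topics, topic_sums, n = 10) {
  lapply(seq_len(nrow(topics)), function(k) {
    counts <- sort(topics[k, ], decreasing = TRUE)  # most frequent words first
    head(cumsum(counts), n) / topic_sums[k]         # running fraction of the topic
  })
}

coverage <- top_word_coverage(topics, topic_sums, n = 2)
print(coverage)
```

With these toy counts, the top two words cover 80% of the assignments in each topic (cumulative fractions 0.5 and 0.8 for the first topic, 0.4 and 0.8 for the second). The last value in each vector is the "percentage of the topic" the top n words represent.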

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow