Question

I'm trying to retrieve a list of topics from a large corpus of news articles, and I plan to use gensim to extract a topic distribution for each document with LDA. I want to know the format of processed articles that gensim's LDA implementation requires, and how to convert raw articles into that format. I saw this link about using LDA on a Wikipedia dump, but the corpus there is already in a processed state whose format was not mentioned anywhere.

Solution

There is an offline learning step and an online feature creation step.

Offline Learning

Assume you have a big corpus, such as a Wikipedia dump or a bunch of downloaded news articles.

For each article/document:

  1. You get the raw text.
  2. You lemmatize it. Gensim has utils.lemmatize.
  3. You create a dictionary.
  4. You create a bag-of-words representation.

Then you train the TF-IDF model and convert the whole corpus to the TF-IDF space. Finally, you train the LDA model on the "TF-IDF corpus". A sketch of the whole pipeline follows.
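A minimal sketch of the offline step, assuming `raw_articles` is a placeholder list of raw text strings and `num_topics=100` is an arbitrary choice. The lemmatization step is replaced here with gensim's plain tokenizer, since `utils.lemmatize` depends on the optional `pattern` package:

    from gensim import corpora, models
    from gensim.utils import simple_preprocess

    # Placeholder for your raw news articles.
    raw_articles = ["First news article text ...", "Second news article text ..."]

    # Steps 1-2: tokenize (or lemmatize) each document into a list of tokens.
    texts = [simple_preprocess(doc) for doc in raw_articles]

    # Step 3: build a dictionary mapping each token to an integer id.
    dictionary = corpora.Dictionary(texts)

    # Step 4: bag-of-words representation, a list of (token_id, count) pairs per document.
    bow_corpus = [dictionary.doc2bow(text) for text in texts]

    # Train the TF-IDF model and transform the whole corpus to TF-IDF space.
    tfidf = models.TfidfModel(bow_corpus)
    tfidf_corpus = tfidf[bow_corpus]

    # Train LDA on the "TF-IDF corpus"; num_topics is a value you would tune.
    lda = models.LdaModel(tfidf_corpus, id2word=dictionary, num_topics=100)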

Online

With an incoming news article you do almost the same (see the sketch after the list):

  1. Lemmatize it
  2. Create a bag-of-words representation using the dictionary.
  3. Convert it to TF-IDF space using the TF-IDF model
  4. Convert it to LDA space.
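A sketch of the online step for a single incoming article, reusing the `dictionary`, `tfidf` and `lda` objects from the sketch above; `new_article` is a placeholder string:

    from gensim.utils import simple_preprocess

    new_article = "Text of a freshly downloaded news article ..."

    tokens = simple_preprocess(new_article)   # 1. tokenize/lemmatize
    bow = dictionary.doc2bow(tokens)          # 2. bag-of-words with the same dictionary
    tfidf_vec = tfidf[bow]                    # 3. convert to TF-IDF space
    topics = lda[tfidf_vec]                   # 4. convert to LDA space
    print(topics)                             # list of (topic_id, probability) pairs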

Other tips

I don't know if I got the problem right, but gensim supports multiple corpus formats. You can find a list of them here.
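For example, the bag-of-words corpus from the sketch above can be serialized in one of those formats (Matrix Market here) and streamed back later without holding everything in memory; the file name is a placeholder:

    from gensim.corpora import MmCorpus

    # Write the bag-of-words corpus to Matrix Market format (plus an index file).
    MmCorpus.serialize('news_corpus.mm', bow_corpus)

    # Later: stream it back and feed it to TF-IDF/LDA as before.
    loaded_corpus = MmCorpus('news_corpus.mm')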

If you want to process natural language, you have to tokenize the text first. You can follow the step-by-step tutorial on the gensim website here. It's explained pretty well.
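As a quick illustration, gensim's own simple_preprocess is an easy way to turn a raw string into tokens (the example sentence is made up):

    from gensim.utils import simple_preprocess

    # Lowercases, strips punctuation, and drops very short or very long tokens.
    print(simple_preprocess("Gensim expects tokenized text, not raw strings!"))
    # ['gensim', 'expects', 'tokenized', 'text', 'not', 'raw', 'strings']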
