How to use gensim for LDA on news articles?
28-05-2021
Question
I'm trying to retrieve a list of topics from a large corpus of news articles. I'm planning to use gensim to extract a topic distribution for each document using LDA. I want to know the format of processed articles required by gensim's implementation of LDA, and how to convert raw articles to that format. I saw this link about using LDA on a Wikipedia dump, but the corpus there was already in a processed state whose format was not mentioned anywhere.
Solution
There is an offline learning step and an online feature creation step.
Offline Learning
Assume you have a big corpus such as Wikipedia or downloaded a bunch of news articles.
For each article/document:
- You get the raw text
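- You lemmatize it. Gensim versions before 4.0 have utils.lemmatize (it needs the pattern package); it was removed in 4.0, so with newer versions you need an external lemmatizer such as spaCy or NLTK
- You add its tokens to a dictionary (a token-to-integer-id mapping shared by the whole corpus)
- You create a bag-of-words representation, i.e. a list of (token id, count) pairs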
Then you train the TF-IDF model and convert the whole corpus to the TF-IDF space. Finally, you train the LDA model on the "TF-IDF corpus".
Online
With an incoming news article you do almost the same:
- Lemmatize it
- Create a bag-of-words representation using the dictionary.
- Convert it to TF-IDF space using the TF-IDF model
- Convert it to LDA space.
Other suggestions