Question

What I want to do is take a text training set (natural language) and augment it with automatically generated text that mimics its content. I'm using a bag-of-words assumption: sequence doesn't matter, syntax doesn't matter. I just want to create text containing words that are pertinent to the general topic of the base set.

Right now I'm using Latent Dirichlet Allocation to infer topic distributions for my documents, averaging the topic distributions over the whole set, and generating documents from that averaged distribution.
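The generation step described above can be sketched in pure Python. Everything here is illustrative: the topic-word probabilities and the averaged topic distribution stand in for what a real LDA model (e.g. gensim's `LdaModel`) would produce; none of the numbers come from an actual model.

```python
import random

# Hypothetical LDA output (all values illustrative, not from a real model):
# topic_word[t][w] = P(word w | topic t)
# avg_topics[t]    = average P(topic t | doc) over the training set
topic_word = {
    0: {"election": 0.4, "senate": 0.35, "vote": 0.25},
    1: {"policy": 0.5, "budget": 0.3, "tax": 0.2},
}
avg_topics = {0: 0.6, 1: 0.4}

def generate_document(n_words, avg_topics, topic_word, seed=None):
    """Sample a bag-of-words document: for each word slot, draw a topic
    from the averaged topic distribution, then draw a word from that
    topic's word distribution."""
    rng = random.Random(seed)
    topics = list(avg_topics)
    topic_probs = [avg_topics[t] for t in topics]
    words = []
    for _ in range(n_words):
        t = rng.choices(topics, weights=topic_probs)[0]
        vocab = list(topic_word[t])
        word_probs = [topic_word[t][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_probs)[0])
    return words

print(generate_document(10, avg_topics, topic_word, seed=0))
```

Since order is ignored under the bag-of-words assumption, the word list itself is the "document"; joining it with spaces is only cosmetic.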

I want to know two things:

1- Is there a better way to do that?

2- Can I train LDA on texts that are not from the domain of my set without tainting my topics? E.g., the set I want to augment contains texts about politics. Can I train the model on any kind of text (cars, fashion, music), then classify my base of politics texts, get their topic distributions, and generate similar text from those distributions?

I'm using python 2.7 and gensim.


Solution

NLTK's generate() function may be what you're looking for.

From the docs:

generate(length=100)

Print random text, generated using a trigram language model.

Parameters:

length (int) – The length of text to generate (default=100)
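Under the hood, a trigram language model of this kind just records which token followed each pair of consecutive tokens, then samples continuations. A minimal self-contained sketch of that idea (not NLTK's actual implementation, and without NLTK's smoothing or tokenization):

```python
import random
from collections import defaultdict

def build_trigram_model(tokens):
    """Map each pair of consecutive tokens to the list of tokens that
    followed that pair in the training text (duplicates preserved, so
    sampling is frequency-weighted)."""
    model = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)].append(c)
    return model

def generate(model, seed_pair, length=100, seed=None):
    """Start from a seed bigram and repeatedly sample a continuation
    for the last two tokens, stopping early at a dead end."""
    rng = random.Random(seed)
    out = list(seed_pair)
    while len(out) < length:
        continuations = model.get((out[-2], out[-1]))
        if not continuations:  # no observed continuation for this pair
            break
        out.append(rng.choice(continuations))
    return out

tokens = "the cat sat on the mat the cat ran".split()
model = build_trigram_model(tokens)
print(" ".join(generate(model, ("the", "cat"), length=8, seed=0)))
```

Note this contradicts the bag-of-words framing somewhat: a trigram model captures local word order, whereas sampling from LDA topic distributions ignores order entirely. If only topical word content matters, the LDA-based sampling already does the job; the trigram model adds locally plausible sequences on top.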

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow