Question

I have a set of 10-15 HTML documents to which I need to apply the LDA algorithm in gensim. I am stuck on creating the corpus, because I don't understand how to build a corpus from a collection of HTML documents. The example on the gensim site shows how to create one from a compressed Wikipedia dump (.xml.bz2).

Could anyone please guide me on how to apply LDA to a set of HTML documents? Thanks in advance.

Solution

Check out HTML processing libraries such as lxml or BeautifulSoup to parse the documents.

For higher-level processing (removing boilerplate, extracting plain text from HTML), have a look at, e.g., Honza Pomikalek's jusText package.
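
As a rough illustration of the "HTML to plain text" step, here is a minimal sketch. It uses Python's standard-library html.parser rather than lxml, BeautifulSoup, or jusText, purely to stay dependency-free, and it does no boilerplate removal. The names TextExtractor, html_to_text, and load_html_folder are made up for this example, and UTF-8 encoded *.html files are an assumption.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of one HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


def load_html_folder(folder):
    """Read every *.html file in `folder` (assumed UTF-8) as plain text."""
    return [
        html_to_text(path.read_text(encoding="utf-8", errors="ignore"))
        for path in sorted(Path(folder).glob("*.html"))
    ]
```

For real pages with navigation menus, footers, and ads, a dedicated tool such as jusText will likely give cleaner text than this sketch.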

Once you have plain text documents, you can proceed as per gensim's tutorials.
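
To make that last step concrete, here is a minimal sketch of going from a list of plain-text strings to an LDA model, assuming gensim is installed. The tokenize and build_lda helpers, the tiny stopword list, and the default values for num_topics and passes are illustrative choices for this example, not something prescribed by the gensim tutorials.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "and", "for", "with", "that", "this", "are", "was"})


def tokenize(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stopwords."""
    return [
        token
        for token in re.findall(r"[a-z]+", text.lower())
        if len(token) > 2 and token not in STOPWORDS
    ]


def build_lda(plain_texts, num_topics=5, passes=10):
    """Build a dictionary, a bag-of-words corpus, and an LDA model."""
    # Imported here so that tokenize() can be used without gensim installed.
    from gensim import corpora, models

    texts = [tokenize(text) for text in plain_texts]

    # Map each unique token to an integer id.
    dictionary = corpora.Dictionary(texts)

    # Each document becomes a list of (token_id, count) pairs.
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    lda = models.LdaModel(
        corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
    )
    return lda, corpus, dictionary
```

Calling build_lda on the extracted texts returns the trained model, and lda.print_topics() then lists the top words per topic. With only 10-15 documents, a small num_topics and several passes are probably a sensible starting point, though the resulting topics may still be noisy.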

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow