LDA for HTML Documents in Gensim

https://stackoverflow.com/questions/22361438

python
gensim

13-06-2023

Question

I have a set of 10-15 HTML documents to which I need to apply the LDA algorithm in gensim. I am stuck on creating the corpus, because I don't understand how to build one from a collection of HTML documents. The example on the gensim site builds the corpus from a compressed Wikipedia dump (.xml.bz2).

Could anyone please explain how to apply LDA to a set of HTML documents? Thanks in advance.

Solution

Check out HTML processing libraries such as lxml or Beautiful Soup.
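
As a rough illustration of this step, here is a minimal sketch that uses Beautiful Soup to strip the markup from one HTML string. The helper name html_to_text and the sample markup are made up for this example, and the sketch assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements: their contents are not prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the text nodes with spaces, then collapse runs of whitespace.
    return " ".join(soup.get_text(separator=" ").split())


# Made-up sample document standing in for one of the HTML files.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Topic models</h1>"
    "<p>LDA finds <b>topics</b> in text</p></body></html>"
)
text = html_to_text(sample)
print(text)
```

For files on disk, the same helper can be applied to the contents of each file in turn, which gives one plain text string per document.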

For higher-level processing (removing boilerplate, extracting plain text from HTML), have a look at, e.g., Honza Pomikalek's jusText package.

Once you have plain text documents, you can proceed as per gensim's tutorials.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow