I have a small collection of HTML documents (10-15) on which I need to run the LDA algorithm in gensim. I am stuck on creating the corpus, because I don't understand how to build one from a collection of HTML documents. The example on the gensim site only shows how to create a corpus from a compressed Wikipedia dump (.xml.bz2).

Could anyone please guide me on how to apply LDA to a set of HTML documents? Thanks in advance.


Solution

Check out HTML processing libraries such as lxml or BeautifulSoup to parse the documents and pull the text out of the markup.
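To give an idea of what that step looks like, here is a minimal sketch that strips markup using only the standard library's `html.parser`. This is a stand-in so the snippet runs without extra installs, not what lxml or BeautifulSoup do internally; with BeautifulSoup the rough equivalent would be `BeautifulSoup(html, "html.parser").get_text()`. The names `TextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse the whitespace left over from markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

For 10-15 files you would read each one from disk and call `html_to_text` on its contents. Navigation menus, footers and similar boilerplate are kept by this approach, so the resulting text can still be noisy.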

For higher level processing (removal of boilerplate, extracting plain text from HTML), have a look at e.g. Honza Pomikalek's jusText package.

Once you have plain text documents, you can proceed as per gensim's tutorials.
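As a rough sketch of that last step, assuming you already have one plain-text string per HTML document: tokenize each document, build a gensim `Dictionary`, convert every document to a bag-of-words vector, and train `LdaModel` on the result. The helper names (`tokenize`, `train_lda`), the tiny stopword list, and the parameter values (`num_topics=5`, `passes=10`) are illustrative assumptions, not recommendations from the gensim tutorials.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens longer than 2 chars, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def train_lda(texts, num_topics=5, passes=10):
    """Train an LDA model on a list of plain-text documents.

    gensim is imported here so that tokenize() stays usable without it.
    """
    from gensim import corpora, models

    tokenized = [tokenize(text) for text in texts]
    # Map each distinct token to an integer id.
    dictionary = corpora.Dictionary(tokenized)
    # The corpus is one bag-of-words vector [(token_id, count), ...] per document.
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = models.LdaModel(
        corpus, id2word=dictionary, num_topics=num_topics, passes=passes
    )
    return lda, dictionary, corpus
```

Usage would be along the lines of `lda, dictionary, corpus = train_lda(texts)` followed by `lda.print_topics()`. With only 10-15 documents the topics are likely to be unstable, so it is worth trying a small `num_topics` and several passes.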

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow