Question

I have a small collection of HTML documents (10-15) to which I need to apply the LDA algorithm in gensim. I am stuck on creating the corpus, because I don't understand how to build one from a collection of HTML documents. The example on the site shows how to create a corpus from a compressed Wikipedia dump (.xml.bz2).

Could anyone please explain how I can apply LDA to a set of HTML documents? Thanks in advance.

Solution

Check out HTML processing libraries, like lxml or beautifulsoup.
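
To give an idea of what this step involves, here is a minimal sketch that strips tags using only Python's standard-library `html.parser`, as a dependency-free stand-in for lxml or BeautifulSoup (which are more robust on messy real-world markup). The names `TextExtractor` and `html_to_text` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```

Because the chunks are joined with spaces, a word split by an inline tag (e.g. `wor<b>ld</b>`) would come out as two tokens; a proper HTML library handles such cases more gracefully.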

For higher level processing (removal of boilerplate, extracting plain text from HTML), have a look at e.g. Honza Pomikalek's jusText package.

Once you have plain text documents, you can proceed as per gensim's tutorials.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow