Check out HTML processing libraries, like lxml or BeautifulSoup.
For higher-level processing (removing boilerplate, extracting plain text from HTML), have a look at e.g. Honza Pomikalek's jusText package.
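To give a rough idea of what the tag-stripping step involves, here is a minimal sketch that uses only Python's standard-library html.parser as a stand-in. It is not lxml, BeautifulSoup, or jusText, and it does no boilerplate removal. The names TextExtractor and html_to_text are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text, skipping <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse all whitespace runs.
        # Caveat: a word split by inline tags ("<b>wor</b>ld") gets a space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Calling html_to_text(page_source) on each downloaded page gives one plain string per document. For real-world pages, the dedicated libraries above are likely to cope much better with broken markup and navigation clutter.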
Once you have plain text documents, you can proceed as per gensim's tutorials.
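As a hedged sketch of the hand-off to gensim: the tutorials start from each document as a list of tokens, so the plain text has to be tokenized first. The helpers tokenize and build_texts below are hypothetical names, the regex tokenizer is a deliberately crude assumption, and the min_count filter is just one possible way to drop rare words.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude assumption: lowercase, keep runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_texts(documents, min_count=2):
    """Turn plain-text documents into lists of tokens, dropping rare words."""
    tokenized = [tokenize(doc) for doc in documents]
    freq = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if freq[tok] >= min_count] for doc in tokenized]


# With gensim installed, the result can then be passed on, e.g.:
#   from gensim import corpora
#   dictionary = corpora.Dictionary(texts)
#   corpus = [dictionary.doc2bow(text) for text in texts]
```

Here texts = build_texts(list_of_plain_text_documents) would produce the token lists. Stopword removal, stemming and so on are left out and depend on your data.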