Convert HTML to Plain Text?

https://stackoverflow.com/questions/791311

16-09-2019

Question

I am able to read emails from Microsoft Exchange using an IMAP client from Lumisoft. I have set the Exchange server settings to convert all mail to plain text. However, when I read in the message bodies, they still seem to contain HTML/CSS.

What is the best way of removing HTML/CSS from the body of an email? Or is there a setting on the Exchange server that I seem to have missed?

Solution

I usually take one of these approaches:

1. Using regular expressions. This can be difficult to get right if the solution also has to cope with all kinds of invalid markup, but I bet someone else has done it before you (hint: search Google or SO).
2. Using an HTML parser library. You can find one for any popular programming language. I recommend the Html Agility Pack (a .NET library).
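
As a rough illustration of the regular-expression approach, here is a minimal sketch in Python (the question does not name a language, so Python's standard library is used as a stand-in). The function name strip_html_regex and the set of tags treated as line breaks are assumptions made for this example, not part of the original answer. It is likely to mishandle badly broken markup, for example a literal "<" inside the text.

```python
import html
import re

# <script> and <style> elements are removed together with their contents.
_BLOCKS = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# HTML comments.
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
# <br> and a few closing block-level tags become line breaks (assumed list).
_BREAKS = re.compile(r"<br\s*/?>|</(?:p|div|tr|li|h[1-6])\s*>", re.IGNORECASE)
# Anything else that looks like a tag.
_TAGS = re.compile(r"<[^>]+>")


def strip_html_regex(markup: str) -> str:
    """Return an approximate plain-text rendering of an HTML string."""
    text = _BLOCKS.sub("", markup)
    text = _COMMENTS.sub("", text)
    text = _BREAKS.sub("\n", text)
    text = _TAGS.sub("", text)
    # Decode entities such as &amp; and &nbsp; after the tags are gone.
    text = html.unescape(text)
    # Collapse runs of spaces, tabs and non-breaking spaces; drop empty lines.
    lines = [re.sub(r"[ \t\xa0]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<style>p{color:red}</style><p>Hello &amp; <b>welcome</b></p><p>Bye</p>"
    print(strip_html_regex(sample))
```

Removing the script and style blocks first matters here, because otherwise the CSS the question complains about would survive as plain text once the tags around it are stripped.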

OTHER TIPS

I'm not sure exactly how your setup works, or whether you can run scripts. An HTML parser is the most reliable way to extract text from HTML. For instance, with Hpricot (a Ruby HTML-parsing library), you could run puts doc.find_element('body').inner_text to print the text content of the document.
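
For readers without Ruby or .NET at hand, the parser-based idea can be sketched with Python's built-in html.parser module. This is a stand-in for Hpricot or the Html Agility Pack, not their API: the class TextExtractor, the function html_to_text, and the choice of which tags to skip or treat as line breaks are all assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of some elements."""

    SKIPPED = {"script", "style", "title"}  # contents are dropped (assumed set)
    BLOCK = {"p", "div", "tr", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "br":
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            # Closing a block-level element ends the current line.
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        joined = "".join(self._parts)
        # Normalise whitespace within each line and drop empty lines.
        lines = [" ".join(line.split()) for line in joined.splitlines()]
        return "\n".join(line for line in lines if line)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<body><p>Line one<br>Line two</p></body>"))
```

Unlike the Hpricot one-liner, this sketch does not restrict itself to the body element; it simply skips elements whose contents are never meant to be displayed.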

Licensed under: CC-BY-SA with attribution
