How can I make HTML safe for a web browser with Python?

https://stackoverflow.com/questions/1606201

05-07-2019

Question

How can I make HTML from an email safe to display in a web browser with Python?

Any external references shouldn't be followed when the email is displayed. In other words, all displayed content should come from the email itself and nothing from the internet.

Emails other than spam should be displayed as closely as possible to the way the writer intended.

I would like to avoid coding this myself.

Solutions that require the latest browser version (Firefox) are also acceptable.

Solution

html5lib contains an HTML+CSS sanitizer. It allows too much currently, but it shouldn't be too hard to modify it to match the use case.

Found it from here.

OTHER TIPS

I'm not quite clear on what exactly you mean by "safe". It's a pretty big topic... but, for what it's worth:

In my opinion, the stripping parser from the ActiveState Cookbook is one of the easiest solutions. You can pretty much copy/paste the class and start using it.

Have a look at the comments as well. The last one states that it doesn't work anymore, but I also have this running in an application somewhere and it works fine. From work, I don't have access to that box, so I'll have to look it up over the weekend.

Use the HTMLParser module (html.parser in Python 3), or install BeautifulSoup, and use one of them to parse the HTML and disable or remove the link tags. This will leave whatever link text was there, but it will not be highlighted and it will not be clickable, since you are displaying it with a web browser component.
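As a rough illustration of the "remove the tags" idea, here is a minimal sketch using only the standard library's html.parser (Python 3). The names LinkStripper and strip_links are made up for this example. It only drops <a> tags and keeps their text; it also silently drops comments and doctype declarations, and it is not a complete sanitizer (scripts, styles and images pass through untouched).

```python
from html.parser import HTMLParser


class LinkStripper(HTMLParser):
    """Re-emit the parsed HTML, leaving out <a> start and end tags.

    Hypothetical helper for illustration only, not a full sanitizer.
    """

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as they were,
        # so they can be passed through without re-escaping.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written.
        if tag != "a":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "a":
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_links(html):
    """Return html with every <a> tag removed and its link text kept."""
    parser = LinkStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_links('<p>See <a href="http://example.com/">this page</a>.</p>'))
```

The tag name passed to the handlers is already lowercased by the parser, so <A HREF=...> is caught as well.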

You could make it clearer what was done by replacing the <A></A> with a <SPAN></SPAN> and changing the text decoration to show where the link used to be. Maybe a different shade of blue than normal and a dashed underline to indicate brokenness. That way you are a little closer to displaying it as intended without actually misleading people into clicking on something that is not clickable. You could even add a hover in JavaScript or pure CSS that pops up a tooltip explaining that links have been disabled for security reasons.
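A sketch of the <A> to <SPAN> replacement, again with the standard library's html.parser. The class name, function name, colour and wording are all assumptions made for this example. For the tooltip it uses a plain title attribute rather than JavaScript or CSS hover rules, which is a simplification of the idea above.

```python
from html.parser import HTMLParser

# Illustrative values only: a muted blue with a dashed underline.
DISABLED_LINK_STYLE = "color: #6a7fb5; text-decoration: underline dashed;"
DISABLED_LINK_TITLE = "Links have been disabled for security reasons"


class LinkToSpan(HTMLParser):
    """Re-emit the parsed HTML with each <a> swapped for a styled <span>."""

    def __init__(self):
        # Keep entities untouched so they can be passed straight through.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _start(self, tag):
        if tag == "a":
            # The original attributes (href, onclick, ...) are discarded.
            self.parts.append(
                '<span style="%s" title="%s">'
                % (DISABLED_LINK_STYLE, DISABLED_LINK_TITLE)
            )
        else:
            self.parts.append(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        self._start(tag)

    def handle_startendtag(self, tag, attrs):
        self._start(tag)
        if tag == "a":
            # A self-closing <a/> still needs its span closed.
            self.parts.append("</span>")

    def handle_endtag(self, tag):
        self.parts.append("</span>" if tag == "a" else "</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def disable_links(html):
    """Return html with links turned into non-clickable, styled spans."""
    parser = LinkToSpan()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(disable_links('<a href="http://example.com/">click here</a>'))
```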

Similar things could be done with <IMG></IMG> tags, including replacing them with a blank rectangle to ensure that the page layout stays close to the original.
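The image placeholder could look something like the sketch below. It assumes the <img> carries plain numeric width and height attributes and falls back to an arbitrary 100x100 box otherwise; the names and the border style are invented for the example. Images sized only through CSS would need more work.

```python
import re
from html.parser import HTMLParser

DEFAULT_SIZE = 100  # arbitrary fallback, in pixels


def _pixels(value):
    """Return value as an int if it is a short run of digits, else the default."""
    if value and re.fullmatch(r"[0-9]{1,4}", value):
        return int(value)
    return DEFAULT_SIZE


class ImagePlaceholder(HTMLParser):
    """Re-emit the parsed HTML with each <img> replaced by an empty box."""

    def __init__(self):
        # Keep entities untouched so they can be passed straight through.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            # The src attribute is dropped, so nothing is fetched.
            self.parts.append(
                '<span style="display: inline-block; width: %dpx; '
                'height: %dpx; border: 1px dashed #999;"></span>'
                % (_pixels(attrs.get("width")), _pixels(attrs.get("height")))
            )
        else:
            self.parts.append(self.get_starttag_text())

    def handle_starttag(self, tag, attrs):
        self._emit(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "img":  # a stray </img> is simply dropped
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def replace_images(html):
    """Return html with every <img> swapped for a same-sized blank rectangle."""
    parser = ImagePlaceholder()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(replace_images('<img src="http://example.com/a.png" width="40" height="20">'))
```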

I've done stuff like this with Beautiful Soup, but HTMLParser is included with Python. In older Python distributions there was an htmllib module, which is now deprecated. Since the HTML in an email message might not be fully correct, use Beautiful Soup 3.0.7a, which is better at making sense of broken HTML.

Licensed under: CC-BY-SA with attribution
