HTML Sanitization in C++

https://stackoverflow.com/questions/764540

11-09-2019

Question

Is there any available C++ (or maybe C) function/class/library whose only purpose is to sanitize a string that might contain HTML?

I find a lot of source code for sanitizing in C# or other languages more commonly used in web applications, but nothing in C++.

I'll try to implement my own function if I don't find one, but I think a heavily tested solution would be far better.

Edit: some more details on my needs:

I'm getting text input from the keyboard in my C++ application. I then need to sanitize it before using it as a parameter in a JavaScript function call. That JavaScript runs in a loaded HTML page that is automatically rendered (via Chromium) to a texture that I display via a library (Navi). So the JavaScript function I use will simply take the given text, put a P tag around it and inject it into a div like this:

text_display.innerHTML += text_to_add;

I need to sanitize the text before sending it to the web page, be it for this function or another. It just has to be sanitized before it is passed to Chromium.

Solution

HTML Tidy is written in C, but there are bindings for practically every language/platform, including C++.

OTHER TIPS

You could use libxml2's xmlEncodeSpecialChars.

You are asking quite the question here. Before you can get a good answer, you need to be clear on what exactly you want to strip OUT of your input. For example, you could look for any "<" chars and convert them to something else, so they are not interpreted by any HTML parser.
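As a rough illustration of the "convert them to something else" approach, here is a minimal sketch that replaces the characters that are significant in HTML markup with entity references. The function name escape_html is made up for this example, and it only covers the HTML side: text passed inside a JavaScript string literal would need its own escaping on top of this.

```cpp
#include <string>

// Hypothetical helper: replace the characters that are significant in HTML
// markup with entity references, so a browser shows them as literal text.
std::string escape_html(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

Nothing is removed with this approach: the user's text is shown exactly as typed, including any angle brackets.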

Or you could search for an opening < ... > pattern followed by a matching closing </ ... > pattern. You would then also need to look for single-element tags of the form < ... />.

You can actually look for valid/known HTML tags and strip THOSE out.

So the question becomes: which method is correct for your situation? Bear in mind that if you write a simple parser, you may rip out valid text that contains greater-than and less-than symbols.

So, here is my answer for you thus far.

If you want to simply REMOVE any HTML-esque text, I'd recommend going with a regular expression engine (PCRE), using it to scan your input and remove all the matched strings. This is probably the easy solution, but it does require you to get and build PCRE, and you need to check that its license is compatible with your project. The matching would probably be really easy to implement, and would run quickly.
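To give an idea of the regex route, here is a small sketch. Note that it uses the standard library's std::regex rather than PCRE, purely so it compiles without extra dependencies. The function name strip_tags_regex and the pattern itself are assumptions for illustration, not a complete definition of what counts as a tag.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: remove anything that looks like an opening or
// closing tag, i.e. '<', an optional '/', a letter, then everything up to
// the next '>'. A '<' that is not followed by a letter is left alone.
std::string strip_tags_regex(const std::string& in) {
    static const std::regex tag("</?[A-Za-z][^>]*>");
    return std::regex_replace(in, tag, "");
}
```

A pattern like this is easy to fool (comments, a '>' inside an attribute value, unterminated tags), so treat it as a quick filter and not as a security boundary.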

The second option is to do it by walking a buffer: look for the open HTML char (<), parse the tag name until you hit the first white space, keep walking until you find the closing HTML char (>), then walk on again looking for the matching CLOSING tag, based on the name you just parsed. (Say it's a DIV tag, you want to look for /DIV.)
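A stripped-down sketch of that kind of buffer walk might look like the following. It is deliberately simpler than what is described above: it only drops anything shaped like a tag and keeps the text in between, and it does not try to pair opening and closing tags or decode entities. The name strip_tags_walk is invented for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: walk the buffer once and copy everything that is
// not inside something shaped like a tag.
std::string strip_tags_walk(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '<') {
            // A tag starts with '<', an optional '/', then a letter.
            std::size_t j = i + 1;
            if (j < in.size() && in[j] == '/') ++j;
            bool looks_like_tag =
                j < in.size() &&
                std::isalpha(static_cast<unsigned char>(in[j]));
            if (looks_like_tag) {
                std::size_t close = in.find('>', j);
                if (close != std::string::npos) {
                    i = close + 1;  // skip the whole tag
                    continue;
                }
            }
        }
        // Ordinary text, a lone '<', or an unterminated tag: keep as-is.
        out += in[i];
        ++i;
    }
    return out;
}
```

Because a lone "<" or an unterminated tag is kept as literal text, the result would still need escaping before being assigned to innerHTML.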

I have code that does this in an STL HTML parser, but there are a lot of issues to consider going this route too. For example, you have entity codes to deal with, and single-element tags like IMG, P, and BR, to name a few.

If you want some REALLY great C code to look at, go look at the ClamAV project. They have an HTML parser that strips all the tags out of a page and leaves you with JUST the text (among other things it does). Look in the file libclamav\htmlnorm.c for a great example of 'buffer walking' and parsing. It's not the fastest thing in the world, but it does work. The latest Clam might have so much stuff tied into the HTML parser that it is actually difficult to understand. If so, go back and look at an earlier version, like 0.88.4 or so. Just please be aware of the bugs in those older code bases; there are some good ones. :)

Hope this helps.

Use Qt's QtWebKit to parse the HTML tree, then write the output back out with it. This would definitely clean up the HTML a little bit.

This was posted a few hours ago. It's just an article about regex's, but it happens to contain exactly what you want :) and I think this might be of interest as well.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow