Question

I need to make some simple modifications to HTML in C++, preferably without the whole document being rewritten, which is what happens when I use libxml2 or MSHTML.

In particular I need to be able to read, and then (potentially) modify, the "src" attribute of all "img" elements. I need it to be robust enough to be able to do this with any valid HTML, but preferably without changing any of the other HTML in the process.

Are there any libraries out there that would be able to handle this? Or is this something I can do with regular expressions? I'm not too savvy with regular expressions, and I've read a lot of questions here that say you shouldn't use them to parse HTML, but I'm not clear if that applies to something like this or if that principle applies primarily to parsing in the context of building a tree from the HTML.

Solution

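Regular expressions aren't recommended for HTML in general because they can't match arbitrarily nested tags. That limitation doesn't matter here: an img element can't contain other elements, so finding each img tag and its src attribute doesn't require tracking any nesting. They should be fine for this purpose.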
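As a rough illustration of the regex approach, here is a minimal sketch using std::regex. The function name rewrite_img_src and the callback parameter are made up for this example and are not from any library. It copies every byte outside the matched src values unchanged, so the rest of the HTML is left alone. It assumes no earlier attribute in the same img tag contains a literal ">" inside its quoted value, and it does not skip img-like text inside comments or script blocks.

```cpp
#include <functional>
#include <regex>
#include <string>

// Hypothetical helper (not part of any library): calls `rewrite` on the value
// of the src attribute of every <img ...> tag and splices the result back in.
// Everything outside those values is copied through byte for byte.
//
// Assumptions: attribute values that appear before src in the same tag do not
// contain '>', and `rewrite` returns text that is safe to place inside the
// original quoting (it is inserted verbatim, with no escaping).
std::string rewrite_img_src(
    const std::string& html,
    const std::function<std::string(const std::string&)>& rewrite) {
    // Group 1: "<img ... src=" prefix (requires whitespace before src, so
    //          attributes such as data-src are not mistaken for src).
    // Group 2: double-quoted value, group 3: single-quoted, group 4: unquoted.
    static const std::regex img_src(
        R"re((<img\b[^>]*?\ssrc\s*=\s*)(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))re",
        std::regex::ECMAScript | std::regex::icase);

    std::string out;
    auto pos = html.cbegin();
    for (std::sregex_iterator it(html.cbegin(), html.cend(), img_src), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Pick whichever of the three value groups took part in the match.
        const int g = m[2].matched ? 2 : (m[3].matched ? 3 : 4);
        out.append(pos, m[g].first);   // untouched text up to the value
        out += rewrite(m[g].str());    // replacement value
        pos = m[g].second;             // resume right after the old value
    }
    out.append(pos, html.cend());      // untouched tail of the document
    return out;
}
```

Calling it with a lambda such as [](const std::string& s) { return "cdn/" + s; } would prefix every image URL while leaving the surrounding markup, including the original quote style, as it was. If the input can contain ">" inside other attribute values, a tokenizing parser is the safer choice.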

OTHER TIPS

Try looking at HTML Tidy.

I have used it for similar things in the past.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow