Question

How do I strip all attributes from HTML tags in a string, except "alt" and "src" using Java?

And further: how do I get the values of all "src" attributes in the string?

:)

Solution 2

OK, solved this somehow.

I used the HtmlCleaner library to parse the input into well-formed markup.

Then I used a DOM parser to iterate over everything and strip all disallowed tags and attributes.

(and some minor ugly hacks;) )

This was kind of a lot of work.
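
For anyone attempting the same thing, here is a minimal sketch of the DOM-walking step, using only the JDK's built-in XML APIs. It is not the code from this answer. It assumes the input has already been cleaned into well-formed XML (for example by HtmlCleaner) and has no DOCTYPE that would trigger a DTD download. The class name AttributeStripper and the method clean are made up for illustration, and the sketch only handles attributes, not disallowed tags.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class AttributeStripper {
    // Attributes that survive the cleanup (Set.of needs Java 9+).
    private static final Set<String> ALLOWED = Set.of("alt", "src");

    /**
     * Parses well-formed markup, removes every attribute except alt/src,
     * appends each src value to srcValues in document order,
     * and returns the serialized result.
     */
    static String clean(String xhtml, List<String> srcValues) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement(), srcValues);

            // A Transformer without a stylesheet just copies the DOM to the output.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process markup", e);
        }
    }

    private static void strip(Element el, List<String> srcValues) {
        NamedNodeMap attrs = el.getAttributes();
        // Walk backwards: the map is live and shrinks as attributes are removed.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            Node attr = attrs.item(i);
            String name = attr.getNodeName();
            if (!ALLOWED.contains(name.toLowerCase(Locale.ROOT))) {
                el.removeAttribute(name);
            } else if (name.equalsIgnoreCase("src")) {
                srcValues.add(attr.getNodeValue());
            }
        }
        // Recurse into child elements only; text and comments are left alone.
        for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                strip((Element) child, srcValues);
            }
        }
    }
}
```

Calling AttributeStripper.clean(markup, list) answers both parts of the question in one pass: the return value is the stripped markup, and the list ends up holding the "src" values.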

OTHER TIPS

You can:

  • Implement a SAX handler that filters attributes as the document is parsed;
  • Build a document with a DOM parser, walk it, prune it, and then convert it back to HTML; or
  • Use an identity transform in XSLT (assuming your HTML is in XHTML format or can be converted to that with, say, JTidy) with some additional cases to remove attributes you don't want.
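
As a rough sketch of the third option, the class below runs an identity transform with one extra template that drops every attribute other than "alt" and "src". It uses the XSLT 1.0 processor that ships with the JDK and assumes the input is already well-formed XHTML without a DOCTYPE. The class name XsltAttributeFilter and the method filter are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of any library.
final class XsltAttributeFilter {
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
        // Identity template: copy every node and attribute unchanged.
        + "  <xsl:template match='@*|node()'>"
        + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "  </xsl:template>"
        // Empty template: attributes other than alt/src produce no output.
        // The predicate gives it a higher default priority than the identity rule.
        + "  <xsl:template match=\"@*[not(name()='alt' or name()='src')]\"/>"
        + "</xsl:stylesheet>";

    /** Returns the input markup with all attributes except alt/src removed. */
    static String filter(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

This version only strips attributes. Collecting the "src" values would need a separate pass, for instance an XPath query for //@src on the parsed document.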

Whatever you do, don't try to do it with regular expressions.

Licensed under: CC-BY-SA with attribution