Getting cleaned HTML in text from HtmlCleaner

https://stackoverflow.com/questions/7195812

12-01-2021
|

Question

I want to see the cleaned HTML that we get from HTMLCleaner. I see there is a method called serialize on TagNode, however don't know how to use it. Does anybody have any sample code for it?

Thanks Nayn

Solution

Here's the sample code:

HtmlCleaner htmlCleaner = new HtmlCleaner();

TagNode root = htmlCleaner.clean(url);

HtmlCleaner.getInnerHtml(root);

String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";

OTHER TIPS

Use a subclass of org.htmlcleaner.XmlSerializer, for example:

// get the element you want to serialize
HtmlCleaner cleaner     = new HtmlCleaner();
TagNode     rootTagNode = cleaner.clean(url);

// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
cleanerProperties.setOmitXmlDeclaration(true);

// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String        html          = xmlSerializer.getAsString(rootTagNode);

XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);

String html = xmlSerializer.getAsString(rootTagNode);

the method above has a problem,it will trim content in html label, for example,

this is paragraph1.

 will become

this is paragraph1.

and it is getSingleLineOfChildren function does the trim operation. So if we fetch data from website and want to keep the format like tuckunder.

PS:if a html label has children label,the parent label contetn will not be trimed,

for example <p> this is paragraph1. <a>www.xxxxx.com</a> </p> will keep whitespace before "this is paragraph1"

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow