Question

I want to see the cleaned HTML that we get from HTMLCleaner. I see there is a method called serialize on TagNode, however don't know how to use it. Does anybody have any sample code for it?

Thanks Nayn

Was it helpful?

Solution

Here's the sample code:

HtmlCleaner htmlCleaner = new HtmlCleaner();

TagNode root = htmlCleaner.clean(url);

HtmlCleaner.getInnerHtml(root);

String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";

OTHER TIPS

Use a subclass of org.htmlcleaner.XmlSerializer, for example:

// get the element you want to serialize
HtmlCleaner cleaner     = new HtmlCleaner();
TagNode     rootTagNode = cleaner.clean(url);

// set up properties for the serializer (optional, see online docs)
CleanerProperties cleanerProperties = cleaner.getProperties();
cleanerProperties.setOmitXmlDeclaration(true);

// use the getAsString method on an XmlSerializer class
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
String        html          = xmlSerializer.getAsString(rootTagNode);
XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);

String html = xmlSerializer.getAsString(rootTagNode);

the method above has a problem,it will trim content in html label, for example,

this is paragraph1.

 will become 

this is paragraph1.

and it is getSingleLineOfChildren function does the trim operation. So if we fetch data from website and want to keep the format like tuckunder.

PS:if a html label has children label,the parent label contetn will not be trimed,

for example <p> this is paragraph1. <a>www.xxxxx.com</a> </p> will keep whitespace before "this is paragraph1"

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top