Question

I am evaluating jsoup for functionality that would sanitize (but not remove!) non-whitelisted tags. Let's say only the <b> tag is allowed, so the following input

foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>

has to yield the following:

foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;

I see the following problems/questions with jsoup:

  • document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements() but the point is that I don't know if my source is a full HTML document or just the body -- and I want the result in the same shape and form as it came in;
  • How do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this;
  • Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?

Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.

Solution

Answer 1

How do you load / parse your Document with jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, head and body tags). This ensures you always have a complete HTML document, even if the input isn't complete.

If you only want to clean an input (with no further processing), you should use clean() instead of the methods listed above.

Example 1 - Using parse()

final String html = "<b>a</b>";

System.out.println(Jsoup.parse(html));

Output:

<html>
 <head></head>
 <body>
  <b>a</b>
 </body>
</html>

The input is wrapped in html, head and body tags so that the result is a complete document.

Example 2 - Using clean()

final String html = "<b>a</b>";

// In jsoup 1.14.1 and later, Whitelist is named Safelist.
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));

Output:

<b>a</b>

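The input is cleaned and nothing more: no html, head or body tags are added. Note that clean() drops tags that are not on the whitelist rather than escaping them.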

Answer 2

The method replaceWith() does exactly what you need:

Example:

final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);

for( Element element : doc.select("script") )
{
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}

System.out.println(doc);

Output:

<html>
 <head></head>
 <body>
  <b>&lt;script&gt;your script here&lt;/script&gt;</b>
 </body>
</html>

Or body only:

System.out.println(doc.body().html());

Output:

<b>&lt;script&gt;your script here&lt;/script&gt;</b>
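If all you need is the bracket escaping from the question, the same idea can also be sketched without jsoup. This is only a minimal illustration under stated assumptions, not jsoup's API: the class name EscapeExceptAllowed and the method escapeExcept are hypothetical. It escapes every angle bracket first and then restores only the exact, attribute-free, lowercase forms of the allowed tags, so an allowed tag that carries attributes stays escaped. One known limitation: text that was already written as &lt;b&gt; in the input would be turned back into a real <b> tag.

```java
import java.util.List;

// Hypothetical helper, not part of jsoup: escapes all angle brackets,
// then restores the plain open/close forms of the allowed tags.
class EscapeExceptAllowed {

    static String escapeExcept(String html, List<String> allowedTags) {
        // Step 1: escape every bracket, so nothing is left as markup.
        String out = html.replace("<", "&lt;").replace(">", "&gt;");

        // Step 2: restore only exact matches such as <b> and </b>.
        // Tags with attributes or different casing are left escaped.
        for (String tag : allowedTags) {
            out = out.replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                     .replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        // Prints: foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
        System.out.println(escapeExcept(input, List.of("b")));
    }
}
```

Because nothing is parsed, the input keeps its original shape (fragment or full document) and no whitespace is inserted, but you also lose everything a real parser gives you, so treat it as a starting point only.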

Answer 3

Yes, the prettyPrint() method of Document.OutputSettings does this.

Example:

final String html = "<p>your html here</p>";

Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);

System.out.println(doc);

Note: if the outputSettings() method is not available, please update Jsoup.

Output:

<html><head></head><body><p>your html here</p></body></html>

Answer 4 (no bullet)

No! Jsoup is one of the best and most capable HTML libraries out there!

Licensed under: CC-BY-SA with attribution