Using jsoup to escape disallowed tags

https://stackoverflow.com/questions/9364540

28-10-2019

Question

I am evaluating jsoup for functionality that would escape (but not remove!) non-whitelisted tags. Let's say only the <b> tag is allowed, so the following input

foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>

has to yield the following:

foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;

I see the following problems/questions with jsoup:

1. document.getAllElements() always assumes <html>, <head> and <body>. Yes, I can call document.body().getAllElements(), but the point is that I don't know whether my source is a full HTML document or just the body, and I want the result in the same shape and form as it came in.
2. How do I replace <script>...</script> with &lt;script&gt;...&lt;/script&gt;? I only want to replace the brackets with escaped entities and do not want to alter any attributes, etc. Node.replaceWith sounds like overkill for this.
3. Is it possible to completely switch off pretty printing (e.g. insertion of new lines, etc.)?

Or maybe I should use another framework? I have peeked at htmlcleaner so far, but the given examples don't suggest my desired functionality is supported.

Solution

Answer 1

How do you load / parse your Document with jsoup? If you use parse() or connect().get(), jsoup will automatically normalize your HTML (inserting html, head and body tags). This ensures you always have a complete HTML document, even if the input isn't complete.

Assuming you only want to clean the input (no further processing), you should use clean() instead of the methods listed above.

Example 1 - Using parse()

final String html = "<b>a</b>";

System.out.println(Jsoup.parse(html));

Output:

<html>
 <head></head>
 <body>
  <b>a</b>
 </body>
</html>

The input HTML is completed so that you always get a full document.

Example 2 - Using clean()

final String html = "<b>a</b>";

// Pass the variable declared above; newer jsoup releases rename Whitelist to Safelist.
System.out.println(Jsoup.clean(html, Whitelist.relaxed()));

Output:

<b>a</b>

The input HTML is cleaned, nothing more.

Documentation:

Jsoup

Answer 2

The method replaceWith() does exactly what you need:

Example:

final String html = "<b><script>your script here</script></b>";
Document doc = Jsoup.parse(html);

for( Element element : doc.select("script") )
{
    element.replaceWith(TextNode.createFromEncoded(element.toString(), null));
}

System.out.println(doc);

Output:

<html>
 <head></head>
 <body>
  <b>&lt;script&gt;your script here&lt;/script&gt;</b>
 </body>
</html>

Or body only:

System.out.println(doc.body().html());

Output:

<b>&lt;script&gt;your script here&lt;/script&gt;</b>
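
The loop above hard-codes the script selector and still goes through a full parse, so the html/head/body wrapper and the pretty printing have to be dealt with separately. For comparison, here is a minimal parser-free sketch in plain Java (no jsoup involved) that escapes only the angle brackets of non-allowed tags and leaves everything else exactly as it came in. The class name TagEscaper, the method escapeDisallowed and the regular expression are my own assumptions for illustration, not part of jsoup. It does not check attributes on allowed tags, does not cope with a > inside a quoted attribute value, and ignores comments and unterminated tags, so treat it as a demonstration of the requested output shape rather than a hardened sanitizer.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a jsoup class: escapes the brackets of tags that are not allowed.
class TagEscaper {
    // Opening or closing tag; group 1 is the tag name (everything up to whitespace, '/' or '>').
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][^\\s/>]*+)[^>]*>");

    static String escapeDisallowed(String html, Set<String> allowed) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(html, last, m.start());
            String tag = m.group();
            if (allowed.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                out.append(tag); // allowed tag: keep as-is (attributes are NOT checked)
            } else {
                // Disallowed tag: swap only the outer '<' and '>' for entities.
                out.append("&lt;").append(tag, 1, tag.length() - 1).append("&gt;");
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "foo <b>bar</b> <script onLoad='stealYourCookies();'>baz</script>";
        System.out.println(escapeDisallowed(in, Collections.singleton("b")));
        // prints: foo <b>bar</b> &lt;script onLoad='stealYourCookies();'&gt;baz&lt;/script&gt;
    }
}
```

Because nothing is parsed into a document, a fragment stays a fragment and no new lines are inserted. The price is that malformed markup is not understood the way a real parser would understand it.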

Answer 3

Yes, the prettyPrint() method of Document.OutputSettings does this.

Example:

final String html = "<p>your html here</p>";

Document doc = Jsoup.parse(html);
doc.outputSettings().prettyPrint(false);

System.out.println(doc);

Note: if the outputSettings() method is not available, please update Jsoup.

Output:

<html><head></head><body><p>your html here</p></body></html>

Documentation:

Document.OutputSettings.prettyPrint(boolean pretty)

Answer 4 (no bullet)

No! jsoup is one of the best and most capable HTML libraries out there!

Licensed under: CC-BY-SA with attribution
