Cleaning mixed type <script> tags

https://stackoverflow.com/questions/2713454

01-10-2019

Question

I'm cleaning HTML using cyberneko and xerces. However, some $#@@!@@ websites still use BOTH forms:

<script>...</script> and <script.../>

So what happens is this: given

<script..../> <div> Some Text </div> <script> scripting stuff </script>

neko parses the whole line as one script, so I get

<script..../> &lt;div&gt; Some Text &lt;/div&gt; &lt;script&gt; scripting stuff </script>

and then I lose all the content in between :(

Any advice?

Solution

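Using <script /> is illegal in HTML. It is legal in XML (and in XHTML that is actually parsed as XML). An HTML parser ignores the trailing slash and treats <script /> as an ordinary opening tag, so everything up to the next </script> is read as script text. I don't know why some people still write HTML the XML way, but it's wrong, and it breaks most parsers (like SO's), by design.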

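Another thing to watch for: if you use XML parsers, dom4j, or anything else that depends on them, make sure you're not passing your string through an XML parser and then an HTML parser. An XML serializer may write an empty <script></script> back out as <script/>, and the HTML parser will then swallow everything after it as script content, exactly as in the question.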
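
If you can't fix the pages themselves, one possible workaround (my own sketch, not something neko or xerces provides) is to rewrite the XML-style <script .../> into <script ...></script> before the string reaches the HTML parser. The class and method names below are made up for illustration. It is a plain regex pass, so it assumes reasonably sane markup: it will also rewrite a literal <script .../> that appears inside another script's text or inside a comment.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of NekoHTML, Xerces, or dom4j.
final class ScriptTagNormalizer {

    // Matches an XML-style self-closed script tag such as <script src="a.js"/>.
    // Group 1 is the tag name (kept so the original case survives),
    // group 2 is the attribute text. Quoted attribute values are consumed
    // as a unit, so a '>' inside quotes does not end the tag early.
    private static final Pattern SELF_CLOSED_SCRIPT = Pattern.compile(
            "<(script)(?=[\\s/>])((?:[^>\"']|\"[^\"]*\"|'[^']*')*?)\\s*/>",
            Pattern.CASE_INSENSITIVE);

    private ScriptTagNormalizer() {}

    // Returns the input with every <script .../> expanded to <script ...></script>.
    static String expandSelfClosingScripts(String html) {
        return SELF_CLOSED_SCRIPT.matcher(html).replaceAll("<$1$2></$1>");
    }

    public static void main(String[] args) {
        String dirty = "<script src=\"a.js\"/> <div> Some Text </div> "
                + "<script> scripting stuff </script>";
        // Prints:
        // <script src="a.js"></script> <div> Some Text </div> <script> scripting stuff </script>
        System.out.println(expandSelfClosingScripts(dirty));
    }
}
```

Run the page source through expandSelfClosingScripts first and hand the result to the HTML parser; a normal <script>...</script> pair is left untouched because the pattern only matches a tag that ends in />.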

Licensed under: CC-BY-SA with attribution
