Question

Wondering if anyone else has come across this problem and if they've found a solution.

I have an application that parses webpages (that I don't control) using QueryPath's htmlqp() method.

The problem is that whenever a parsed page contains an inline <script> tag whose JavaScript includes an HTML string, QueryPath's writeHTML() method tries to "fix" that HTML by inserting line breaks, closing tags, and other nonsense into the JavaScript, which in turn breaks the JavaScript (and in some cases the HTML) on the page.

For example:

<script>
     var $jQ = jQuery.noConflict();
     // Use jQuery via $jQ(...)
     $jQ(document).ready(function(){
       $jQ("#mktFrmSubmit").wrap("<div class='buttonSubmit'></div>");
       $jQ(".buttonSubmit").prepend("<span></span>");
     });
</script>

-becomes-

<script>
     var $jQ = jQuery.noConflict();
     // Use jQuery via $jQ(...)
     $jQ(document).ready(function(){
       $jQ("#mktFrmSubmit").wrap("<div class='buttonSubmit'></script>
</div>");
       $jQ(".buttonSubmit").prepend("<span></span>");
     });

The latter is clearly broken: the injected </script> tag cuts the string literal, and the script block itself, off mid-statement.

Does anyone know how to keep QueryPath from doing this? Or maybe get it to just ignore what's in the body of <script> tags in general?

Thanks.


Solution

We suggest using the HTML5-PHP library to parse the HTML. The older HTML 4.01 parser built into PHP (via libxml) does not handle inline JavaScript well, whereas the newer HTML5-PHP library was built to handle such cases.

Here's the library:

https://github.com/Masterminds/html5-php

And Matt Farina wrote up an excellent intro to using these two libraries together:

http://engineeredweb.com/blog/2014/querypath-html5-php/
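
As a rough illustration of the suggestion above, here is a minimal sketch that round-trips the script from the question through HTML5-PHP instead of libxml's HTML parser. It assumes the library was installed with Composer (masterminds/html5) and that vendor/autoload.php is in the working directory; the variable names are only for illustration.

```php
<?php
// Assumption: masterminds/html5 is installed via Composer.
require 'vendor/autoload.php';

use Masterminds\HTML5;

// Nowdoc, so PHP does not try to interpolate $jQ.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head><title>Example</title></head>
<body>
<script>
     var $jQ = jQuery.noConflict();
     $jQ(document).ready(function(){
       $jQ("#mktFrmSubmit").wrap("<div class='buttonSubmit'></div>");
       $jQ(".buttonSubmit").prepend("<span></span>");
     });
</script>
</body>
</html>
HTML;

$html5 = new HTML5();

// Parse with the HTML5 parser; the result is a regular DOMDocument.
$dom = $html5->loadHTML($html);

// Serialize with the HTML5 writer rather than DOMDocument::saveHTML().
$out = $html5->saveHTML($dom);

print $out;
```

Because loadHTML() returns an ordinary DOMDocument, it should be possible to hand $dom to QueryPath's qp() for querying, which is the combination the linked article describes. Serializing through $html5->saveHTML() rather than writeHTML() is what keeps the script body out of the old HTML writer.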

Licensed under: CC-BY-SA with attribution