Question

UPDATE 2: http://htmlpurifier.org/phorum/read.php?3,5088,5113 The author has already identified the problem.

UPDATE: The issue appears to be exclusive to version 4.2.0. I have downgraded to 4.1.0 and it works. Thank you for all your help. The package author has been notified.

I am scraping some pages like:

http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215

According to W3C validation it is valid XHTML Strict.

I am then using http://htmlpurifier.org/ to purify the HTML before loading it into a DOMDocument. However, it is only returning a single line of content.

Output:

12:15 Catterick Bridge - Tuesday 1st January 2008 - Timeform | Betfair

Code:

echo $content; # all good
$purifier = new \HTMLPurifier();
$content = $purifier->purify($content);
echo $content; # all bad

BTW, it works for data sourced from another site; for all pages from this domain, it leaves only the title as described above.
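
For reference, here is a fuller sketch of the purify step with an explicit configuration. The doctype and cache settings are illustrative assumptions, not the exact values I used:

require_once 'HTMLPurifier.auto.php';

# Fetch the page (URL from the question above).
$content = file_get_contents('http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215');

# Build an explicit config; these particular settings are assumptions.
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Strict'); # the source pages claim XHTML Strict
$config->set('Cache.SerializerPath', '/tmp');     # writable cache directory (assumed path)

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($content);

echo $clean; # with 4.2.0 this was only the page title; 4.1.0 returns the full body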


Solution

You should not need HTML Purifier. The DOMDocument class will take care of everything for you. However, it will trigger warnings on invalid HTML, so just do this:

$doc = new DOMDocument();
@$doc->loadHTML($content);

Then the warnings will not be emitted, and you can do what you wish with the HTML.
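
If you prefer not to use the @ error-suppression operator, a sketch along these lines also works; the XPath expression is an assumption about the links you want to extract:

$doc = new DOMDocument();
libxml_use_internal_errors(true); # collect parse warnings instead of emitting them
$doc->loadHTML($content);
libxml_clear_errors();

# Pull out every href via DOMXPath.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[@href]') as $link) {
    echo $link->getAttribute('href'), "\n";
}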

If you are scraping links, I would recommend using SimpleXMLElement::xpath(); that is much easier than working with DOMDocument. An example of that:

$xml = new SimpleXMLElement($content);
$result = $xml->xpath('//a/@href'); # '//' matches <a> elements anywhere in the document

print_r($result);

You can write much more complex XPath expressions that allow you to specify class names, IDs, and other attributes. This is much more powerful than DOMDocument.
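
For example, a query like the following selects only links inside a particular container; the id and class names here are made up purely for illustration:

$xml = new SimpleXMLElement($content);
# Hypothetical structure: links with class "runner" inside the element with id "racecard".
$result = $xml->xpath('//div[@id="racecard"]//a[@class="runner"]/@href');

print_r($result);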

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow