Assume $html_doc contains a page that has HTML entities in it. In the output below, those entities come out as garbled characters.

$html_dom = new DOMDocument();
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);

$query   = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
    echo html_entity_decode($my_foo->nodeValue);
    die;
}

How do I handle this properly so that I don't get weird characters? I tried the following with no success:

$html_doc = mb_convert_encoding($html_doc, 'HTML-ENTITIES', 'UTF-8');
$html_dom = new DOMDocument();
$html_dom->resolveExternals = TRUE;
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);

$query   = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
    echo html_entity_decode($my_foo->nodeValue);
    die;
}

Solution

mb_convert_encoding was a good idea, but it does not work as expected when applied to the input document, because DOMDocument seems to be a little bit buggy when it comes to encoding.

Moving the mb_convert_encoding to the actual node output did the trick.

$html_dom = new DOMDocument();
$html_dom->resolveExternals = TRUE;
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);

$query   = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
    echo mb_convert_encoding($my_foo->nodeValue, 'HTML-ENTITIES', 'UTF-8');
    die;
}
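
For what it's worth, here is a rough sketch of the same idea (encode at the point of output, not before loading) as a self-contained script. The sample HTML string and the helper name foo_paragraphs_as_entities are made up for illustration. It also swaps in mb_encode_numericentity for the 'HTML-ENTITIES' conversion, since that pseudo-encoding is deprecated as of PHP 8.2, and uses libxml_use_internal_errors() in place of the @ operator. It assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: returns the text of each matching <p>, with every
// non-ASCII character rewritten as a numeric HTML entity.
function foo_paragraphs_as_entities(string $html_doc): array
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $html_dom = new DOMDocument();
    $html_dom->loadHTML($html_doc);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($html_dom);

    // Convert every code point from U+0080 to U+10FFFF into &#NNN; form.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    $result = [];
    foreach ($xpath->query('//div[@class="foo"]/div/p') as $my_foo) {
        // The DOM extension hands back nodeValue as UTF-8.
        $result[] = mb_encode_numericentity($my_foo->nodeValue, $map, 'UTF-8');
    }
    return $result;
}

// Sample input (made up): entities only, so the source itself is plain ASCII.
$html_doc = '<div class="foo"><div><p>caf&eacute;&nbsp;na&iuml;ve</p></div></div>';

foreach (foo_paragraphs_as_entities($html_doc) as $text) {
    echo $text, PHP_EOL; // caf&#233;&#160;na&#239;ve
}
```

Like the mb_convert_encoding call above, this only touches non-ASCII characters; it leaves <, > and & alone, so run the text through htmlspecialchars first if it may contain markup characters.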