Question

I'm trying to use DOMDocument to retrieve a list of HTML nodes for a CSS3 tree visualization of HTML (disclaimer: my own page). However, the following code, the smallest example that reproduces the problem, does not behave as I expect:

// Compressed, testing html
$text = '<!DOCTYPE html><html><head><title>Demo</title></head><body>Bla</body></html>';

// Create and load the html
$dom = new DOMDocument();
$dom->loadHTML($text);

// Error test: it prints two html nodes instead of one
foreach ($dom->childNodes as $node)
  echo $node->nodeName;

The expected result is html, since I'm working with an HTML document and that's the top node. However, I get htmlhtml: the first node is empty (I checked separately), while the second one holds all the useful content. So:

Why are there two html nodes in a single-node html document?

Note: I've tried removing the <!DOCTYPE html> and the <html> tags, but neither made a difference.

If you can't answer the previous question for any reason, please at least answer this: will there always be two html nodes, with the second one containing all the useful content?

Solution

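The first node is a document type node (nodeType 10, XML_DOCUMENT_TYPE_NODE), which appears to be created every time, even when the input has no <!DOCTYPE>. Its nodeName is the name of the doctype, html, which is why the loop prints html twice. I suppose DOMDocument needs it to process the rest of the document.

The second node is the root <html> element, that is, your "real" document.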
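
A minimal sketch to confirm this, assuming the same test markup as in the question. It prints each top-level node's name next to its nodeType so the two html entries can be told apart. The helper name describeTopLevelNodes is made up for illustration and is not part of DOMDocument.

```php
<?php
// Hypothetical helper (not part of DOMDocument): returns "name:type"
// for every top-level node of the document.
function describeTopLevelNodes(DOMDocument $dom): array
{
    $result = [];
    foreach ($dom->childNodes as $node) {
        $result[] = $node->nodeName . ':' . $node->nodeType;
    }
    return $result;
}

$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><head><title>Demo</title></head><body>Bla</body></html>');

// XML_DOCUMENT_TYPE_NODE is 10, XML_ELEMENT_NODE is 1.
foreach (describeTopLevelNodes($dom) as $line) {
    echo $line, "\n";
}
```

The first line should carry type 10 (the doctype) and the second type 1 (the root element), even though both report the name html.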

You can see the contents quickly like so:

$text = '<!DOCTYPE html><html><head><title>Demo</title><script>var a=10;</script></head><body>Bla</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($text);

foreach ($dom->childNodes as $node)
{
    echo $node->nodeName;
    echo "<pre>"; print_r($node); echo "</pre><br>\n";
    echo "<pre>"; print_r(getArray($node)); echo "</pre><br>";
    echo "<br>================================<br>";
}

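// Recursively converts a node into a nested array of its attributes and children.
function getArray($node)
{
    // Start from an empty array: assigning keys to false is deprecated since PHP 8.1.
    $array = [];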

    if ($node->hasAttributes())
    {
        foreach ($node->attributes as $attr)
        {
            $array[$attr->nodeName] = $attr->nodeValue;
        }
    }

    if ($node->hasChildNodes())
    {
        if ($node->childNodes->length == 1)
        {
            $array[$node->firstChild->nodeName] = $node->firstChild->nodeValue;
        }
        else
        {
            foreach ($node->childNodes as $childNode)
            {
                if ($childNode->nodeType != XML_TEXT_NODE)
                {
                    $array[$childNode->nodeName][] = getArray($childNode);
                }
            }
        }
    }

    return $array;
}

Other tips

If no doctype is present in the input passed to loadHTML($input), loadHTML adds its own default doctype. Moreover, as you mentioned yourself, removing the html tags yields the same result.

If you run this code:

$text = '<head><title>Demo</title></head><body>Bla</body>';

$dom = new DOMDocument();
$dom->loadHTML($text);

echo $dom->saveHTML();

The output will be:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Demo</title></head><body>Bla</body></html>

The answer is yes: with loadHTML used as above, there will always be two top-level nodes named html. The first is the doctype and the second is the root element that holds the DOM structure.
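
Rather than relying on the position of the second node, one option is to ask the document for its root element through the documentElement property, or to filter the top-level nodes by nodeType. A minimal sketch, with the variable names $root and $elements chosen only for illustration:

```php
<?php
$text = '<head><title>Demo</title></head><body>Bla</body>';

$dom = new DOMDocument();
$dom->loadHTML($text);

// documentElement is the root <html> element, never the doctype.
$root = $dom->documentElement;
echo $root->nodeName, ' has ', $root->childNodes->length, " child node(s)\n";

// Alternatively, keep only element nodes among the top-level nodes.
$elements = [];
foreach ($dom->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $elements[] = $node;
    }
}
echo count($elements), " top-level element(s)\n";
```

Either approach keeps the tree-building code independent of whether a doctype node happens to come first.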

PS: loadHTML also closes unclosed tags automatically. See the documentation for DOMDocument::loadHTML.
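
A short sketch of that behaviour, assuming a deliberately truncated fragment as input. The call to libxml_use_internal_errors(true) is only there to keep any parser warnings out of the output; the exact serialization may vary with the libxml version, so no output is shown here.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Neither <p> nor <b> is closed in the input.
$dom->loadHTML('<p>Unclosed <b>bold');

libxml_clear_errors();

// The serialized document contains the closing tags the parser added.
$html = $dom->saveHTML();
echo $html;
```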

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow