Question

I know there are already multiple questions along these lines, but I couldn't find anything close enough to my problem. I want to parse some XML that looks something like this. Only a few elements (maybe only <text/>) will have mixed markup; the rest can all be easily parsed with SimpleXML:

<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>

I'm already using SimpleXML for most of the structure, however, when I get to the <text/> element I don't know how to read the parts separately (i.e. "A", "c" & "e." should be text, <x/> & <y/> should be elements) and in left-to-right order. All I can do is get all of the text without the markup or just the child elements without the text. If this is not possible in SimpleXML can I achieve this with DOM or XMLReader? I've been trying to turn the <text/> element into a DOMNodeList (so in this example I would have a list of five nodes) but I haven't been successful so far. What I've tried so far is:

dom_import_simplexml($xml)->getElementsByTagName('element'); // All <element/> elements
dom_import_simplexml($xml->element)->getElementsByTagName('text'); // Only one element, <text/>

There doesn't seem to be a method that returns a list of all child nodes (both text and tags) of a specific element. Are there any other classes in PHP that could do the job that I have overlooked? As far as I can tell so far SimpleXML can only fully parse XML where each element contains only text, only other elements or is empty.


Solution

The following code does what I want using XMLReader, XMLReader::read() and XMLReader::nodeType:

<?php
$refl = new ReflectionClass('XMLReader');
$xml_consts = $refl->getConstants();
$xml = <<<XML
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;
$reader = new XMLReader();
$reader->XML($xml);
// For validation only
$reader->setParserProperty(XMLReader::VALIDATE, true);
if ($reader->isValid()) {
    print("No matter what people say, this XML is valid!\n\n");
}
// Prevent warnings about missing DTD
$reader->setParserProperty(XMLReader::VALIDATE, false);
while ($reader->read()) {
    $info = ': ';
    switch ($reader->nodeType) {
        case XMLReader::TEXT:
            $info .= "'$reader->value'";
            break;
        case XMLReader::ELEMENT:
            $info .= "<$reader->name>";
            break;
        case XMLReader::END_ELEMENT:
            $info .= "</$reader->name>";
            break;
        default:
            $info = '';
    }
    print(array_search($reader->nodeType, $xml_consts)  . $info . PHP_EOL);
}
?>

It outputs:

No matter what people say, this XML is valid!

ELEMENT: <root>
SIGNIFICANT_WHITESPACE
ELEMENT: <element>
SIGNIFICANT_WHITESPACE
ELEMENT: <text>
TEXT: 'A '
ELEMENT: <x>
TEXT: 'b'
END_ELEMENT: </x>
TEXT: ' c '
ELEMENT: <y>
TEXT: 'd'
END_ELEMENT: </y>
TEXT: ' e.'
END_ELEMENT: </text>
SIGNIFICANT_WHITESPACE
END_ELEMENT: </element>
SIGNIFICANT_WHITESPACE
END_ELEMENT: </root>
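
For the SimpleXML-based workflow described in the question, DOM's childNodes property may be a shorter route to the same result. The following is a minimal sketch, not part of the original answer: it assumes the document is loaded with simplexml_load_string() and that only the direct children of <text/> are of interest. The $parts array and its [kind, name, value] layout are hypothetical names chosen for illustration.

```php
<?php
$xml = <<<'XML'
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;

$sxe = simplexml_load_string($xml);

// dom_import_simplexml() exposes the DOMElement behind a SimpleXML node.
$text = dom_import_simplexml($sxe->element->text);

// Hypothetical result structure: one [kind, name, value] triple per child node.
$parts = [];
foreach ($text->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $parts[] = ['element', $child->nodeName, $child->textContent];
    } elseif ($child->nodeType === XML_TEXT_NODE
        || $child->nodeType === XML_CDATA_SECTION_NODE) {
        $parts[] = ['text', null, $child->nodeValue];
    }
}

// Print the parts in left-to-right (document) order.
foreach ($parts as [$kind, $name, $value]) {
    echo $kind === 'element' ? "ELEMENT <$name>: '$value'" : "TEXT: '$value'", PHP_EOL;
}
```

For the sample document this yields five parts: three text nodes and the two elements <x/> and <y/>, in the order they appear. The rest of the structure can still be handled through SimpleXML, since both APIs operate on the same underlying document.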

Other tips

You can use DOM and XPath for that, too. The following example iterates over all element nodes and all non-whitespace text nodes in document order. The nice thing about this approach is that any node can serve as the context for further XPath expressions.

$xml = <<<'XML'
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$nodes = $xpath->evaluate(
 '//*|//text()[normalize-space(.) != ""]'
);

foreach ($nodes as $node) {
  switch ($node->nodeType) {
  case XML_ELEMENT_NODE :
    var_dump("ELEMENT: ".$node->localName);
    break;
  case XML_TEXT_NODE :
  case XML_CDATA_SECTION_NODE :
    var_dump("TEXT: ".$node->textContent);
    break;
  }
}

Output: https://eval.in/152418

string(13) "ELEMENT: root"
string(16) "ELEMENT: element"
string(13) "ELEMENT: text"
string(8) "TEXT: A "
string(10) "ELEMENT: x"
string(7) "TEXT: b"
string(9) "TEXT:  c "
string(10) "ELEMENT: y"
string(7) "TEXT: d"
string(9) "TEXT:  e."
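
The point about context nodes can be illustrated by restricting the query to the <text/> element from the question. This is a sketch under assumptions: the path /root/element/text matches the sample document only, and the variable names $labels, $elementCount and $firstX are hypothetical. The second argument of DOMXPath::evaluate() is the context node, so relative expressions are resolved against it.

```php
<?php
$xml = <<<'XML'
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Fetch the <text> element once; it becomes the context for later expressions.
$text = $xpath->evaluate('/root/element/text')->item(0);

// '*|text()' is relative: it selects only the direct element and text children.
$labels = [];
foreach ($xpath->evaluate('*|text()', $text) as $node) {
    $labels[] = $node instanceof DOMElement
        ? 'ELEMENT: ' . $node->localName
        : 'TEXT: ' . $node->textContent;
}

// evaluate() can also return scalars for the same context node.
$elementCount = $xpath->evaluate('count(*)', $text); // number of child elements
$firstX = $xpath->evaluate('string(x)', $text);      // text content of the first <x>

print_r($labels);
```

Unlike the document-wide expression above, this version does not filter whitespace-only text nodes, which is harmless here because <text/> contains none.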
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow