What's the easiest way to parse mixed text and element markup in XML?

StackOverflow https://stackoverflow.com/questions/23666904

  •  22-07-2023

Question

I know there are already multiple questions along these lines, but I couldn't find anything close enough to my problem. I want to parse some XML that looks something like the sample below. Only a few elements (maybe only <text/>) will have mixed markup; the rest can all be easily parsed with SimpleXML:

<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>

I'm already using SimpleXML for most of the structure. However, when I get to the <text/> element I don't know how to read its parts separately and in left-to-right order (i.e. "A", "c" and "e." should be text; <x/> and <y/> should be elements). All I can do is get all of the text without the markup, or just the child elements without the text. If this is not possible in SimpleXML, can I achieve it with DOM or XMLReader? I've been trying to turn the content of the <text/> element into a DOMNodeList (so in this example I would have a list of five nodes), but I haven't been successful. What I've tried so far is:

dom_import_simplexml($xml)->getElementsByTagName('element'); // All <element/> elements
dom_import_simplexml($xml->element)->getElementsByTagName('text'); // Only one element, <text/>

There doesn't seem to be a method that returns a list of all child nodes (both text and tags) of a specific element. Are there any other classes in PHP that could do the job and that I have overlooked? As far as I can tell, SimpleXML can only fully parse XML where each element contains only text, contains only other elements, or is empty.

Solution

The following code does what I want: it walks the document with XMLReader::read() and branches on XMLReader::nodeType:

<?php
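// Collect the XMLReader class constants so that numeric node types
// can be mapped back to their names when printing.
$refl = new ReflectionClass('XMLReader');
$xml_consts = $refl->getConstants();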
$xml = <<<XML
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;
$reader = new XMLReader();
$reader->XML($xml);
// For validation only
$reader->setParserProperty(XMLReader::VALIDATE, true);
if ($reader->isValid()) {
    print("No matter what people say, this XML is valid!\n\n");
}
// Prevent warnings about missing DTD
$reader->setParserProperty(XMLReader::VALIDATE, false);
while ($reader->read()) {
    $info = ': ';
    switch ($reader->nodeType) {
        case XMLReader::TEXT:
            $info .= "'$reader->value'";
            break;
        case XMLReader::ELEMENT:
            $info .= "<$reader->name>";
            break;
        case XMLReader::END_ELEMENT:
            $info .= "</$reader->name>";
            break;
        default:
            $info = '';
    }
    print(array_search($reader->nodeType, $xml_consts)  . $info . PHP_EOL);
}
?>

It outputs:

No matter what people say, this XML is valid!

ELEMENT: <root>
SIGNIFICANT_WHITESPACE
ELEMENT: <element>
SIGNIFICANT_WHITESPACE
ELEMENT: <text>
TEXT: 'A '
ELEMENT: <x>
TEXT: 'b'
END_ELEMENT: </x>
TEXT: ' c '
ELEMENT: <y>
TEXT: 'd'
END_ELEMENT: </y>
TEXT: ' e.'
END_ELEMENT: </text>
SIGNIFICANT_WHITESPACE
END_ELEMENT: </element>
SIGNIFICANT_WHITESPACE
END_ELEMENT: </root>
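
As a follow-up sketch (not part of the original answer): if only the contents of <text/> matter, the same read() loop can probably be limited to that element by tracking XMLReader::depth. The $parts array and its layout are hypothetical choices made for illustration, and the sketch assumes only the first <text/> in the document is of interest.

```php
<?php
$xml = <<<'XML'
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;

$reader = new XMLReader();
$reader->XML($xml);

// Advance to the first <text> element.
$found = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'text') {
        $found = true;
        break;
    }
}

// Hypothetical result structure: one entry per direct child of <text>.
$parts = [];
if ($found && !$reader->isEmptyElement) {
    $textDepth = $reader->depth;
    while ($reader->read()) {
        // The closing </text> is reported at the same depth as <text>.
        if ($reader->nodeType === XMLReader::END_ELEMENT
            && $reader->depth === $textDepth) {
            break;
        }
        // Only handle direct children; deeper text is picked up by readString().
        if ($reader->depth !== $textDepth + 1) {
            continue;
        }
        if ($reader->nodeType === XMLReader::TEXT) {
            $parts[] = ['text', $reader->value];
        } elseif ($reader->nodeType === XMLReader::ELEMENT) {
            $parts[] = ['element', $reader->name, $reader->readString()];
        }
    }
}
$reader->close();

// Print the collected parts in document order.
foreach ($parts as $part) {
    echo implode(' | ', $part), PHP_EOL;
}
```

For the sample document this should leave five entries in $parts, alternating text and element, in the same left-to-right order as in the source.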

Other tips

You can use DOM + XPath for that, too. The following example iterates over all element nodes and over all text nodes that are not whitespace-only. The nice thing about this approach is that you can use any node as the context for further XPath expressions.

$xml = <<<'XML'
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$nodes = $xpath->evaluate(
 '//*|//text()[normalize-space(.) != ""]'
);

foreach ($nodes as $node) {
  switch ($node->nodeType) {
  case XML_ELEMENT_NODE :
    var_dump("ELEMENT: ".$node->localName);
    break;
  case XML_TEXT_NODE :
  case XML_CDATA_SECTION_NODE :
    var_dump("TEXT: ".$node->textContent);
    break;
  }
}

Output: https://eval.in/152418

string(13) "ELEMENT: root"
string(16) "ELEMENT: element"
string(13) "ELEMENT: text"
string(8) "TEXT: A "
string(10) "ELEMENT: x"
string(7) "TEXT: b"
string(9) "TEXT:  c "
string(10) "ELEMENT: y"
string(7) "TEXT: d"
string(9) "TEXT:  e."
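
To illustrate the context-node idea for the case in the question, here is a minimal sketch that starts from SimpleXML, imports only the <text/> element into DOM, and evaluates a relative XPath expression against it. The $parts array and its layout are hypothetical, chosen only for this example; the rest uses the same sample XML as above.

```php
<?php
$xml = <<<'XML'
<root>
  <element>
    <text>A <x>b</x> c <y>d</y> e.</text>
  </element>
</root>
XML;

// Parse with SimpleXML as in the question, then hand <text> over to DOM.
$sxml = simplexml_load_string($xml);
$textNode = dom_import_simplexml($sxml->element->text);

$xpath = new DOMXPath($textNode->ownerDocument);

// Relative expression: only the child elements and child text nodes
// of the context node, here <text>.
$parts = [];
foreach ($xpath->evaluate('*|text()', $textNode) as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $parts[] = ['element', $node->localName, $node->textContent];
    } else {
        $parts[] = ['text', $node->textContent];
    }
}

// Print the collected parts in document order.
foreach ($parts as $part) {
    echo implode(' | ', $part), PHP_EOL;
}
```

For this sample, iterating over $textNode->childNodes should give the same five nodes; the XPath form is mainly useful when the children need to be filtered further.
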
Licensed under: CC-BY-SA with attribution