PHP DomDocument XML Load with Broken XML Data
Question
How do you deal with broken data in XML files? For example, suppose I have:
<text>Some &improper; text here.</text>
I'm trying to do:
$doc = new DOMDocument();
$doc->validateOnParse = false;
$doc->formatOutput = false;
$doc->load(...xml');
and it fails miserably, because there's an unknown entity. Note, I can't use CDATA due to the way the software is written. I'm writing a module which reads and writes XML, and sometimes the user inserts improper text.
I've noticed that DOMDocument->loadHTML() nicely encodes everything, but how could I continue from there?
Solution
Perhaps you can use preg_replace_callback (http://php.net/manual/en/function.preg-replace-callback.php) to do the heavy lifting with entities for you:
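function fixEntities(array $matches) {
    // preg_replace_callback passes an array of matches:
    // $matches[0] is the whole entity (e.g. "&amp;"), $matches[1] is its name.
    switch ($matches[1]) {
        case 'amp':
        case 'lt':
        case 'gt':
        case 'quot':
        case 'apos': // the five entities predefined by XML
            return $matches[0];
    }
    // Keep numeric character references such as &#160; or &#xA0;
    if (preg_match('/^#(\d+|x[0-9a-fA-F]+)$/', $matches[1])) {
        return $matches[0];
    }
    return ''; // drop any other (unknown) entity
}
$xml = preg_replace_callback('/&([a-zA-Z0-9#]*);/', 'fixEntities', $xml);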
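The callback above deletes any entity it does not recognize, so the user's text is lost. A possible variation, if you would rather keep that text, is to escape the stray ampersand instead. The sketch below is only an illustration: escapeStrayAmpersands is a hypothetical helper name, and it assumes the document declares no custom entities in a DTD and has no CDATA sections or comments containing ampersands, since the regex would rewrite those as well.

```php
<?php
// Hypothetical helper: turn every "&" that does not start one of the five
// predefined XML entities or a numeric character reference into "&amp;".
function escapeStrayAmpersands(string $xml): string
{
    return preg_replace(
        '/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)/',
        '&amp;',
        $xml
    );
}

$xml = '<text>Some &improper; text &amp; more here.</text>';
$fixed = escapeStrayAmpersands($xml);

$doc = new DOMDocument();
$doc->loadXML($fixed);

// The unknown entity survives as literal text in the parsed node.
echo $doc->documentElement->textContent, "\n";
```

With this approach the text node keeps "&improper;" as plain characters rather than dropping it.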
OTHER TIPS
Use htmlspecialchars to escape the special XML characters before pushing the input into your XML/XHTML DOM. Although its name is prefixed with "html", the only characters it replaces (&, <, > and quotes) are the ones that must be escaped in XML, so it is truly useful for XML data as well.
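As a minimal sketch of that tip, the snippet below escapes a piece of user input and checks that it round-trips through DOMDocument. The $userInput value is made up for illustration; the ENT_XML1 flag asks htmlspecialchars to follow XML 1.0 rules, and ENT_QUOTES makes it escape single quotes too.

```php
<?php
// Hypothetical user input containing characters that are special in XML.
$userInput = 'Some &improper; text with <tags> & "quotes"';

// Escape &, <, > and both kinds of quotes using XML 1.0 rules.
$escaped = htmlspecialchars($userInput, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$xml = '<text>' . $escaped . '</text>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// The parsed text node should hold the original input, unchanged.
var_dump($doc->documentElement->textContent === $userInput);
```

Because the ampersand in "&improper;" is itself escaped, the parser never sees an unknown entity.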
If you are the one who writes the XML, there should be no problem, as you can encode any user input into entities before putting it into the XML.
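One way to get that encoding on the writing side without calling an escaping function by hand is to build the document with DOMDocument and add user input through createTextNode, which takes raw text and leaves the escaping to the serializer. This is a sketch of that alternative, not necessarily what the answer had in mind; the element name and input string are taken from the question. Note that the PHP manual warns that the value argument of createElement is not escaped, which is why the text node is created separately.

```php
<?php
$userInput = 'Some &improper; text here.';

$doc = new DOMDocument('1.0', 'UTF-8');
$text = $doc->createElement('text');

// createTextNode() accepts raw text; "&" is escaped when the node is saved.
$text->appendChild($doc->createTextNode($userInput));
$doc->appendChild($text);

// Serialize just the <text> element.
echo $doc->saveXML($text), "\n";
```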