Question

How do you deal with broken data in XML files? For example, suppose I have:

<text>Some &improper; text here.</text>

I'm trying to do:

 $doc = new DOMDocument();
 $doc->validateOnParse = false;
 $doc->formatOutput = false;
 $doc->load(...xml');

and it fails miserably, because there's an unknown entity. Note, I can't use CDATA due to the way the software is written. I'm writing a module which reads and writes XML, and sometimes the user inserts improper text.

I've noticed that DOMDocument->loadHTML() nicely encodes everything, but how could I continue from there?

Solution

Perhaps you can use preg_replace_callback to do the heavy lifting with entities for you:

http://php.net/manual/en/function.preg-replace-callback.php

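// Callback for preg_replace_callback(): it receives the array of matches,
// where $m[0] is the whole entity (e.g. "&amp;") and $m[1] is its name.
function fixEntities($m) {
    switch ($m[1]) {
        case 'amp':
        case 'lt':
        case 'gt':
        case 'quot':
        case 'apos': // the five entities predefined by XML
            return $m[0];
    }
    // Keep numeric character references such as &#38; or &#x26;.
    if (preg_match('/^#(\d+|x[0-9a-fA-F]+)$/', $m[1])) {
        return $m[0];
    }
    return ''; // drop anything else, e.g. &improper;
}
$xml = preg_replace_callback('/&([a-zA-Z0-9#]+);/', 'fixEntities', $xml);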
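
If silently dropping the unknown entity loses too much of the user's text, a variation is to escape its ampersand instead, so the text survives as literal characters. The following is only a sketch of that idea: the function name escapeUnknownEntities and the sample string are made up for illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper (not from the original answer): keep the five
// predefined XML entities and numeric references, escape everything else.
function escapeUnknownEntities(string $xml): string
{
    $known = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&([a-zA-Z0-9#]+);/',
        function (array $m) use ($known): string {
            $isNumeric = preg_match('/^#(\d+|x[0-9a-fA-F]+)$/', $m[1]) === 1;
            if (in_array($m[1], $known, true) || $isNumeric) {
                return $m[0];             // already valid XML, leave untouched
            }
            return '&amp;' . $m[1] . ';'; // "&improper;" becomes "&amp;improper;"
        },
        $xml
    );
}

$fixed = escapeUnknownEntities('<text>Some &improper; text &amp; more.</text>');

$doc = new DOMDocument();
$doc->loadXML($fixed);
echo $doc->documentElement->textContent, "\n"; // Some &improper; text & more.
```

Note that this pattern only matches things that look like entities; a bare ampersand such as the one in "AT&T" is not touched and would still break the parser.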

OTHER TIPS

Use htmlspecialchars to escape special XML characters before pushing the input into your XML/XHTML DOM. Although its name is prefixed with "html", the only characters it replaces (&, <, > and quotes) are the ones that need escaping in XML, so it is truly useful for XML data as well.
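
As a rough illustration of that tip, the sketch below escapes a piece of user input and checks that it comes back unchanged after parsing. The variable names and the sample input are assumptions made up for the example; the ENT_XML1 and ENT_QUOTES flags are optional but make the output use XML entity names and escape both kinds of quotes.

```php
<?php
// Hypothetical user input containing characters that are special in XML.
$userInput = 'Some &improper; text with <tags> & "quotes".';

// ENT_XML1 selects XML 1.0 entity names; ENT_QUOTES escapes both quote types.
$escaped = htmlspecialchars($userInput, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// Only the user-supplied text is escaped, never the surrounding markup.
$xml = '<text>' . $escaped . '</text>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// The parser decodes the entities again, so the original text round-trips.
var_dump($doc->documentElement->textContent === $userInput); // bool(true)
```

The escaping has to happen when the text is written into the XML; applying it to a whole document that is already broken would also escape the markup itself.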

If you are the one who writes the XML, there should be no problem, because you can encode any user input into entities before putting it into the XML.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow