Question

I'm trying to parse an XML string that contains the characters &, < and > in its text data. Normally those characters should be encoded as entities, but in my case they aren't, so I get the following warnings:

Warning: DOMDocument::loadXML() [function.loadXML]: error parsing attribute name in Entity ...
Warning: DOMDocument::loadXML() [function.loadXML]: Couldn't find end of Start Tag ...

I can use str_replace to encode all the &, but if I do that with < or >, I end up encoding the valid XML tags too.

Does anyone know a workaround for this problem?

Thank you!


Solution

If you have a raw < inside the text of an XML document, it's not valid XML. Either encode it as &lt; or enclose the text in a <![CDATA[ ... ]]> section.
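If you control the code that produces the XML, encoding the text before it is embedded might look like the sketch below. The helper name xmlElement is hypothetical and only for illustration. It assumes UTF-8 input and an element name that is already a valid XML name.

```php
<?php
// Hypothetical helper: wraps raw text in an element, escaping XML special characters.
function xmlElement(string $name, string $text): string
{
    // ENT_XML1 selects the XML 1.0 entity set; ENT_QUOTES also escapes ' and ".
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
    return "<{$name}>{$escaped}</{$name}>";
}

$xml = xmlElement('blah', 'x & y < 3');
echo $xml, "\n"; // <blah>x &amp; y &lt; 3</blah>

// The escaped string now parses, and the parser hands back the original text.
$dom = new DOMDocument();
$dom->loadXML($xml);
echo $dom->documentElement->textContent, "\n"; // x & y < 3
```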

If that's not possible (because you're not the one producing this "XML"), I'd suggest trying an HTML parsing library (I haven't used them, but they exist), because they're less strict than XML parsers.

But I'd really try to get valid XML before trying anything else!

OTHER TIPS

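I often put @ in front of calls to load() on DOMDocument, mainly because you can never be absolutely sure that what you load is what you expected.

Using @ suppresses the warnings. It does not repair the input: loadXML() still returns false when the XML is not well-formed.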

@$dom->loadXml($myXml);
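An alternative to @ is to let libxml collect the errors so they can be inspected instead of being discarded. This is a minimal sketch. The helper name tryLoadXml is hypothetical, while libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are standard PHP functions.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null when the XML is not well-formed.
// Parser messages are written to $errors instead of being emitted as warnings.
function tryLoadXml(string $xml, array &$errors = []): ?DOMDocument
{
    $previous = libxml_use_internal_errors(true); // collect errors internally
    $dom = new DOMDocument();
    $ok = $dom->loadXML($xml);
    $errors = array_map(fn($e) => trim($e->message), libxml_get_errors());
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting
    return $ok ? $dom : null;
}

$errors = [];
$dom = tryLoadXml('<blah>x & y < 3</blah>', $errors);
if ($dom === null) {
    echo "Parse failed:\n", implode("\n", $errors), "\n";
}
```

This keeps the output clean, as @ does, but the caller still finds out that parsing failed and why.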

"I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too."

As a strictly temporary fix-up measure, you can escape the ones that aren't part of what looks like a tag or entity reference, e.g.:

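// Escape any < that does not look like the start of a tag, closing tag, comment or PI.
$str = preg_replace('~<(?![a-zA-Z_/!?])~', '&lt;', $str);
// Escape any & that does not start a named or numeric entity reference.
$str = preg_replace('~&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)~', '&amp;', $str);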

However, this isn't watertight: text such as a<b still looks like the start of a tag and is left alone. In the longer term you need to fix whatever is generating this bogus markup, or shout at the person who needs to fix it until they get a clue. Grossly non-well-formed XML like this is simply not XML by definition.
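Put together, the fix-up followed by parsing might look like the following sketch. It uses a variant of the two patterns, wrapped in ~ delimiters (preg_replace requires delimiters) and with / added to the first lookahead so that closing tags such as </blah> are left intact. The function name escapeStrayMarkup is hypothetical.

```php
<?php
// Hypothetical helper: escapes < and & that do not look like markup.
function escapeStrayMarkup(string $xml): string
{
    // A < not followed by a name character, /, ! or ? cannot start a tag.
    $xml = preg_replace('~<(?![a-zA-Z_/!?])~', '&lt;', $xml);
    // A & not followed by "name;", "#digits;" or "#xhex;" is not an entity reference.
    $xml = preg_replace('~&(?!(?:[a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)~', '&amp;', $xml);
    return $xml;
}

$raw = '<blah>x & y < 3</blah>';
$fixed = escapeStrayMarkup($raw);
echo $fixed, "\n"; // <blah>x &amp; y &lt; 3</blah>

$dom = new DOMDocument();
$dom->loadXML($fixed);
echo $dom->documentElement->textContent, "\n"; // x & y < 3
```

Existing entity references such as &amp; are not double-escaped, because the second lookahead skips them.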

Put all your text inside CDATA sections?

<!-- Old -->
<blah>
    x & y < 3
</blah>

<!-- New -->
<blah><![CDATA[
    x & y < 3
]]></blah>
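If the XML is generated from PHP, the DOM extension can emit the CDATA section for you. This is a sketch under two assumptions: the function name cdataElement is hypothetical, and the text is assumed not to contain the sequence ]]>, which would end the section early.

```php
<?php
// Hypothetical helper: builds <name><![CDATA[text]]></name> through the DOM API.
function cdataElement(string $name, string $text): string
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $element = $dom->createElement($name);
    $element->appendChild($dom->createCDATASection($text));
    $dom->appendChild($element);
    // Passing the node serializes only that element, without the XML declaration.
    return $dom->saveXML($element);
}

$xml = cdataElement('blah', 'x & y < 3');
echo $xml, "\n"; // <blah><![CDATA[x & y < 3]]></blah>

// Reading it back yields the raw text unchanged.
$check = new DOMDocument();
$check->loadXML($xml);
echo $check->documentElement->textContent, "\n"; // x & y < 3
```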
Licensed under: CC-BY-SA with attribution