Parsing problematic XML in QueryPath (dots in elements)

https://stackoverflow.com/questions/6351004

28-10-2019

Question

I am trying to parse a NewsML document (http://www.iptc.org/std/NewsML-G2/2.7/examples/LISTING2_NewsML-G2_Complete.xml) with QueryPath, but I have trouble with the dots in some element names, like <body.head>.

In some Firefox QueryPath plugins I am able to escape the dot with a backslash, but in the PHP PEAR library this does not work.

Any ideas?

(I am looking for a solution within QueryPath, not for workarounds.)

Solution

In the past, I've used the Tidy PHP extension (http://us3.php.net/manual/en/book.tidy.php) to clean up HTML/XML before passing it into QueryPath.

The XML you referenced above is pretty clean, and also pretty small.

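If the only issue is dots in element names, preprocessing with a regular expression would probably work, too, and it would be the fastest solution. I'm guessing you could do a preg_replace('#(</?)body\.#', '$1body-', $xml) and have it fixed. (PHP's preg_replace replaces every match by default, so there is no g modifier to add. That would replace body.content with body-content and so on, in both opening and closing tags; you would then select body-content instead of body.content in QueryPath.)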
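
Here is a minimal sketch of that preprocessing step, so the idea can be tried without QueryPath installed. The helper name replaceBodyDots and the sample XML are made up for illustration, and it assumes the only dotted elements start with "body.".

```php
<?php
// Hypothetical helper (not part of QueryPath): renames <body.xxx> and
// </body.xxx> tags to <body-xxx> so that CSS-style selectors can match them.
function replaceBodyDots(string $xml): string
{
    // (</?) captures "<" or "</", so opening and closing tags are both renamed.
    // preg_replace() replaces all matches by default; PHP has no "g" modifier.
    return preg_replace('#(</?)body\.#', '$1body-', $xml);
}

// Made-up sample input, loosely modelled on the element names in the question.
$xml = '<doc><body.head><title>Hi</title></body.head>'
     . '<body.content>Text</body.content></doc>';

$clean = replaceBodyDots($xml);
echo $clean, PHP_EOL;
```

The cleaned string can then be handed to QueryPath and queried with selectors such as body-head. Note that this is a plain text substitution, so it would also rewrite a literal "<body." inside a CDATA section.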

Licensed under: CC-BY-SA with attribution
