Is this XPath query for parsing XHTML with TouchXML wrong?

https://stackoverflow.com/questions/7038667

22-12-2020

Question

I have been trying to parse an XHTML doc with TouchXML, but my XPath queries never find any tags.

Below is the XHTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta name="generator" content=
         "HTML Tidy for Mac OS X (vers 25 March 2009), see www.w3.org" />
      <title></title>
      </head>
   <body>
      <p>
          <a href="http://www.flickr.com/photos/55397648@N00/5987335786/"
             title="casavermeer5.jpg by the style files, on Flickr">
          <img src="http://farm7.static.flickr.com/6127/5987335786_abec990554_o.jpg"
               width="500" height="750" border="0" alt="casavermeer5.jpg" />
          </a>
      </p>
   </body>
</html>

So we can see there is a "p" tag, an "a" tag, and an "img" tag.

What I did next is shown in the code below:

NSError *error = nil;
CXHTMLDocument *doc = [[[CXHTMLDocument alloc] initWithXHTMLString:XHTML options:0 error:&error] autorelease];
NSLog(@"error %@", [error localizedDescription]);
NSLog(@"doc children count = %lu", (unsigned long)[doc childCount]);
// Query every img element anywhere in the document
NSArray *imgNodeArray = [doc nodesForXPath:@"//img" error:&error];
NSLog(@"imgNodeArray = %lu", (unsigned long)[imgNodeArray count]);
NSLog(@"error %@", [error localizedDescription]);

The results are:

error (null)
doc children count = 2
imgNodeArray = 0
error (null)

So there is no error at all in parsing the XHTML doc, and no error from the XPath query. The doc also has two children under the root (the "body" tag and the "head" tag). The problem is that it cannot find the "img" tag. I have tried replacing "img" with other tag names (such as p, a, even body and head), with no luck at all.

Can someone help me here?

P.S.

The original doc is actually HTML. I used the CTidy class in the TouchXML lib to tidy the HTML into XHTML first, and the XHTML above is the CTidy output.

I also tried adding a namespace mapping to the XPath query, like this:

NSMutableDictionary *namespaceDict = [NSMutableDictionary dictionary];
[namespaceDict setValue:@"http://www.w3.org/1999/xhtml" forKey:@"xhtml"];

Then I changed the XPath query to:

NSArray *imgNodeArray = [doc nodesForXPath:@"//xhtml:img" namespaceMappings:namespaceDict error:&error];

Still no luck; the query returns no results.

Solution

Try //img. With //, the query matches the img tag no matter where it is in the page.
This is better than //xhtml:img because the tag hierarchy sometimes changes a bit behind the scenes, so it is better to keep the query global and not too specific.
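
If a plain //img still returns nothing, it may be worth ruling out the default XHTML namespace. In XPath 1.0 an unprefixed name only matches elements in no namespace. A namespace-agnostic query can be sketched as below. This is a minimal sketch, assuming TouchXML's underlying XPath engine supports the standard local-name() function. FindImagesIgnoringNamespace and LogImageSources are hypothetical helper names, not part of TouchXML.

```objc
#import <Foundation/Foundation.h>
#import "TouchXML.h"

// Hypothetical helper: matches every element whose local name is "img",
// whatever namespace (if any) it belongs to.
static NSArray *FindImagesIgnoringNamespace(CXMLNode *doc, NSError **outError) {
    return [doc nodesForXPath:@"//*[local-name()='img']" error:outError];
}

// Hypothetical usage: log the src attribute of each match.
static void LogImageSources(CXMLNode *doc) {
    NSError *error = nil;
    NSArray *imgs = FindImagesIgnoringNamespace(doc, &error);
    if (imgs == nil) {
        NSLog(@"XPath error: %@", [error localizedDescription]);
        return;
    }
    for (CXMLElement *img in imgs) {
        NSLog(@"src = %@", [[img attributeForName:@"src"] stringValue]);
    }
}
```

Like //img, this stays global and does not depend on where the element sits in the hierarchy.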

OTHER TIPS

I once had a similar problem that might help you. I had a document that I would parse to find certain landmarks and record their XPaths. Then I would load the document into a UIWebView and run JavaScript to perform actions on the elements I had previously marked. The problem was that the DOM structure was completely different after parsing the document, so all my XPaths were invalid. One particular case related to tables.

<table>
    <tr>
        <td>Cell</td>
    </tr>
</table>

The simple HTML above would always be converted to something like below. (The white space is only for readability and I'm going from memory.)

<table>
    <thead></thead>
    <tbody>
        <tr>
            <td>Cell</td>
        </tr>
    </tbody>
</table>

My point with this is that your parser may have injected elements into your HTML structure.
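
One way to check for that is to dump the tree the parser actually built and compare it with the markup you fed in. This is a rough sketch, assuming the name and children accessors on TouchXML's CXMLNode. DumpNode is a hypothetical helper name.

```objc
#import <Foundation/Foundation.h>
#import "TouchXML.h"

// Hypothetical helper: logs each node's name, indented two spaces per
// level of depth, so injected or missing elements stand out.
static void DumpNode(CXMLNode *node, NSUInteger depth) {
    NSString *indent = [@"" stringByPaddingToLength:depth * 2
                                         withString:@" "
                                    startingAtIndex:0];
    NSLog(@"%@%@", indent, [node name]);
    // Iterating a nil children array is a no-op, so leaf nodes end the recursion.
    for (CXMLNode *child in [node children]) {
        DumpNode(child, depth + 1);
    }
}
```

Calling DumpNode(doc, 0) on the parsed document should show whether elements such as img are present and under which parents.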

Licensed under: CC-BY-SA with attribution
