Question

Can anyone recommend a C or Objective-C library for HTML parsing? It needs to handle messy HTML code that won't quite validate.

Does such a library exist, or am I better off just trying to use regular expressions?

Was it helpful?

Solution 2

Looks like libxml2.2 comes in the SDK, and libxml/HTMLparser.h claims the following:

This module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

That sounds like what I need, so I'm probably going to use that.

OTHER TIPS

I found using hpple quite useful to parse messy HTML. Hpple project is a Objective-C wrapper on the XPathQuery library for parsing HTML. Using it you can send an XPath query and receive the result .

Requirements:

-Add libxml2 includes to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Header Search Paths"
  3. Add a new search path "${SDKROOT}/usr/include/libxml2"
  4. Enable recursive option

-Add libxml2 library to to your project

  1. Menu Project->Edit Project Settings
  2. Search for setting "Other Linker Flags"
  3. Add a new search flag "-lxml2"

-From hpple get the following source code files an add them to your project:

  1. TFpple.h
  2. TFpple.m
  3. TFppleElement.h
  4. TFppleElement.m
  5. XPathQuery.h
  6. XPathQuery.m

-Take a walk on w3school XPath Tutorial to feel comfortable with the XPath language.

Code Example

#import "TFHpple.h"

NSData *data = [[NSData alloc] initWithContentsOfFile:@"example.html"];

// Create parser
xpathParser = [[TFHpple alloc] initWithHTMLData:data];

//Get all the cells of the 2nd row of the 3rd table 
NSArray *elements  = [xpathParser searchWithXPathQuery:@"//table[3]/tr[2]/td"];

// Access the first cell
TFHppleElement *element = [elements objectAtIndex:0];

// Get the text within the cell tag
NSString *content = [element content];  

[xpathParser release];
[data release];

Known issues

As hpple is a wrapper over XPathQuery which is another wrapper, this option probably is not the most efficient. If performance is an issue in your project, I recommend to code your own lightweight solution based on hpple and xpathquery library code.

Just in case anyone has got here by googling for a nice XPath parser and gone off and used TFHpple, Note that TFHpple uses XPathQuery. This is pretty good, but has a memory leak.

In the function *PerformXPathQuery, if the nodes are found to be nil, it jumps out before cleaning up.

So where you see this bit of code: Add in the two cleanup lines.

  xmlNodeSetPtr nodes = xpathObj->nodesetval;
  if (!nodes)
    {
      NSLog(@"Nodes was nil.");
        /* Cleanup */
        xmlXPathFreeObject(xpathObj);
        xmlXPathFreeContext(xpathCtx);
      return nil;
    }

If you are doing a LOT of parsing, it's a vicious leak. Now.... how do I get my night back :-)

I wrote a lightweight wrapper around libxml which maybe useful:

Objective-C-HMTL-Parser

This probably depends on how messy the HTML is and what you want to extract. But usually Tidy does quite a good job. It is written in C and I guess you should be able to build and statically link it for the iPhone. You can easily install the command line version and test the results first.

You may want to check out ElementParser. It provides "just enough" parsing of HTML and XML. Nice interfaces make walking around XML / HTML documents very straightforward. http://touchtank.wordpress.com/

How about using the Webkit component, and possibly third party packages such as jquery for tasks such as these? Wouldn't it be possible to fetch the html data in an invisible component and take advantage of the very mature selectors of the javascript frameworks?

Google's GData Objective-C API reimplements NSXMLElement and other related classes that Apple removed from the iPhone SDK. You can find it here http://code.google.com/p/gdata-objectivec-client/. I've used it for dealing messaging via Jabber. Of course if your HTML is malformed (missing closing tags) this might not help much.

We use Convertigo to parse HTML on the server side and return a clean and neat JSON web services to our Mobile Apps

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top