iOS: HTML parsing - how to ignore tags like a, li, etc. within <p>

https://stackoverflow.com/questions/12424920

02-07-2021

Question

I am currently using Hpple to parse HTML, like so:

TFHpple *htmlParser = [TFHpple hppleWithHTMLData:[currentString dataUsingEncoding:NSUTF8StringEncoding]];
NSString *paragraphsXpathQuery = @"//p//text()";
NSArray *paragraphNodes = [htmlParser searchWithXPathQuery:paragraphsXpathQuery];
if ([paragraphNodes count] > 0) {
    NSMutableArray *tempArray = [NSMutableArray array];
    for (TFHppleElement *element in paragraphNodes) {
        [tempArray addObject:[element content]];
    }
    article.paragraphs = tempArray;
}

This way I get an array of paragraphs, and I can use NSString *result = [myArray componentsJoinedByString:@"\n\n"]; to combine it into a single body of text with line breaks.

However, if a paragraph contains nested tags, every text node inside it is returned as a separate entry and ends up on its own line. So from markup like this:

<p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p>
<p>It was awesome!</p>

I get this:

I went to the

shop

to get some milk!

It was awesome!

And of course I would like to get this (ignore other tags inside the p tag):

I went to the shop to get some milk!

It was awesome!

Can you help me out?

Solution

NSString *HTMLTags = @"<[^>]*>"; // regex that matches any HTML tag

NSString *htmlString = @"<html>bla bla</html>";
// stringByReplacingOccurrencesOfRegex:withString: is provided by RegexKitLite
NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:HTMLTags withString:@""];

Don't forget to add #import "RegexKitLite.h" to your code. RegexKitLite can be downloaded here: http://regexkit.sourceforge.net/#Downloads
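
If adding a third-party dependency is not desirable, the same tag-stripping idea can probably be expressed with Foundation alone. The following is only a sketch: StripTags is a hypothetical helper name, and it assumes an OS version where NSString supports the NSRegularExpressionSearch option.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any library): removes everything that
// looks like an HTML tag, using the same pattern as the answer above.
static NSString *StripTags(NSString *html) {
    return [html stringByReplacingOccurrencesOfString:@"<[^>]*>"
                                           withString:@""
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, html.length)];
}
```

Note that stripping tags from the whole document also removes the p boundaries, so this would have to be applied to the markup of each paragraph separately. A regex also does not decode entities such as &amp;.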

OTHER TIPS

In XPath 1.0 you can do this in two steps:

Select all p elements: //p
On each selected p element (used as the initial context node) evaluate this: string()

Explanation:

By definition, the result of applying the standard XPath function string() to an element is the concatenation (in document order) of all of its text-node descendants.
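
The two steps above can be sketched directly against libxml2, the C library that Hpple wraps. This is an illustration under assumptions rather than Hpple's own API: ParagraphStrings is a hypothetical helper name, and the project is assumed to already have the libxml2 headers and library configured, as Hpple itself requires. xmlXPathCastNodeToString returns the XPath string-value of a node, which is what string() yields for an element.

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns the string-value of every <p> element.
static NSArray *ParagraphStrings(NSString *html) {
    NSMutableArray *paragraphs = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return paragraphs;
    }

    // Step 1: select all p elements.
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression((const xmlChar *)"//p", context);
    }

    // Step 2: take the string-value of each selected element, i.e. the
    // concatenation of its text-node descendants in document order.
    if (result != NULL && result->nodesetval != NULL) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlChar *text = xmlXPathCastNodeToString(result->nodesetval->nodeTab[i]);
            if (text != NULL) {
                NSString *paragraph = [NSString stringWithUTF8String:(const char *)text];
                if (paragraph != nil) {
                    [paragraphs addObject:paragraph];
                }
                xmlFree(text);
            }
        }
    }

    if (result != NULL) xmlXPathFreeObject(result);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return paragraphs;
}
```

The returned array can then be joined as in the question, for example with [ParagraphStrings(currentString) componentsJoinedByString:@"\n\n"].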

Licensed under: CC-BY-SA with attribution
