iOS: HTML parsing - how to ignore tags like a, li, etc. within <p>
02-07-2021
Question
I am currently using Hpple to parse HTML, like so:
TFHpple *htmlParser = [TFHpple hppleWithHTMLData:[currentString dataUsingEncoding:NSUTF8StringEncoding]];
// Selects every text node that sits anywhere below a <p> element.
NSString *paragraphsXpathQuery = @"//p//text()";
NSArray *paragraphNodes = [htmlParser searchWithXPathQuery:paragraphsXpathQuery];
if ([paragraphNodes count] > 0) {
    NSMutableArray *tempArray = [NSMutableArray array];
    for (TFHppleElement *element in paragraphNodes) {
        // One array entry per text node, not per paragraph.
        [tempArray addObject:[element content]];
    }
    article.paragraphs = tempArray;
}
This way I get an array of paragraphs, and I can use NSString *result = [myArray componentsJoinedByString:@"\n\n"]; to compile it into a single body of text with line breaks.
However, if a paragraph contains nested tags, each piece of text inside or between those tags comes back as a separate array entry and gets its own line break, so from HTML like this:
<p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p>
<p>It was awesome.</p>
I get this:
I went to the
shop
to get some milk!
It was awesome.
And of course I would like to get this (ignoring other tags inside the p tag):
I went to the shop to get some milk!
It was awesome.
Can you help me out?
Solution
NSString *HTMLTags = @"<[^>]*>"; // regex that matches any HTML tag
NSString *htmlString = @"<html>bla bla</html>";
// Replace every match with an empty string, leaving only the text.
NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:HTMLTags withString:@""];
Don't forget to include this in your code: #import "RegexKitLite.h". Here is the link to download this API: http://regexkit.sourceforge.net/#Downloads
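If adding RegexKitLite is not an option, roughly the same tag stripping can be done with Foundation alone. This is only a minimal sketch, assuming an SDK where NSRegularExpressionSearch is available; StripHTMLTags is a made-up helper name for illustration, not part of any library:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answer): removes anything that
// looks like an HTML tag, using the same pattern as above.
static NSString *StripHTMLTags(NSString *html) {
    return [html stringByReplacingOccurrencesOfString:@"<[^>]*>"
                                           withString:@""
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, html.length)];
}
```

Called on the string <p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p>, this should give back the sentence in one piece. Keep in mind that a regex like this does not decode entities such as &amp; and can be fooled by a > inside an attribute value.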
OTHER TIPS
In XPath 1.0 you can do this in two steps:
1. Select all p elements: //p
2. On each selected p element (used as the initial context node), evaluate this: string()
Explanation:
By definition, the result of applying the standard XPath function string() to an element is the concatenation (in document order) of all of its text-node descendants.
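The two steps can be sketched directly against libxml2, the C library that Hpple wraps. This is an illustration under assumptions, not something from the answer itself: ParagraphStrings is a hypothetical helper name, and the project is assumed to link libxml2 and have its headers on the search path (the same setup Hpple needs). xmlXPathCastNodeToString is libxml2's function for computing a node's XPath string-value.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns the XPath string-value of every <p> element,
// i.e. the concatenation of all its descendant text nodes in document order.
static NSArray *ParagraphStrings(NSString *html) {
    NSMutableArray *paragraphs = [NSMutableArray array];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    htmlDocPtr doc = htmlReadMemory(data.bytes, (int)data.length, NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return paragraphs;
    }

    // Step 1: select all p elements.
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (context != NULL) {
        result = xmlXPathEvalExpression(BAD_CAST "//p", context);
    }

    // Step 2: take the string-value of each selected element.
    if (result != NULL && result->nodesetval != NULL) {
        for (int i = 0; i < result->nodesetval->nodeNr; i++) {
            xmlChar *text = xmlXPathCastNodeToString(result->nodesetval->nodeTab[i]);
            if (text != NULL) {
                NSString *paragraph = [NSString stringWithUTF8String:(const char *)text];
                if (paragraph != nil) {
                    [paragraphs addObject:paragraph];
                }
                xmlFree(text);
            }
        }
    }

    if (result != NULL) xmlXPathFreeObject(result);
    if (context != NULL) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return paragraphs;
}
```

With this, article.paragraphs = ParagraphStrings(currentString); should give one entry per paragraph, with the text of any nested a, li or similar tags kept inline, so joining with @"\n\n" only breaks between paragraphs.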