Question

I have HTML stored in an NSString. I download it from the internet and parse it using NSXMLParser. However, the parser seems to have problems with entities such as ó, „, ’ etc. This is quite a big problem, actually, because it just reports a failure and stops parsing any further.

I found some good solutions for this in other topics here on Stack Overflow, but they recommend using the NSString + HTML category or Google Toolbox for Mac (the NSString category uses GTM). I already have projects that used GTM, and it made running my app on the iOS Simulator impossible, so I'd like to avoid that.

Solution

It’s probably easiest just to write your own method for this, using NSScanner. Note that the entity syntax is a little more complex than just a list of replacements; namely, you will need to support:

  • &#D; where D is a decimal number
  • &#xH; where H is a hexadecimal number (upper and lower case are both OK)

and then you’ll need a table of mappings for the named entities (there’s a list in the HTML 4 specification).
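To make the two numeric forms concrete, here is a minimal sketch in plain C (which compiles unchanged inside an Objective-C file). The function name parse_numeric_reference is hypothetical and not part of any library. It assumes the caller has already consumed the leading & and passes the text that follows; it also accepts an uppercase X, which is an assumption about how lenient you want to be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: s points just past the '&' of a candidate reference,
 * e.g. "#243;" or "#xF3;". On success it stores the code point and returns
 * true; on any syntax or range error it returns false. */
static bool parse_numeric_reference(const char *s, uint32_t *code_point)
{
    assert(s != NULL && code_point != NULL);

    if (s[0] != '#')
        return false;                       /* not a numeric reference */

    const char *digits = s + 1;
    uint32_t base = 10;
    if (*digits == 'x' || *digits == 'X') { /* hexadecimal form */
        base = 16;
        digits++;
    }

    uint32_t value = 0;
    size_t count = 0;
    for (;; count++) {
        char c = digits[count];
        uint32_t d;
        if (c >= '0' && c <= '9')
            d = (uint32_t)(c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f')
            d = (uint32_t)(c - 'a') + 10;
        else if (base == 16 && c >= 'A' && c <= 'F')
            d = (uint32_t)(c - 'A') + 10;
        else
            break;                          /* first non-digit ends the number */
        value = value * base + d;
        if (value > 0x10FFFF)
            return false;                   /* beyond the Unicode range */
    }

    if (count == 0 || digits[count] != ';')
        return false;                       /* no digits, or unterminated */
    if (value >= 0xD800 && value <= 0xDFFF)
        return false;                       /* surrogates are not characters */

    *code_point = value;
    return true;
}
```

With this sketch, "#243;" and "#xF3;" both yield 0xF3 (ó), while "#243" (no semicolon) and "#xD800;" (a surrogate) are rejected.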

Here’s some code (written in Stack Overflow, untested) to get you started:

static NSDictionary *entityDict;

if (!entityDict)
  entityDict = loadEntityMappingTable();

NSScanner *scanner = [NSScanner scannerWithString:myHTMLString];
NSMutableString *result = [NSMutableString string];

[scanner setCharactersToBeSkipped:nil]; // Don’t skip whitespace

while (![scanner isAtEnd]) {
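  NSString *chunk, *name;

  // Copy everything up to the next "&" straight into the result
  if ([scanner scanUpToString:@"&" intoString:&chunk])
    [result appendString:chunk];

  // Consume the "&" itself; if there isn't one, we have reached the end
  if (![scanner scanString:@"&" intoString:NULL])
    break;

  if ([scanner scanString:@"#" intoString:NULL]) {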
    unsigned uch;
    NSUInteger scanLoc;
    BOOL hex = NO;

    // This is a numeric reference
    if ([scanner scanString:@"x" intoString:NULL]) {
      hex = YES;
      scanLoc = [scanner scanLocation];
      if (![scanner scanHexInt:&uch]) {
        // If we fail, show the entire thing in the result string
        [result appendString:@"&#x"];
        continue;
      }
    } else {
      int ich;
      scanLoc = [scanner scanLocation];
      if (![scanner scanInt:&ich]) {
        // If we fail, show the entire thing
        [result appendString:@"&#"];
        continue;
      }

      if (ich < 0) {
        // Bad Unicode code point
        [result appendString:@"&#"];
        [scanner setScanLocation:scanLoc];
        continue;
      }

      uch = (unsigned)ich;
    }

    // You may also care to prohibit control codes (depending on your application)
    // i.e. uch < 0x20 || (uch >= 0x7f && uch < 0xa0)

    if ((uch >= 0xd800 && uch <= 0xdfff) || uch > 0x10ffff) {
      // Bad Unicode code point; show it in the result
      [result appendString:hex ? @"&#x" : @"&#"];
      [scanner setScanLocation:scanLoc];
      continue;
    }

    if (![scanner scanString:@";" intoString:NULL]) {
      // Unterminated; show it in the result
      [result appendString:hex ? @"&#x" : @"&#"];
      [scanner setScanLocation:scanLoc];
      continue;
    }

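    if (uch <= 0xffff)
      [result appendFormat:@"%C", (unichar)uch];
    else {
      unichar lo, hi;

      // Code points above U+FFFF need a UTF-16 surrogate pair:
      // subtract 0x10000, then split the remaining 20 bits in two
      uch -= 0x10000;
      hi = 0xd800 | (uch >> 10);
      lo = 0xdc00 | (uch & 0x3ff);

      [result appendFormat:@"%C%C", hi, lo];
    }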

    continue;
  }

  if ([scanner scanUpToString:@";" intoString:&name]) {
    NSString *ch;

    if (![scanner scanString:@";" intoString:NULL]) {
      // Unterminated; show it in the result
      [result appendFormat:@"&%@", name];
      continue;
    }

    // Entity names are case-sensitive (&Oacute; and &oacute; are different)
    ch = [entityDict objectForKey:name];

    if (!ch) {
      // Unrecognised; show it in the result
      [result appendFormat:@"&%@;", name];
      continue;
    }

    [result appendString:ch];
  }
}

Stick that in a function or method somewhere, implement loadEntityMappingTable() to initialise the dictionary of mappings and it should work.
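As a starting point for the data behind loadEntityMappingTable(), here is a rough sketch in plain C, so it can live in the same Objective-C file. The names entity_t, kEntities and lookup_entity are hypothetical, and only a handful of entries are shown; the full list is in the HTML 4 specification. It stores code points rather than strings and matches names exactly, including case, because HTML defines both &Oacute; and &oacute;.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: entity name (without '&' and ';') and the
 * Unicode code point it stands for. */
typedef struct {
    const char *name;
    uint32_t    code_point;
} entity_t;

/* A small excerpt only; extend it from the HTML 4 entity list. */
static const entity_t kEntities[] = {
    { "amp",    0x0026 },  /* & */
    { "lt",     0x003C },  /* < */
    { "gt",     0x003E },  /* > */
    { "quot",   0x0022 },  /* " */
    { "nbsp",   0x00A0 },  /* no-break space */
    { "copy",   0x00A9 },  /* copyright sign */
    { "Oacute", 0x00D3 },  /* Ó */
    { "oacute", 0x00F3 },  /* ó */
    { "rsquo",  0x2019 },  /* ’ */
    { "bdquo",  0x201E },  /* „ */
    { "hellip", 0x2026 },  /* … */
    { "euro",   0x20AC },  /* € */
};

/* Returns the code point for a named entity, or 0 if the name is unknown.
 * The comparison is case-sensitive on purpose. A linear search is fine for
 * a sketch; the full table would be better served by bsearch or a hash. */
static uint32_t lookup_entity(const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; i < sizeof kEntities / sizeof kEntities[0]; i++) {
        if (strcmp(kEntities[i].name, name) == 0)
            return kEntities[i].code_point;
    }
    return 0;
}
```

Each code point could be converted to an NSString when filling the dictionary, or lookup_entity could be called directly in place of the dictionary lookup; which is nicer depends on how the rest of your code is organised.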

FWIW, this same general approach, using a loop and an NSScanner, is easy to apply to lots of similar problems that in scripting languages might be dealt with using regular expression matching.

Licensed under: CC-BY-SA with attribution