It's best not to rely on Cocoa to figure out the string encoding if possible, especially if the data might be corrupted. A better approach would be to check if the value indicated by the HTTP Content-Type header specifies a character set like in this example:
Content-Type: text/html; charset=ISO-8859-4
Once you're able to parse and retrieve a character set name from the Content-Type header, you need to convert it to an NSStringEncoding, first by passing it to CFStringConvertIANACharSetNameToEncoding, and then passing the returned CF string encoding to CFStringConvertEncodingToNSStringEncoding. After that, you can initialize your string using -[NSString initWithData:encoding:].
NSData *HTTPResponseBody = …; // Get the HTTP response body
NSString *charSetName = …; // Get a charset name from the Content-Type HTTP header
NSString *JSON = nil; // Declared here so it is still in scope after the check below

// Get the Core Foundation string encoding (the __bridge cast is required under ARC)
CFStringEncoding cfencoding = CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charSetName);

// Confirm this is a known encoding
if (cfencoding != kCFStringEncodingInvalidId) {
    // Initialize the string; the result is nil if the data is not valid in this encoding
    NSStringEncoding nsencoding = CFStringConvertEncodingToNSStringEncoding(cfencoding);
    JSON = [[NSString alloc] initWithData:HTTPResponseBody encoding:nsencoding];
}
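The snippet above leaves open how the charset name is pulled out of the header. Here is a minimal sketch of one way to do it, assuming you already have the raw Content-Type value as an NSString. The helper name CharSetNameFromContentType is made up for illustration, and it does not handle every corner of the header grammar (for example, a quoted parameter value that itself contains a semicolon).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the charset parameter of a Content-Type
// header value, or nil if the header does not specify one.
static NSString *CharSetNameFromContentType(NSString *contentType) {
    if (contentType == nil) {
        return nil;
    }
    // Parameters follow the media type, separated by semicolons
    NSArray *parts = [contentType componentsSeparatedByString:@";"];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSCharacterSet *quotes = [NSCharacterSet characterSetWithCharactersInString:@"\"'"];
    // Index 0 is the media type itself (e.g. text/html), so start at 1
    for (NSUInteger i = 1; i < [parts count]; i++) {
        NSString *param = [[parts objectAtIndex:i] stringByTrimmingCharactersInSet:whitespace];
        NSRange equals = [param rangeOfString:@"="];
        if (equals.location == NSNotFound) {
            continue;
        }
        // Parameter names are case-insensitive
        NSString *name = [[param substringToIndex:equals.location] stringByTrimmingCharactersInSet:whitespace];
        if ([name caseInsensitiveCompare:@"charset"] != NSOrderedSame) {
            continue;
        }
        // Drop surrounding whitespace and optional quotes from the value
        NSString *value = [[param substringFromIndex:NSMaxRange(equals)] stringByTrimmingCharactersInSet:whitespace];
        value = [value stringByTrimmingCharactersInSet:quotes];
        return [value length] > 0 ? value : nil;
    }
    return nil;
}
```

If the response comes from the URL loading system, -[NSURLResponse textEncodingName] should give you the same name without any manual parsing, and it is nil when the server did not report one.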
You may still run into problems if the string data you're working with is corrupted. For example, in the above code snippet, perhaps charSetName is UTF-8, but HTTPResponseBody can't be parsed as UTF-8 because it contains an invalid byte sequence. In this situation, Cocoa will return nil when you try to instantiate your string. Short of sanitizing the data so that it conforms to the reported string encoding (perhaps by stripping out invalid byte sequences), you may want to report an error back to the end user.
As a last-ditch effort, rather than reporting an error, you could initialize a string using an encoding that can handle anything you throw at it, such as NSMacOSRomanStringEncoding. The one caveat here is that multi-byte Unicode characters and corrupted data may show up as symbols or unexpected alphanumerics.
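Putting the two strategies together could look roughly like the sketch below: try the reported charset first, and only fall back to Mac OS Roman when that fails. The helper name StringFromData and its usedFallback out-parameter are made up for illustration, and the code assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (assumes ARC): decodes data with the reported charset
// when possible, and falls back to Mac OS Roman otherwise. If usedFallback
// is non-NULL, it reports whether the fallback was needed.
static NSString *StringFromData(NSData *data, NSString *charSetName, BOOL *usedFallback) {
    NSString *string = nil;
    if (usedFallback != NULL) {
        *usedFallback = NO;
    }

    // First attempt: the encoding named by the Content-Type header, if any
    if (charSetName != nil) {
        CFStringEncoding cfencoding = CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charSetName);
        if (cfencoding != kCFStringEncodingInvalidId) {
            NSStringEncoding nsencoding = CFStringConvertEncodingToNSStringEncoding(cfencoding);
            string = [[NSString alloc] initWithData:data encoding:nsencoding];
        }
    }

    // Last resort: Mac OS Roman maps every byte value to some character,
    // so this decodes any input, though possibly into the wrong characters
    if (string == nil) {
        string = [[NSString alloc] initWithData:data encoding:NSMacOSRomanStringEncoding];
        if (usedFallback != NULL) {
            *usedFallback = YES;
        }
    }
    return string;
}
```

Checking usedFallback lets the caller decide whether to warn the user that the text may contain stray symbols, instead of silently displaying it.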