Process Unicode string in C and Objective-C

https://stackoverflow.com/questions/23354016

11-07-2023

Question

I am writing a C function that reads characters from a user-input string. Because the string comes from the user, it can contain any Unicode characters. An Objective-C method receives the input as an NSString, converts it to NSData, and passes the data to the C function for processing. The C function searches for these symbol characters: *, [, ], _; it ignores all other characters. Every time it finds one of the symbols, it processes it and then calls an Objective-C method, passing the location of the symbol.

C code:

#include <stddef.h>  /* size_t */
#include <stdint.h>  /* uint8_t */

typedef void (* callback)(void *context, size_t location);

void process(const uint8_t *data, size_t length, callback cb, void *context)
{
    size_t i = 0;
    while (i < length)
    {
        if (data[i] == '*' || data[i] == '[' || data[i] == ']' || data[i] == '_')
        {
            int valid = 0;
            //do something, set valid = 1

            if (valid)
                cb(context, i);
        }
        i++;
    }
}

Objective-C code:

//a C function declared in .m file
void mycallback(void *context, size_t location)
{
    [(__bridge id)context processSymbolAtLocation:location];
}

- (void)processSymbolAtLocation:(NSInteger)location
{
    NSString *result = [self.string substringWithRange:NSMakeRange(location, 1)];
    NSLog(@"%@", result);
}

- (void)processUserInput:(NSString*)string
{
    self.string = string;
    //convert string to data
    NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
    //pass data to C function
    process(data.bytes, data.length, mycallback, (__bridge void *)(self));
}

The code works fine if the input string contains only English characters. If it contains composed character sequences, multibyte characters, or other Unicode characters, the substring produced in the processSymbolAtLocation: method is not the expected symbol.

How do I convert the NSString object to NSData correctly? How do I get the correct location?

Thanks!

Solution

Your problem is that you start with an NSString, which is UTF-16 encoded, and produce a sequence of UTF-8 encoded bytes. The number of code units required to represent a string in UTF-16 may not equal the number required to represent it in UTF-8, so an offset into one form may not match the offset into the other - as you have found out.
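To make the mismatch concrete, here is a minimal sketch in C that translates a UTF-8 byte offset into the corresponding UTF-16 code-unit index. The helper name utf16_index_for_utf8_offset is hypothetical (it is not part of the question's code), and the sketch assumes well-formed UTF-8 and an offset that lands on the first byte of a character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Returns the UTF-16 code-unit index that corresponds to byte_offset
   in well-formed UTF-8 data. */
size_t utf16_index_for_utf8_offset(const uint8_t *data, size_t byte_offset)
{
    size_t units = 0;
    for (size_t i = 0; i < byte_offset; i++)
    {
        uint8_t b = data[i];
        if ((b & 0xC0) == 0x80)
            continue;                 /* continuation byte: counted with its lead byte */
        units += (b >= 0xF0) ? 2 : 1; /* 4-byte sequences become a surrogate pair */
    }
    return units;
}
```

For the string "é*", the '*' sits at byte offset 2 in UTF-8 but at index 1 in UTF-16, which is why passing the byte offset straight to substringWithRange: picks the wrong character.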

Why are you using C to scan the string for matches in the first place? You might want to look at NSString's rangeOfCharacterFromSet:options:range: method, which you can use to find the next occurrence of a character from your set.

If you need to use C, then convert your string into a sequence of UTF-16 code units and use uint16_t on the C side.
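As a rough sketch of the C side, the question's process() could be adapted to scan 16-bit code units instead of bytes. The names process_utf16, hit_list and record_hit are hypothetical, and the "valid" check from the question is left out for brevity. On the Objective-C side, one way to obtain the buffer is NSString's getCharacters:range:, which fills an array of unichar (a 16-bit unsigned type).

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*callback)(void *context, size_t location);

/* Hypothetical UTF-16 variant of process(): length counts code units,
   not bytes, so each reported location is an NSString index. */
void process_utf16(const uint16_t *units, size_t length, callback cb, void *context)
{
    for (size_t i = 0; i < length; i++)
    {
        uint16_t u = units[i];
        /* Surrogate halves lie in 0xD800..0xDFFF, so they never match ASCII. */
        if (u == '*' || u == '[' || u == ']' || u == '_')
            cb(context, i);
    }
}

/* Hypothetical context used only to demonstrate the callback. */
typedef struct {
    size_t locations[16];
    size_t count;
} hit_list;

/* Records each reported location, up to the capacity of hit_list. */
void record_hit(void *context, size_t location)
{
    hit_list *hits = context;
    if (hits->count < 16)
        hits->locations[hits->count++] = location;
}
```

With the input "é*😀_" encoded as the code units 0x00E9, '*', 0xD83D, 0xDE00, '_', the callback receives locations 1 and 4, the same indexes that substringWithRange: expects.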

HTH

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow