Question

I'm trying to extract some names from HTML. For example, the string may look like this:

Doe, J

The pattern I'm using is:

\w+, \w

Everything works fine as long as the names consist only of letters from the English alphabet. The same pattern doesn't match Spanish or Polish names:

Cortázar, J
Król, S

Obviously the accented characters are the problem. Any ideas on how to make \w match these characters? I looked into the NSRegularExpression options, but I don't think any of them supports this. Or maybe I'm missing the point and should come up with a smarter regular expression?

Solution 2

NSRegularExpression does not seem to handle Unicode characters well. What you could do instead is match everything up to a delimiter, which I assume you have:

^(\X+?),$

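This creates a capture group with the result you want. Because \X matches any extended grapheme cluster, it should match Unicode as well. Note that the delimiter here is a comma at the very end of the string, so adjust it to fit your input.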
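
As a minimal sketch of how the captured group could be read back, assuming the pattern above and an input that really does end with the comma delimiter: the helper name CaptureBeforeTrailingComma is hypothetical and exists only for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of any Apple API): returns the text captured
// by group 1 of ^(\X+?),$ or nil when the input does not end with a comma.
static NSString *CaptureBeforeTrailingComma(NSString *input) {
    NSError *error = nil;
    // Backslashes are doubled because this is an Objective-C string literal.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^(\\X+?),$"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:input
                          options:0
                            range:NSMakeRange(0, input.length)];
    if (match == nil) {
        return nil;
    }
    // Range 0 is the whole match; range 1 is the first capture group.
    return [input substringWithRange:[match rangeAtIndex:1]];
}
```

Under these assumptions, CaptureBeforeTrailingComma(@"Cortázar,") should return the name without the comma, while a string such as @"Doe, J" returns nil because it has no trailing delimiter.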

You can also use the character expressions \u (a specific code point, such as \u00E1) or \p (a Unicode property, such as \p{L} for any letter).
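
A rough sketch of the \p approach, assuming the asker's original pattern with \w swapped for the Unicode letter property \p{L}. The helper name FirstNameMatch is made up for this example, and the pattern is one possible choice rather than the only one.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first "Surname, Initial" match in the
// input, or nil if there is none. \p{L} matches any Unicode letter.
static NSString *FirstNameMatch(NSString *input) {
    NSError *error = nil;
    // In the string literal every regex backslash has to be doubled.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\p{L}+, \\p{L}"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }
    NSRange range = [regex rangeOfFirstMatchInString:input
                                             options:0
                                               range:NSMakeRange(0, input.length)];
    // rangeOfFirstMatchInString: reports {NSNotFound, 0} when nothing matches.
    if (range.location == NSNotFound) {
        return nil;
    }
    return [input substringWithRange:range];
}
```

With this sketch, FirstNameMatch(@"Król, S") should return the whole name, and a string without the comma-and-initial shape, such as @"Don't Match This", should return nil.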

OTHER TIPS

One of the problems with \w is that you need to get it into a string with all the backslashes properly escaped.

NSArray *names = @[@"Cortázar, J", @"Król, S", @"Don't Match This", @"Doe, J", @"Høegh, K"];

NSString *pattern = @"\\w+, \\w";
NSPredicate *pred = [NSPredicate predicateWithFormat: @"self MATCHES %@", pattern];

NSArray* result = [names filteredArrayUsingPredicate: pred];

It correctly matches the names, but leaves out the "wrong" string at index 2.

This shows that you can match the strings with an NSRegularExpression too, because the predicate engine uses the same regular expression syntax. Note that MATCHES tests the whole string against the pattern.

Edited to add:

If you insist on using an NSRegularExpression directly, then you can see it work with a little more code:

// The names and pattern variables taken from code above

NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                       options:NSRegularExpressionCaseInsensitive
                                                                         error:&error];

for (NSString *string in names) {

    NSRange rangeOfFirstMatch = [regex rangeOfFirstMatchInString:string options:0 range:NSMakeRange(0, [string length])];
    if (!NSEqualRanges(rangeOfFirstMatch, NSMakeRange(NSNotFound, 0))) {
        NSString *result = [string substringWithRange:rangeOfFirstMatch];
        NSLog(@"Match: '%@'", result);
    }
    else {
        NSLog(@"No match: '%@'", string);
    }
}

You will see that it matches the names, with either pure ASCII or various European accented characters, but it does not match the string "Don't Match This".

Licensed under: CC-BY-SA with attribution