Question

I am writing a string compare function to sort medical terms that often contain special accented characters from many different European languages, and I need to somehow achieve a collation similar to MySQL's latin1_general_ci.

First, I'm doing some basic munging on the strings to remove spaces, quotes, hyphens, parentheses, etc. The problem comes when I pass the strings on to strcoll() using the default locale, because it is not smart enough to consider, for example, an accented e as lexicographically equivalent to a normal e.

I'm wary of using a locale like German or French because it probably will not include all of the special characters I need to consider. Is there a locale that will give me something similar to the latin1_general_ci collation? Or is there maybe another solution?

My naive solution would be to create a large associative array that maps accented letters to their unaccented equivalents and then use it with str_replace(), but that sounds slow, tedious, and error-prone. I would rather use something built into the language if possible.
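For reference, the mapping approach described above can be sketched as follows. This is a minimal illustration, not a complete solution: the names ACCENT_MAP, foldAccents() and compareTerms() are hypothetical, the map is deliberately incomplete, and the strings are assumed to be UTF-8. It uses strtr() with an array rather than str_replace(), because strtr() applies all replacements in a single pass.

```php
<?php
// Hypothetical, deliberately incomplete map (assumes UTF-8 source and data).
// Extend it with whatever characters actually occur in the terms.
const ACCENT_MAP = [
    'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
    'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
    'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
    'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    'ç' => 'c', 'ñ' => 'n',
    'É' => 'E', 'È' => 'E', 'Ç' => 'C', 'Ñ' => 'N',
];

// Replace every mapped accented letter with its base letter.
function foldAccents(string $s): string
{
    // With an array argument, strtr() performs all replacements in one pass.
    return strtr($s, ACCENT_MAP);
}

// Case-insensitive comparison on the folded strings, usable with usort().
function compareTerms(string $a, string $b): int
{
    return strcasecmp(foldAccents($a), foldAccents($b));
}

$terms = ['Zygote', 'épiderme', 'Eczema', 'fémur'];
usort($terms, 'compareTerms');
// $terms is now ['Eczema', 'épiderme', 'fémur', 'Zygote']
```

Terms that differ only by accents or case compare as equal here, so their relative order after usort() is not guaranteed.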

Also on that note, does strcmp() or strcasecmp() respect the collation of the current locale, or is it just strcoll() that does this?


Solution

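Maybe try setting a French Latin-1 collation locale before calling strcoll(). setlocale() tries each name in turn and uses the first one that is installed: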

setlocale(LC_COLLATE, 'fr_FR.Latin1', 'fr.Latin1', 'fr_FR.Latin-1', 'fr.Latin-1');

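strcmp() and strcasecmp() are not localized: they compare byte values (strcasecmp() after case folding) and ignore LC_COLLATE. Of the three, only strcoll() uses the collation of the current locale.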
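A minimal sketch of how this could be wired into a sort, assuming the strings are encoded to match the chosen locale (ISO-8859-1 for the Latin-1 locales above). The function name sortByLocale() is hypothetical, and whether any of the locale names is installed depends on the system; setlocale() returns false when none is, in which case strcoll() uses whatever locale is already active.

```php
<?php
// Sort $terms with strcoll() under the first installed locale in $candidates.
function sortByLocale(array $terms, array $candidates): array
{
    // Passing '0' returns the current setting without changing it.
    $previous = setlocale(LC_COLLATE, '0');

    // Tries each candidate in turn; returns false if none is installed.
    $chosen = setlocale(LC_COLLATE, $candidates);
    if ($chosen === false) {
        // No candidate available: strcoll() falls back to the active locale,
        // which in the default "C" locale is plain byte order.
    }

    usort($terms, 'strcoll');

    // The locale is process-wide, so restore it afterwards.
    if ($previous !== false) {
        setlocale(LC_COLLATE, $previous);
    }
    return $terms;
}

$sorted = sortByLocale(
    ['b', 'c', 'a'],
    ['fr_FR.Latin1', 'fr.Latin1', 'fr_FR.Latin-1', 'fr.Latin-1']
);
```

Checking the return value of setlocale() is worth doing in real code, since a silent fallback to the "C" locale reproduces the original problem.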

OTHER TIPS

You can also try the iconv functions to help normalize the strings before comparing them. Transliteration can handle cases like mapping an accented e to a plain e. See also the related question about sorting UTF-8 strings.
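A minimal sketch of the iconv idea, assuming UTF-8 input and an iconv implementation that supports the //TRANSLIT suffix. The function names toAsciiKey() and compareTransliterated() are hypothetical. The exact output of transliteration depends on the iconv library and on the current LC_CTYPE locale: some setups may produce "?" or extra punctuation instead of a plain base letter, so check the results on the target system.

```php
<?php
// Build an ASCII comparison key from a UTF-8 string.
function toAsciiKey(string $utf8): string
{
    // //TRANSLIT asks iconv to approximate characters that ASCII lacks.
    // The @ suppresses the warning raised when conversion is unsupported.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

    // Fall back to the original string if the conversion failed.
    return $ascii === false ? $utf8 : $ascii;
}

// Case-insensitive comparison on the transliterated keys, usable with usort().
function compareTransliterated(string $a, string $b): int
{
    return strcasecmp(toAsciiKey($a), toAsciiKey($b));
}

$terms = ['Zygote', 'épiderme', 'Eczema', 'fémur'];
usort($terms, 'compareTransliterated');
```

The original strings are kept for display; only the comparison uses the transliterated form.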

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow