Question

I'm looking for a way to compare string values where certain characters within the strings may be punctuated characters such as é or ô, but the punctuation should be disregarded. For example, when searching a list of names, the user might provide the criteria Rene, which should match the list entries Rene and René (i.e. character codes 101 and 233 should be regarded as the same thing).

Thanx

EDIT: Preferably across all Unicode characters. I suppose one could [should?] implement some custom solution for this; I was just wondering if there is something that already exists, almost like Char.GetBaseCharacterFromPunctuatedCharacter(char) :P


Solution

You did not say which language you are using, so I will answer using Java. Other languages have similar constructs. Also, you mean diacritics, not punctuation (.,?!...).

The Collator class (java.text.Collator) supports a configurable comparison strength. For example, for Czech, a difference in diacritics is considered to be a secondary difference, so a comparison at primary strength ignores it.
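A minimal sketch of the Collator approach, assuming the root locale's default rules are acceptable for your data. The class name CollatorDemo and the helper sameBaseLetters are made up for illustration. Note that primary strength also ignores case differences, not only diacritics.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    // Hypothetical helper: true when a and b differ at most by
    // diacritics or letter case (i.e. no primary difference).
    static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // PRIMARY strength: only base-letter differences count.
        collator.setStrength(Collator.PRIMARY);
        // Also handle input that is already decomposed (e + combining accent).
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "Ren\u00e9" is "René" written with a Unicode escape.
        System.out.println(sameBaseLetters("Rene", "Ren\u00e9"));
        System.out.println(sameBaseLetters("Rene", "Rena"));
    }
}
```

If you need language-specific rules, pass a concrete Locale to Collator.getInstance instead of Locale.ROOT.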

Or you might want to remove any diacritics prior to comparison by applying Unicode canonical decomposition (decomposing all characters into basic letters + diacritics) and then removing the diacritics with a simple regex (see this SO question for an explanation of InCombiningDiacriticalMarks). In Java:

import java.text.Normalizer;
import java.text.Normalizer.Form;

public static String removeDiacritics(String str) {
  return Normalizer
     .normalize(str, Form.NFD)   // decompose into letters + diacritics
     .replaceAll("\\p{InCombiningDiacriticalMarks}+", ""); // remove diacritics
}
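To tie this back to the name search in the question, here is a small sketch that strips diacritics from both the criteria and each list entry before comparing. The class DiacriticSearch and the helper findMatches are hypothetical names, and ignoring case is an extra assumption on my part. Be aware that letters such as ø or ł have no canonical decomposition, so this approach leaves them unchanged.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class DiacriticSearch {
    // Same technique as above: decompose, then drop combining marks.
    static String removeDiacritics(String str) {
        return Normalizer.normalize(str, Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: entries equal to the criteria once
    // diacritics (and, by assumption, case) are disregarded.
    static List<String> findMatches(List<String> names, String criteria) {
        String key = removeDiacritics(criteria);
        return names.stream()
                .filter(name -> removeDiacritics(name).equalsIgnoreCase(key))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "Ren\u00e9" is "René" written with a Unicode escape.
        List<String> names = Arrays.asList("Rene", "Ren\u00e9", "Renata");
        // Matches both "Rene" and "René", but not "Renata".
        System.out.println(findMatches(names, "Rene"));
    }
}
```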
Licensed under: CC-BY-SA with attribution