Question

I am looking for a way to normalize a list of titles. Each normalized title is stored in a database as a sort and lookup key. "Normalize" here covers several things, such as converting to lowercase, removing accents from Latin characters, and removing a leading "the", "a", or "an".

On iOS or Mac, the NSString class has the stringByFoldingWithOptions:locale: method, which returns a folded version of a string.

NSString Class Reference - stringByFoldingWithOptions:locale:

In Java, the java.text.Collator class seems useful for comparing strings, but there seems to be no way to convert a string for this purpose.


Solution

You can use java.text.Normalizer, which comes close to normalizing strings in Java. Regular expressions are also a powerful way to manipulate the resulting strings.

Example of accent removal:

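import java.text.Normalizer;

String accented = "árvíztűrő tükörfúrógép";
// NFD splits each accented letter into a base letter plus combining marks.
String normalized = Normalizer.normalize(accented, Normalizer.Form.NFD);
// Remove everything outside ASCII: this drops the combining marks, and any other non-ASCII character too.
normalized = normalized.replaceAll("[^\\p{ASCII}]", "");

System.out.println(normalized);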

Output:

arvizturo tukorfurogep
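
The other steps from the question (lowercasing and removing a leading article) can be layered on the same idea. The following is only a minimal sketch, assuming English articles; the class name TitleKey and the method sortKey are hypothetical names chosen for illustration. It strips only combining marks (\p{M}) instead of every non-ASCII character, so letters from other scripts survive.

```java
import java.text.Normalizer;
import java.util.Locale;

class TitleKey {
    // Hypothetical helper: builds a sort/lookup key from a title.
    static String sortKey(String title) {
        // Decompose accented letters into base letter + combining marks.
        String s = Normalizer.normalize(title, Normalizer.Form.NFD);
        // Drop the combining marks only; other non-ASCII letters are kept.
        s = s.replaceAll("\\p{M}+", "");
        // Lowercase with a fixed locale so the key does not depend on the default locale.
        s = s.toLowerCase(Locale.ROOT).trim();
        // Remove one leading English article that is followed by whitespace.
        s = s.replaceFirst("^(the|an|a)\\s+", "");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sortKey("The Árvíztűrő Tükörfúrógép"));
        System.out.println(sortKey("An Éclair"));
    }
}
```

Because the article pattern requires whitespace after the article, a title such as "Anthem" is left intact. Letters that have no canonical decomposition (for example "ø" or "ß") are not changed by this approach and would need an explicit mapping.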

More explanation here: http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
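
On the java.text.Collator point from the question: Collator can also produce a key through getCollationKey, and at PRIMARY strength it generally ignores case and accent differences. The sketch below is a hedged illustration, assuming an English locale; the class name CollatorKeyDemo and the helper foldingCollator are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

class CollatorKeyDemo {
    // Hypothetical helper: a collator that treats case and accent variants as equal.
    static Collator foldingCollator() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength compares base letters only.
        c.setStrength(Collator.PRIMARY);
        // Canonical decomposition so precomposed accented letters are handled consistently.
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    public static void main(String[] args) {
        Collator c = foldingCollator();
        CollationKey a = c.getCollationKey("Éclair");
        CollationKey b = c.getCollationKey("eclair");
        // Keys from the same collator can be compared directly.
        System.out.println(a.compareTo(b) == 0);
        // The raw bytes could be stored as a binary sort key.
        byte[] storable = a.toByteArray();
        System.out.println(storable.length);
    }
}
```

The key is an opaque byte sequence tied to the collator's rules, so it is suited to sorting and equality checks rather than display, and it does not remove leading articles.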

Licensed under: CC-BY-SA with attribution