Question

Is there an encoding where an accented character like á or ä is treated as a single character? If not, what is the most commonly used encoding today? I'm currently using UTF-7; how compatible is it with other encodings?

Thanks


Solution

You might think about what you're asking for. You're asking for an encoding that will recognize 'á' and turn it into 'a'. That's a converter, not an encoding. It would have to know what encoding the source is in so that it can convert to whatever encoding you're using.

Wait, maybe that's not what you're asking. There are encodings that store those characters as single bytes. For example, the ISO-8859-1 encoding (also called Latin-1) encodes many accented characters, including á and ä, as a single byte each.
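To make the single-byte point concrete, here is a minimal sketch that counts the bytes needed for á in ISO-8859-1 and in UTF-8. The class name `SingleByteDemo` and the helper `ByteCount` are made up for illustration; `Encoding.GetEncoding` and `GetByteCount` are the standard .NET calls.

```csharp
using System;
using System.Text;

public static class SingleByteDemo
{
    // Hypothetical helper: number of bytes the named encoding needs for the text.
    public static int ByteCount(string encodingName, string text) =>
        Encoding.GetEncoding(encodingName).GetByteCount(text);

    public static void Main()
    {
        string aAcute = "\u00E1"; // á as one precomposed code point

        Console.WriteLine(ByteCount("iso-8859-1", aAcute)); // 1 (byte 0xE1)
        Console.WriteLine(ByteCount("utf-8", aAcute));      // 2 (bytes 0xC3 0xA1)
    }
}
```

Either way the string holds a single character; only the number of bytes it occupies on disk or on the wire differs.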

(The following paragraph is struck out because I was talking about ASCII, not UTF-7 ... long day.)

[Struck out: UTF-7 isn't particularly compatible with many other encodings. It has 128 possible values: just enough space for the 52 letters (upper and lower case, combined) used in the Latin alphabet, the 10 numerals, 32 control characters, and various punctuation marks. But it's not sufficient for Spanish, for example, which has upside-down question marks and exclamation points as well as other things.]

UTF-7 is "compatible" with other encodings in that it can represent the entire Unicode character set. But only some characters (known as the "direct characters") and a few control characters can be directly encoded as single ASCII bytes. Those characters will be the same as in UTF-8 and in many single-byte character sets. All other characters are represented by multi-byte sequences (modified Base64 introduced by '+'), and will be different from any other encoding.
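A rough sketch of that difference, assuming the classic `UTF7Encoding` class is still usable in your runtime (it is marked obsolete in .NET 5 and later, hence the pragma). `Utf7Demo` and `EncodeAsAscii` are illustrative names, not part of any library.

```csharp
using System;
using System.Text;

public static class Utf7Demo
{
    // Hypothetical helper: encode text as UTF-7 and show the resulting bytes as ASCII text.
    public static string EncodeAsAscii(string text)
    {
#pragma warning disable SYSLIB0001 // UTF-7 APIs are marked obsolete in .NET 5+
        byte[] bytes = new UTF7Encoding().GetBytes(text);
#pragma warning restore SYSLIB0001
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        // Direct characters pass through unchanged.
        Console.WriteLine(EncodeAsAscii("abc")); // abc

        // A non-ASCII character becomes a '+'-introduced sequence several bytes long.
        Console.WriteLine(EncodeAsAscii("\u00E1"));
    }
}
```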

The most commonly used encoding today? On the Web, UTF-8 is by far the most common. It's also the default encoding used when you create a StreamWriter in .NET. For the work I do (mostly English and Western European character sets), it works better than anything else.
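The StreamWriter default can be checked with a small sketch like the one below, which writes á to a MemoryStream without naming an encoding and dumps the resulting bytes. `StreamWriterDemo` and `WriteWithDefaultEncoding` are names invented for this example.

```csharp
using System;
using System.IO;

public static class StreamWriterDemo
{
    // Hypothetical helper: write text with StreamWriter's default encoding and return the bytes.
    public static byte[] WriteWithDefaultEncoding(string text)
    {
        var buffer = new MemoryStream();
        using (var writer = new StreamWriter(buffer)) // no encoding argument given
        {
            writer.Write(text);
        }
        return buffer.ToArray(); // ToArray still works after the stream is closed
    }

    public static void Main()
    {
        byte[] bytes = WriteWithDefaultEncoding("\u00E1"); // á
        Console.WriteLine(BitConverter.ToString(bytes));   // C3-A1 (UTF-8, no byte order mark)
    }
}
```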

Now, it's possible that what you're looking for is something that will treat 'á' and 'a' as the same in comparisons. That's a different question. See Performing Culture-Insensitive String Comparisons for information on that.
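If accent-insensitive comparison is the goal, one common approach (not necessarily the one the linked article describes) is `CompareOptions.IgnoreNonSpace`. The sketch below assumes culture data is available, so it may behave differently under invariant-globalization mode. `AccentInsensitiveDemo` and `EqualIgnoringAccents` are illustrative names.

```csharp
using System;
using System.Globalization;

public static class AccentInsensitiveDemo
{
    // Hypothetical helper: true when the strings differ only by diacritics (or not at all).
    public static bool EqualIgnoringAccents(string left, string right) =>
        string.Compare(left, right, CultureInfo.InvariantCulture,
                       CompareOptions.IgnoreNonSpace) == 0;

    public static void Main()
    {
        // Expected to print True when culture data is available.
        Console.WriteLine(EqualIgnoringAccents("\u00E1", "a")); // á vs a

        // Different base letters never compare equal.
        Console.WriteLine(EqualIgnoringAccents("a", "b")); // False
    }
}
```

Note that this changes how strings are compared, not how they are stored; the original text keeps its accents.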

OTHER TIPS

This doesn't seem to have anything to do with encodings. In C#, it doesn't matter what encoding you use for storage and transmission: strings are always UTF-16 internally, and ä is always one char long in composed form.

If "ä".Length returns 2 for you, your string is in decomposed form, and all you need to do is normalize it:

string str = "a\u0308"; // a + U+0308 (combining diaeresis), .Length == 2
str = str.Normalize(NormalizationForm.FormC); // now the single char U+00E4 (ä), .Length == 1

Sorry for the confusion over this issue. I finally found what I was looking for: my text needed to use the Windows-1250 (Central European (Windows)) code page, because that is what a lot of other programs use, and it correctly supports characters like €, đ, ł, Ł, ¤, etc.
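For anyone landing here with the same need, a minimal sketch of encoding text as Windows-1250 follows. It assumes a runtime where `CodePagesEncodingProvider` is available (built into recent .NET versions; on .NET Framework it comes from the System.Text.Encoding.CodePages package, and `Encoding.GetEncoding(1250)` works there without registration). `Windows1250Demo` and `GetWindows1250` are made-up names.

```csharp
using System;
using System.Text;

public static class Windows1250Demo
{
    // Hypothetical helper: obtain the Windows-1250 encoding.
    public static Encoding GetWindows1250()
    {
        // On .NET Core / .NET 5+, legacy code pages must be registered before use.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1250);
    }

    public static void Main()
    {
        string sample = "\u20AC\u0111\u0142\u0141\u00A4"; // € đ ł Ł ¤
        Encoding cp1250 = GetWindows1250();

        byte[] bytes = cp1250.GetBytes(sample);
        Console.WriteLine(bytes.Length);                      // 5: one byte per character
        Console.WriteLine(cp1250.GetString(bytes) == sample); // True: the text round-trips
    }
}
```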

Thanks for all the help I got. It was a useful learning experience.

Licensed under: CC-BY-SA with attribution