Question

How can I check whether a text file contains only characters that are valid in a given country's code page?

The files get transferred to a Linux system, so every character has to exist in the code page.

I searched Google but couldn't find anything helpful.

Is there a "clean" way to check this or are there only "dirty" (static) ways to do this?

Update: The situation is that I have to check resource files that contain the translations for an application. These files were translated in different countries, so it can easily happen that a wrong character was typed in, and later the application can't display it correctly. Windows always falls back to the closest-looking character, but Linux doesn't. That's the point.

Solution

You said "Assume the files come with UTF-8".

In that case, assume that you can read the file into a C# string or array of strings.

For example, if you have a byte[] array, you can convert it to a C# (UTF-16) string like so:

var text = Encoding.UTF8.GetString(utf8Bytes);

Or you could read it directly from a file into a C# string, using the UTF-8 encoding. Let's assume you can do that bit yourself.
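For the file-reading step, here is a minimal sketch. The class and method names (Utf8FileReader, ReadStrictUtf8) are hypothetical, and the choice of a strict decoder is an assumption on my part: it makes malformed UTF-8 fail loudly instead of being silently replaced with U+FFFD.

```csharp
using System.IO;
using System.Text;

public static class Utf8FileReader
{
    // Hypothetical helper: reads a whole file as strict UTF-8.
    // throwOnInvalidBytes: true makes the decoder throw a
    // DecoderFallbackException on malformed input instead of
    // inserting U+FFFD replacement characters.
    public static string ReadStrictUtf8(string path)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        return File.ReadAllText(path, strictUtf8);
    }
}
```

If the translators' files are not guaranteed to be well-formed UTF-8, catching DecoderFallbackException around this call tells you which file is broken before any code page check runs.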

Now, given that you have a C# string, you can call Encoding.GetEncoding() with a code page parameter, an EncoderExceptionFallback, and a DecoderExceptionFallback to check whether the string is valid in that particular code page, like so:

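// Requires: using System.Text;
public static bool IsStringValidForCodePage(string text, int codePage)
{
    // The exception fallbacks make the encoding throw instead of silently substituting '?'.
    var encoding = Encoding.GetEncoding(codePage, new EncoderExceptionFallback(), new DecoderExceptionFallback());

    try
    {
        encoding.GetBytes(text);
    }
    catch (EncoderFallbackException)
    {
        // At least one character has no representation in this code page.
        return false;
    }

    return true;
}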

The following example uses the Greek Pi character, "π", which is valid in code page 1253 (Greek) and invalid in code page 1252 (Latin 1).

string pi = "π"; // Mmmm. I like pi.

if (IsStringValidForCodePage(pi, 1252))
    Console.WriteLine("Pi is ok in 1252");
else
    Console.WriteLine("Pi is NOT ok in 1252"); // Prints NOT ok.

if (IsStringValidForCodePage(pi, 1253))
    Console.WriteLine("Pi is ok in 1253");  // Prints ok.
else
    Console.WriteLine("Pi is NOT ok in 1253");
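A yes/no answer may not be enough when a translator has to fix the file, so here is a rough sketch that reports where the offending characters are. The names (CodePageChecker, FindInvalidCharIndexes) are hypothetical, and encoding one character at a time is a simplification chosen for clarity, not speed. Note that on .NET Core and .NET 5+, Windows code pages such as 1252 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); ISO-8859-1 (code page 28591) works without it.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CodePageChecker
{
    // Hypothetical helper: returns the string indexes of all characters
    // that cannot be encoded in the given code page.
    public static List<int> FindInvalidCharIndexes(string text, int codePage)
    {
        // On .NET Core / .NET 5+, code pages like 1252 need
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
        var encoding = Encoding.GetEncoding(
            codePage,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        var invalid = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            // Keep surrogate pairs together so they are tested as one character.
            int length = char.IsSurrogatePair(text, i) ? 2 : 1;

            try
            {
                encoding.GetBytes(text.Substring(i, length));
            }
            catch (EncoderFallbackException)
            {
                invalid.Add(i);
            }

            i += length - 1;
        }

        return invalid;
    }
}
```

For example, FindInvalidCharIndexes("aπb", 28591) should report index 1, because "π" has no representation in ISO-8859-1. Throwing one exception per bad character is slow, so for large files it may be worth checking the whole string first and only scanning character by character when that check fails.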

OTHER TIPS

If you can get the translators to give you UTF-8 text, you can use a program to convert to the desired code page. You load the string into memory, create an instance of the target Encoding, and then call Encoding.GetBytes to convert the string to the proper byte sequence. Read the documentation there and the linked article about character encodings to learn how to detect and handle translation errors.
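The conversion described above could be sketched roughly as follows. CodePageConverter and ConvertUtf8File are hypothetical names, and using an exception fallback (rather than a replacement character) is an assumption that matches the goal of detecting bad characters instead of hiding them.

```csharp
using System.IO;
using System.Text;

public static class CodePageConverter
{
    // Hypothetical helper: converts a UTF-8 text file to the given code page.
    // Throws EncoderFallbackException if any character cannot be represented.
    public static void ConvertUtf8File(string sourcePath, string targetPath, int codePage)
    {
        string text = File.ReadAllText(sourcePath, Encoding.UTF8);

        // On .NET Core / .NET 5+, code pages like 1252 need
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
        var target = Encoding.GetEncoding(
            codePage,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        // Encode fully in memory first, so nothing is written
        // if the text contains an unmappable character.
        byte[] bytes = target.GetBytes(text);

        File.WriteAllBytes(targetPath, bytes);
    }
}
```

Encoding before writing means a failed conversion never leaves a half-written target file behind.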

Update in response to comment:

If you set the Encoder.Fallback property, the fallback you supply is invoked whenever a character cannot be converted. So if the fallback is called, there was a conversion error, which means that you don't have to manually examine the converted text.
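As a small illustration of the Encoder.Fallback approach, the sketch below sets an EncoderExceptionFallback on an Encoder instance, so a conversion error surfaces as an exception. The names (EncoderFallbackDemo, CanEncode) are hypothetical, and using the built-in exception fallback instead of a custom EncoderFallback subclass is my simplification.

```csharp
using System.Text;

public static class EncoderFallbackDemo
{
    // Hypothetical helper: returns true if every character in the text
    // can be encoded by the given encoding.
    public static bool CanEncode(string text, Encoding encoding)
    {
        Encoder encoder = encoding.GetEncoder();

        // The fallback is invoked whenever a character cannot be converted;
        // this one reacts by throwing an EncoderFallbackException.
        encoder.Fallback = new EncoderExceptionFallback();

        char[] chars = text.ToCharArray();

        try
        {
            // Counting the bytes runs the same conversion logic
            // without allocating an output buffer.
            encoder.GetByteCount(chars, 0, chars.Length, flush: true);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

With this sketch, CanEncode("π", Encoding.ASCII) should return false, while plain ASCII text should return true.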

Character set conversion can be a difficult problem. I strongly suggest that you read the article Character Encoding in the .NET Framework.

Licensed under: CC-BY-SA with attribution