You said "Assume the files come with UTF-8".
In that case, assume that you can read the file into a C# string or array of strings.
For example, if you have a byte[] array, you can convert it to a C# (UTF-16) string like so:

    var text = Encoding.UTF8.GetString(utf8Bytes);

Or you could read it directly from a file into a C# string, using the UTF-8 encoding. Let's assume you can do that bit yourself.
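
For completeness, a minimal sketch of that step might look like this. The names Utf8FileReader and ReadUtf8File are made up for illustration, and using a strict UTF8Encoding (throwOnInvalidBytes set to true) is an assumption on my part: it should make malformed input fail with a DecoderFallbackException instead of being silently replaced with U+FFFD.

```csharp
using System.IO;
using System.Text;

public static class Utf8FileReader
{
    // Hypothetical helper: reads the whole file into a C# string, decoding it as UTF-8.
    public static string ReadUtf8File(string path)
    {
        // Arguments: encoderShouldEmitUTF8Identifier = false, throwOnInvalidBytes = true.
        // The strict setting is an assumption; plain Encoding.UTF8 would also work
        // if you prefer lenient decoding.
        var strictUtf8 = new UTF8Encoding(false, true);
        return File.ReadAllText(path, strictUtf8);
    }
}
```

If the files are large, reading line by line with File.ReadLines(path, encoding) is probably a better fit than loading everything at once.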
Now, given that you have a C# string, you can call Encoding.GetEncoding() with a code page parameter, an EncoderExceptionFallback and a DecoderExceptionFallback to check whether the string is valid in that particular code page, like so:
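
    // Requires: using System.Text;
    // Note: on .NET Core / .NET 5+, Windows code pages such as 1252 and 1253 are only
    // available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    public static bool IsStringValidForCodePage(string text, int codePage)
    {
        // The exception fallbacks make the encoding throw instead of substituting '?'.
        var encoding = Encoding.GetEncoding(codePage, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            encoding.GetBytes(text);
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no representation in this code page.
            return false;
        }

        return true;
    }
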
The following example uses the Greek letter pi, "π", which is valid in code page 1253 (Greek) and invalid in code page 1252 (Western European, often loosely called Latin 1).

    string pi = "π"; // Mmmm. I like pi.

    if (IsStringValidForCodePage(pi, 1252))
        Console.WriteLine("Pi is ok in 1252");
    else
        Console.WriteLine("Pi is NOT ok in 1252"); // Prints NOT ok.

    if (IsStringValidForCodePage(pi, 1253))
        Console.WriteLine("Pi is ok in 1253"); // Prints ok.
    else
        Console.WriteLine("Pi is NOT ok in 1253");
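
If you also want to know where the string goes wrong, the same exception-fallback approach can be extended. This is only a sketch: the names CodePageDiagnostics and IndexOfFirstInvalidChar are hypothetical, and it relies on EncoderFallbackException.Index reporting the position of the offending character in the input.

```csharp
using System.Text;

public static class CodePageDiagnostics
{
    // Hypothetical helper: returns the index of the first character that cannot be
    // encoded in the given code page, or -1 if the whole string can be encoded.
    // On .NET Core / .NET 5+, Windows code pages such as 1252 need
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
    public static int IndexOfFirstInvalidChar(string text, int codePage)
    {
        var encoding = Encoding.GetEncoding(codePage, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            encoding.GetBytes(text);
            return -1;
        }
        catch (EncoderFallbackException ex)
        {
            // Index is the position in the input where encoding failed.
            return ex.Index;
        }
    }
}
```

For a string like "abπ" checked against code page 1252, this would be expected to point at the "π". The exception also exposes CharUnknown if you want to log the character itself.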