There are a few methods you could use to make a best guess at whether or not a file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.
Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.
What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.
// Requires: using System.IO; using System.Text; using System.Threading.Tasks;
public static async Task<bool> IsValidTextFileAsync(string path,
                                                    int scanLength = 4096)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        // scanLength counts decoded characters, not bytes. ReadBlockAsync only
        // returns fewer characters than requested at end of file, so a short
        // read (small file, multi-byte characters, BOM) is not an error.
        var buffer = new char[scanLength];
        var charsRead = await reader.ReadBlockAsync(buffer, 0, scanLength);

        for (int i = 0; i < charsRead; i++)
        {
            var c = buffer[i];
            // Tabs and line breaks are control characters too, but they are
            // normal in text files, so only the other control characters count.
            if (char.IsControl(c) && c != '\t' && c != '\r' && c != '\n')
                return false;
        }
        return true;
    }
}
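
As a variation on the same scan, the check can also be run on raw bytes, which sidesteps any question of how a decoder handles invalid input. The following is only a minimal sketch, not a definitive implementation: the class name TextFileSniffer and the method LooksLikeText are hypothetical, and it assumes ASCII or UTF-8 content, where every byte of a multi-byte UTF-8 sequence is 0x80 or above and so never looks like a control byte. It treats tab, line feed, and carriage return as acceptable and rejects every other byte below 0x20, plus 0x7F.

```csharp
using System;
using System.IO;

public static class TextFileSniffer
{
    // Hypothetical helper: returns false if the first scanLength bytes contain
    // a control byte other than tab, line feed, or carriage return.
    public static bool LooksLikeText(string path, int scanLength = 4096)
    {
        var buffer = new byte[scanLength];
        int total = 0;

        using (var stream = File.OpenRead(path))
        {
            int read;
            // Stream.Read may return fewer bytes than requested,
            // so keep reading until the buffer is full or the file ends.
            while (total < scanLength &&
                   (read = stream.Read(buffer, total, scanLength - total)) > 0)
            {
                total += read;
            }
        }

        for (int i = 0; i < total; i++)
        {
            byte b = buffer[i];
            bool allowedWhitespace = b == 0x09 || b == 0x0A || b == 0x0D;
            if ((b < 0x20 || b == 0x7F) && !allowedWhitespace)
                return false;
        }
        return true;
    }

    public static void Main()
    {
        string textPath = Path.GetTempFileName();
        string binaryPath = Path.GetTempFileName();
        try
        {
            File.WriteAllText(textPath, "hello\r\nworld\t!");
            // "MZ" followed by a NUL byte, similar to the start of a Windows executable.
            File.WriteAllBytes(binaryPath, new byte[] { 0x4D, 0x5A, 0x00, 0x03 });

            Console.WriteLine(LooksLikeText(textPath));   // True
            Console.WriteLine(LooksLikeText(binaryPath)); // False
        }
        finally
        {
            File.Delete(textPath);
            File.Delete(binaryPath);
        }
    }
}
```

Note that this sketch would misclassify UTF-16 text, which contains many zero bytes, and it rejects rarer control bytes such as form feed, so the allowed set may need adjusting for your files.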