How to tell if a file is text-readable in C#

Question 1

There is no general way of figuring type of information stored in the file.

Even if you know in advance that it is some sort of text if you don't know what encoding was used to create file you may not be able to load it properly.

Note that HTTP give you some hints on type of file by content-type header, but there is no such information on file system.

Question 2

There are a few methods you could use to "best guess" whether or not the file is a text file. Of course, the more encodings you support, the harder this becomes, especially if plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.Ascii and Encoding.UTF-8 for now.

Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.

What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.

public static async Task<bool> IsValidTextFileAsync(string path,
                                                    int scanLength = 4096)
{
  using(var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
  using(var reader = new StreamReader(stream, Encoding.UTF8))
  {
    var bufferLength = (int)Math.Min(scanLength, stream.Length);
    var buffer = new char[bufferLength];

    var bytesRead = await reader.ReadBlockAsync(buffer, 0, bufferLength);
    reader.Close();

    if(bytesRead != bufferLength)
      throw new IOException("There was an error reading from the file.");

    for(int i = 0; i < bytesRead; i++)
    {
      var c = buffer[i];

      if(char.IsControl(c))
        return false;
    }

    return true;
  }
}

Question 3

My approach based on @Rubenisme's comment and @Erik's answer.

    public static bool IsValidTextFile(string path)
    {
        using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
        using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8)) 
        {
            var bytesRead = reader.ReadToEnd();
            reader.Close();
            return bytesRead.All(c => // Are all the characters either a:
                c == (char)10  // New line
                || c == (char)13 // Carriage Return
                || c == (char)11 // Tab
                || !char.IsControl(c) // Non-control (regular) character
                );
        }
    }

Question 4

A hacky way to do it would be to see if the file contains any of the lower control characters (0-31) that aren't forms of white space (carriage return, tab, vertical tab, line feed, and just to be safe null and end of text). If it does, then it is probably binary. If it does not, it probably isn't. I haven't done any testing or anything to see what happens when applying this rule to non ASCII encodings, so you'd have to investigate further yourself :)