Question

One of the projects I'm working on is a small text editor.

At one point, you can load all the subdirectories and files in a given directory. The program adds each one as a node in a TreeView.

What I want is to add only the files that a normal text editor could read.

This code currently adds it to the tree:

TreeNode navNode = new TreeNode();
navNode.Text = file.Name;
navNode.Tag = file.FullName;

directoryNode.Nodes.Add(navNode);

I know I could easily create an if statement with something like:

if (file.Extension.Equals(".txt"))

but I would have to expand that statement to contain every single extension that it could possibly be.

Is there an easier way to do this? I'm thinking it may have something to do with the mime types or file encoding.

Solution

There is no general way to figure out what type of information is stored in a file.

Even if you know in advance that it is some sort of text, you may not be able to load it properly unless you also know which encoding was used to create the file.

Note that HTTP gives you a hint about the type of a file through the Content-Type header, but the file system stores no such information.
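
One hint that is sometimes available is a byte-order mark (BOM) at the start of the file. The following is a minimal sketch, not a complete detector: `BomSniffer` and `DetectFromBom` are hypothetical names, and many text files (plain ASCII, UTF-8 without a BOM) carry no mark at all, in which case the method returns null and you are back to guessing.

```csharp
using System.Text;

public static class BomSniffer
{
    // Hypothetical helper: returns the encoding indicated by a leading
    // byte-order mark, or null when the buffer does not start with one.
    public static Encoding DetectFromBom(byte[] head)
    {
        // UTF-32 LE starts with the same two bytes as UTF-16 LE,
        // so it has to be tested first.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return Encoding.UTF32;
        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return new UTF32Encoding(true, true); // UTF-32 big-endian
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;              // UTF-16 little-endian
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode;     // UTF-16 big-endian
        return null;                              // no BOM: encoding unknown
    }
}
```

Passing the first four bytes of a file is enough. A null result does not mean the file is binary, only that it does not announce its encoding.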

OTHER TIPS

There are a few methods you could use to make a "best guess" at whether or not a file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.

Fortunately, most non-text files (executables, images, and the like) contain a lot of non-printable characters in their first couple of kilobytes.

What you could do is scan the first 1-4 KB of a file (the exact amount is up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some confidence about the contents of the file.

// Requires: using System.IO; using System.Text; using System.Threading.Tasks;
public static async Task<bool> IsValidTextFileAsync(string path,
                                                    int scanLength = 4096)
{
  using(var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
  using(var reader = new StreamReader(stream, Encoding.UTF8))
  {
    // scanLength counts characters, not bytes. A UTF-8 file can decode to
    // fewer characters than it has bytes, so a short read is not an error.
    var buffer = new char[scanLength];
    var charsRead = await reader.ReadBlockAsync(buffer, 0, buffer.Length);

    for(int i = 0; i < charsRead; i++)
    {
      var c = buffer[i];

      // Tab, line feed and carriage return are control characters too,
      // but they are perfectly normal in a text file.
      if(char.IsControl(c) && c != '\t' && c != '\n' && c != '\r')
        return false;
    }

    return true;
  }
}

My approach based on @Rubenisme's comment and @Erik's answer.

    // Requires: using System.Linq;
    public static bool IsValidTextFile(string path)
    {
        using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
        using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
        {
            var text = reader.ReadToEnd();
            return text.All(c => // Are all the characters either a:
                c == (char)10  // Line feed
                || c == (char)13 // Carriage return
                || c == (char)9 // Tab
                || !char.IsControl(c) // Non-control (regular) character
                );
        }
    }
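
One caveat with both readers above: as far as I know, a StreamReader created with Encoding.UTF8 does not fail on invalid byte sequences; it substitutes the replacement character U+FFFD, which is not a control character and so passes the check. A possible complement, sketched here with the hypothetical names `StrictUtf8` and `IsValidUtf8`, is to decode the bytes with a `UTF8Encoding` that throws on invalid input.

```csharp
using System.Text;

public static class StrictUtf8
{
    // Hypothetical helper: true if the bytes decode as UTF-8 without errors.
    public static bool IsValidUtf8(byte[] data)
    {
        // First argument: do not emit a BOM when encoding (irrelevant here).
        // Second argument: throw instead of substituting U+FFFD on bad bytes.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

If you only pass the first few kilobytes of a file, the buffer may end in the middle of a multi-byte character and be rejected even though the file is valid, so this works best on whole files or as one signal among several.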

A hacky way to do it would be to check whether the file contains any of the lower control characters (0-31) that aren't forms of whitespace (carriage return, tab, vertical tab, line feed, and, just to be safe, null and end of text). If it does, then it is probably binary. If it does not, it is probably text. I haven't tested what happens when this rule is applied to non-ASCII encodings, so you'd have to investigate further yourself :)
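
The rule above can be sketched directly on raw bytes, without picking an encoding first. This is only an illustration under the same untested assumptions: `BinarySniffer`, `LooksBinary` and the 4096-byte default are made-up names and values, and the tolerated set is exactly the one listed above (0, 3, 9, 10, 11, 13).

```csharp
using System.IO;

public static class BinarySniffer
{
    // Hypothetical helper: true if the first `count` bytes contain a low
    // control byte (0-31) outside the tolerated set.
    public static bool LooksBinary(byte[] data, int count)
    {
        for (int i = 0; i < count; i++)
        {
            byte b = data[i];
            if (b >= 32) continue;                                  // not a low control byte
            if (b == 9 || b == 10 || b == 11 || b == 13) continue;  // tab, LF, vertical tab, CR
            if (b == 0 || b == 3) continue;                         // null and end of text, "just to be safe"
            return true;
        }
        return false;
    }

    // Reads up to scanLength bytes from the start of the file and applies the rule.
    public static bool LooksBinary(string path, int scanLength = 4096)
    {
        var buffer = new byte[scanLength];
        using (var stream = File.OpenRead(path))
        {
            int total = 0, read;
            // Stream.Read may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
            return LooksBinary(buffer, total);
        }
    }
}
```

Tolerating null keeps UTF-16 text from being flagged, but it also lets through some binary files that consist mostly of zero bytes. Dropping the `b == 0` case makes the check stricter.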

Licensed under: CC-BY-SA with attribution