Determine if input file is usable by program

https://stackoverflow.com/questions/23230982

07-07-2023

Question

I have a C# program that looks through directories for .txt files and loads each into a DataTable.

static IEnumerable<string> ReadAsLines(string fileName)
{
    using (StreamReader reader = new StreamReader(fileName))
        while (!reader.EndOfStream)
            yield return reader.ReadLine();
}

public DataTable GetTxtData()
{
    IEnumerable<string> reader = ReadAsLines(this.File);

    DataTable txtData = new DataTable();

    string[] headers = reader.First().Split('\t');

    foreach (string columnName in headers)
        txtData.Columns.Add(columnName);

    IEnumerable<string> records = reader.Skip(1);

    foreach (string rec in records)
        txtData.Rows.Add(rec.Split('\t'));

    return txtData;
}

This works great for regular tab-delimited files. However, the catch is that not every .txt file in the folders I need to use contains tab-delimited data. Some .txt files are actually SQL queries, notes, etc. that have been saved as plain text files, and I have no way of determining that beforehand. Trying to use the above code on such files clearly won't lead to the expected result.

So my question is this: How can I tell whether a .txt file actually contains tab-delimited data before I try to read it into a DataTable using the above code?

Just searching the file for any tab character won't work because, for example, a SQL query saved as plain text might have tabs for code formatting.

Any guidance here at all would be much appreciated!

Solution

If every line is supposed to contain the same number of fields, read each line and verify that each record splits into the expected number of fields. If it does not, error out.

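// CORRECTNUMBER is a placeholder for the column count you expect, if known.
if (headers.Length != CORRECTNUMBER)
{
    // ERROR: the header row has the wrong number of columns
}

foreach (string rec in records)
{
    string[] recordData = rec.Split('\t');
    if (recordData.Length != headers.Length)
    {
         // ERROR: the record does not match the header
    }

    txtData.Rows.Add(recordData);
}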
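
When the expected column count is not known in advance, the same idea can be run as a pre-flight check on the first few lines before building the DataTable. This is a minimal sketch, not part of the answer above: the class `TabFileCheck`, the helper `LooksTabDelimited`, and the sample size of 20 lines are all assumptions made for illustration. It treats the first line as the header, requires at least two columns, and rejects the file as soon as a sampled line has a different field count.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class TabFileCheck
{
    // Hypothetical helper: returns true when every sampled line splits into
    // the same number of tab-separated fields, and that number is at least two.
    public static bool LooksTabDelimited(IEnumerable<string> lines, int sampleSize = 20)
    {
        int expected = -1;

        foreach (string line in lines.Take(sampleSize))
        {
            int fields = line.Split('\t').Length;

            if (expected == -1)
            {
                // The first line is treated as the header row.
                if (fields < 2)
                    return false;
                expected = fields;
            }
            else if (fields != expected)
            {
                return false;
            }
        }

        // An empty input has no header, so it is not usable.
        return expected != -1;
    }

    // Convenience overload that reads lazily from disk.
    public static bool LooksTabDelimited(string path, int sampleSize = 20)
    {
        return LooksTabDelimited(File.ReadLines(path), sampleSize);
    }
}
```

A caller could test `TabFileCheck.LooksTabDelimited(this.File)` and only then call `GetTxtData()`. This is a heuristic: a blank line inside the sample makes the check fail as written, and a text file that happens to have the same number of tabs on every sampled line would still pass.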

OTHER TIPS

To do this you need a set of "signature" logic providers which can check a given sample of the file for "signature" content. This is similar to how virus scanners work.

Consider creating a set of classes that each implement a common ISignature interface:

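// ISignature and enumFileType are assumed to be defined elsewhere.
class TSVFile : ISignature
{
    // Inspect a sample of the file and report whether it looks tab-delimited.
    enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream)
    {
        throw new NotImplementedException();
    }
}

class SQLFile : ISignature
{
    // Inspect a sample of the file and report whether it looks like SQL.
    enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream)
    {
        throw new NotImplementedException();
    }
}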

Each class would read an appropriate number of bytes and return the file type it recognizes, if it can make a determination. Each one needs its own logic to decide how many bytes to read and on what basis to make its evaluation.
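
The signature idea can be sketched end to end as follows. This is an illustration under stated assumptions, not the answerer's implementation: it works on sampled lines of text rather than raw bytes to keep the example short, and the names `FileKind`, `TsvSignature`, `SqlSignature`, `FileClassifier`, the keyword list, and the sample size of 20 lines are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

enum FileKind { Unknown, TabDelimited, Sql }

interface ISignature
{
    // Returns a specific kind when the sample matches, otherwise Unknown.
    FileKind Evaluate(IReadOnlyList<string> sampleLines);
}

class TsvSignature : ISignature
{
    public FileKind Evaluate(IReadOnlyList<string> sampleLines)
    {
        if (sampleLines.Count == 0)
            return FileKind.Unknown;

        // Every sampled line must have the same field count as the header.
        int columns = sampleLines[0].Split('\t').Length;
        bool consistent = columns > 1 &&
            sampleLines.All(l => l.Split('\t').Length == columns);

        return consistent ? FileKind.TabDelimited : FileKind.Unknown;
    }
}

class SqlSignature : ISignature
{
    // Assumed heuristic: the first word of the file is a common SQL keyword.
    static readonly string[] Keywords =
        { "SELECT", "INSERT", "UPDATE", "DELETE", "CREATE", "WITH" };

    public FileKind Evaluate(IReadOnlyList<string> sampleLines)
    {
        string first = sampleLines.FirstOrDefault(l => l.Trim().Length > 0);
        if (first == null)
            return FileKind.Unknown;

        // Split on any whitespace and compare the first token exactly.
        string token = first.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)[0];
        bool match = Keywords.Any(k => string.Equals(k, token, StringComparison.OrdinalIgnoreCase));
        return match ? FileKind.Sql : FileKind.Unknown;
    }
}

static class FileClassifier
{
    static readonly ISignature[] Signatures = { new SqlSignature(), new TsvSignature() };

    // Returns the first non-Unknown verdict, trying signatures in the order listed.
    public static FileKind Classify(IEnumerable<string> lines, int sampleSize = 20)
    {
        List<string> sample = lines.Take(sampleSize).ToList();

        foreach (ISignature signature in Signatures)
        {
            FileKind kind = signature.Evaluate(sample);
            if (kind != FileKind.Unknown)
                return kind;
        }
        return FileKind.Unknown;
    }
}
```

Passing `File.ReadLines(path)` (from `System.IO`) to `FileClassifier.Classify` would classify a file on disk. The order of the signatures matters: here the SQL check runs first, so a tab-delimited file whose first header is literally a SQL keyword would be misclassified, and the keyword list would need tuning for real data.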

Licensed under: CC-BY-SA with attribution
