Question

I'm trying to parse a text file that has a heading and the body. In the heading of this file, there are line number references to sections of the body. For example:

SECTION_A 256
SECTION_B 344
SECTION_C 556

This means that SECTION_A starts at line 256.

What would be the best way to parse this heading into a dictionary and then read the sections when necessary?

Typical scenarios would be:

  1. Parse the header and read only section SECTION_B
  2. Parse the header and read the first paragraph of each section.

The data file is quite large and I definitely don't want to load all of it into memory and then operate on it.

I'd appreciate your suggestions. My environment is VS 2008 and C# 3.5 SP1.

Solution

You can store the name and line number in a dictionary, but on its own that's not going to do you much good.

It will tell you which line to start reading from, but the problem is: where in the file is that line? The only way to know is to start from the beginning and count lines.

The best way would be to write a wrapper that decodes the text contents (if you have encoding issues) and gives you a mapping from line number to byte position. You could then take a line number, say 256, look it up in a dictionary to learn that line 256 starts at, say, position 10000 in the file, and start reading from there.

Is this a one-off processing situation? If not, have you considered stuffing the entire file into a local database, like a SQLite database? That would allow you to have a direct mapping between line number and its contents. Of course, that file would be even bigger than your original file, and you'd need to copy data from the text file to the database, so there's some overhead either way.

OTHER TIPS

You can do this quite easily.

There are three parts to the problem.

1) How to find where a line in the file starts. The only way to do this is to read through the file once, keeping a list that records the position in the file at which each line starts, e.g.:

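List<long> lineMap = new List<long>();
lineMap.Add(0);    // Line 0 starts at location 0 in the data file (just a dummy entry)
lineMap.Add(0);    // Line 1 starts at location 0 in the data file

// StreamReader reads ahead in blocks, so sr.BaseStream.Position does not match the line just read.
// Scan the raw bytes instead (assumes an encoding such as ASCII or UTF-8, where '\n' is one byte).
using (FileStream fs = File.OpenRead("DataFile.txt"))
{
    byte[] buffer = new byte[65536];
    long offset = 0;   // File position of buffer[0]
    int read;
    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
            if (buffer[i] == (byte)'\n')
                lineMap.Add(offset + i + 1);   // The next line starts right after the '\n'
        offset += read;
    }
}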

2) Read and parse your index file into a dictionary.

Dictionary<string, int> index = new Dictionary<string, int>();

using (StreamReader sr = new StreamReader("IndexFile.txt")) 
{
    String line;
    while ((line = sr.ReadLine()) != null)
    {
        string[] parts = line.Split(' ');  // Break the line into the name & line number
        index.Add(parts[0], Convert.ToInt32(parts[1]));
    }
}

Then to find a line in your file, use:

int lineNumber = index["SECTION_B"];          // Convert section name into the line number
long offsetInDataFile = lineMap[lineNumber];  // Convert line number into file offset

Then open a new FileStream on DataFile.txt, call Seek(offsetInDataFile, SeekOrigin.Begin) to move to the start of the line, and wrap the stream in a new StreamReader (as above) to read line(s) from it.
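
The seek-and-read step might look like the sketch below. SectionReader, ReadLinesAt and maxLines are hypothetical names, not part of the answer above. The sketch assumes that offset is the byte position of the first byte of a line (as stored in lineMap) and that the file is ASCII or UTF-8.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SectionReader
{
    // Hypothetical helper: reads up to maxLines lines starting at a byte offset.
    // Assumes the offset points at the first byte of a line.
    public static List<string> ReadLinesAt(string path, long offset, int maxLines)
    {
        List<string> lines = new List<string>();
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);

            // Create the reader only after seeking, so its buffer starts at the offset.
            using (StreamReader sr = new StreamReader(fs, Encoding.UTF8))
            {
                string line;
                while (lines.Count < maxLines && (line = sr.ReadLine()) != null)
                    lines.Add(line);
            }
        }
        return lines;
    }
}
```

With the heading from the question, SECTION_B starts at line 344 and SECTION_C at line 556, so reading all of SECTION_B would mean passing lineMap[344] as the offset and 556 - 344 = 212 as maxLines.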

Just read the file one line at a time and ignore the lines until you get to the ones you need. You won't have any memory issues, but performance probably won't be great. You can do this easily in a background thread though.
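
A minimal sketch of this skip-and-read approach, assuming 1-based line numbers as in the heading. LineScanner and ReadLineRange are hypothetical names chosen for illustration. Only one line is held in memory at a time, and reading stops as soon as the requested range has been passed.

```csharp
using System.Collections.Generic;
using System.IO;

public static class LineScanner
{
    // Hypothetical helper: lazily yields lines firstLine..lastLine (1-based, inclusive).
    public static IEnumerable<string> ReadLineRange(string path, int firstLine, int lastLine)
    {
        using (StreamReader sr = new StreamReader(path))
        {
            string line;
            int lineNumber = 0;
            while ((line = sr.ReadLine()) != null)
            {
                lineNumber++;
                if (lineNumber < firstLine)
                    continue;       // Still before the requested range: discard the line
                if (lineNumber > lastLine)
                    yield break;    // Past the requested range: stop reading the file
                yield return line;
            }
        }
    }
}
```

For example, foreach (string line in LineScanner.ReadLineRange("DataFile.txt", 344, 555)) would walk through SECTION_B from the question's heading.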

Read the file until the end of the header, assuming you know where that is. Split the strings you've stored on whitespace, like so:

Dictionary<string, int> sectionIndex = new Dictionary<string, int>();
List<string> headers = new List<string>(); // fill these with readline

foreach(string header in headers) {
    var s = header.Split(new[]{' '});
    sectionIndex.Add(s[0], Int32.Parse(s[1]));
}

Find the dictionary entry you want, keep a count of the number of lines read in the file, and loop until you hit that line number, then read until you reach the next section's starting line. Dictionary<TKey, TValue> does not guarantee the order of its keys, so you can't rely on enumeration order to find the next section; you'd need to know both the current and the next section's names, or search the values for the next starting line.

Be sure to do some error checking to make sure the section you're reading to isn't before the section you're reading from, and any other error cases you can think of.
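
One way to find where a section ends without depending on dictionary order is to search for the smallest starting line greater than the current one. The sketch below assumes the heading has already been parsed into a name-to-line dictionary like sectionIndex. SectionRanges and GetRange are hypothetical names.

```csharp
using System.Collections.Generic;

public static class SectionRanges
{
    // Hypothetical helper: finds the first line of a section and the first line
    // of the section that follows it. endLine is int.MaxValue for the last
    // section, meaning "read to the end of the file".
    public static void GetRange(Dictionary<string, int> sectionIndex, string name,
                                out int startLine, out int endLine)
    {
        if (!sectionIndex.TryGetValue(name, out startLine))
            throw new KeyNotFoundException("Unknown section: " + name);

        endLine = int.MaxValue;
        foreach (KeyValuePair<string, int> entry in sectionIndex)
        {
            // The next section is the one with the smallest start line above ours.
            if (entry.Value > startLine && entry.Value < endLine)
                endLine = entry.Value;
        }
    }
}
```

Because endLine is only ever set to a value greater than startLine, the section being read to can never come before the section being read from.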

You could read line by line until all the heading information is captured and stop (assuming all section pointers are in the heading). You would have the section and line numbers for use in retrieving the data at a later time.

string dataRow;

try
{
    using (TextReader tr = new StreamReader("filename.txt"))
    {
        // ReadLine returns null at end of file, so stop there as well
        while ((dataRow = tr.ReadLine()) != null)
        {
            // The heading ends at the first line that is not a section entry
            if (!dataRow.StartsWith("SECTION_", StringComparison.Ordinal))
                break;

            // Parse line for section code and line number and log values
        }
    }
}
catch (Exception ex)
{
    MessageBox.Show(ex.Message);
}
Licensed under: CC-BY-SA with attribution