Question

*Updated with suggestions, but still taking time: it started at 7 days of processing and is now down to 2.5 days. The DataTableAdapter access accounts for most of it.*

I'm a newbie but an intensive researcher on Stack Overflow; even so, I couldn't find any answer that fits my problem.

I have 80 files, each with 200,000 lines, with only loose 'standards' or TAGs in the format.

I've been able to search through each file, line by line. I replaced an IF-ELSE with a SWITCH-CASE (it improved performance, thanks to the Stack Overflow forum) and moved the intensive work into another thread (again, credit to Stack Overflow users).

Even so, I'm getting 95 minutes per file, which adds up to 2.5 days of text processing, and when deployed I get a hanging GUI (in debug it's okay).

The txt files follow this standard, with a variable number of lines:

BARR; --> that's the first tag

184071; ALAMEDOS ; 518042,100; 922453,700; --> that's the valid information I want

The TAGs are (each a full line in the txt): SE; -> CKT; -> BARR; -> TRECH; -> CAP; -> INST; -> KEY; -> REG; -> ET;xxxx; -> EP;xxxx; -> DMD; -- but a file can skip some tags without notice, which is why I'm testing line by line.

My problem: 2.5 days of intensive processing (critical); a hanging GUI after deployment (not as bad, could be solved later).

(Thanks in advance!)

My WinForms click handler, calling the thread and the BackgroundWorker with the intensive work (I tried to trim it down because it's lengthy):

private void Button_extract_element_Click(object sender, EventArgs e)
 {
     TestObject test = new TestObject();
     test._shouldstop = true;
     backgroundWorker1.RunWorkerAsync(test);
     int passes = 0;

     Label_extract_element.Text = "wait processing....";
     Label_extract_element.Refresh();
     Label_extract_element.Update();

     //this should keep winform waiting for thread-return, showing passes
     while (test._shouldstop)
     {
        passes++;
        Label_extract_element.Text = "wait processing...." + passes;
        Label_extract_element.Refresh();
        Label_extract_element.Update();
     }
     Label_extract_element.Text = " OK, done!";
     Label_extract_element.Refresh();
     Label_extract_element.Update();
 } //End of Button_extract_element_Click

 class TestObject
    {
    public bool _shouldstop { get; set; }
    }   

 //backgroundWorker complete actions 
 private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
    {
        // Receive the result from DoWork, and display it.
        TestObject test = e.Result as TestObject;
    }

 private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
 {
     TestObject argumentTest = e.Argument as TestObject;
     argumentTest._shouldstop = true;
     string loop = "";
     string[] ListOfFilesinDir = Directory.GetFiles(GlobalVariables.folder, "*.txt").Select(Path.GetFileName).ToArray();

     foreach (string filename in ListOfFilesinDir)
     {
        int count_barr = 0;
        int count_lines = 0;
        //ReadAll seems to process really fast - not a gap
        string[] FLines = File.ReadAllLines(GlobalVariables.folder + "\\" + filename);

        int[] line_barr = new int[FLines.Count()];

        foreach (string Lines in FLines)
        {
        count_lines++;
        switch (Lines)
        {
           case "SE;":
           GlobalVariables.SEstr = FLines[count_lines].Split(';')[3].Trim();
           break;

           case "CKT;":
           GlobalVariables.codCktAL = FLines[count_lines].Split(';')[2].Trim();
           GlobalVariables.nomeCktAL = FLines[count_lines].Split(';')[10].Trim();
           GlobalVariables.nomeArqv = filename;
           break;

           case "BARR;": loop = "BARR;"; break;
           case "TRECH;": loop = "TRECH;"; break;
           case "CAP;": loop = "CAP;"; break;
           case "INST;": loop = "INST;"; break;
           case "KEY;": loop = "KEY;"; break;
           case "REG;": loop = "REG;"; break;
           case "DMD;": 
              loop = "DMD;"; 
              GlobalVariables.TRAFO = (FLines[count_lines-8].Split(';')[1].Trim());
              break;
        }

        switch (loop)
        {
           // I'll post just the first case, so I dont go soooo long in this post..
           //This part seems to process really fast

           case "BARR;":
              GlobalVariables.parse_results = "";

              //take next line and test if is one of the nexts TAGs, and break loop:
              GlobalVariables.parse_results += FLines[count_lines];

              if (Equals(GlobalVariables.parse_results, "TRECH;") || Equals(GlobalVariables.parse_results, "CAP;") || Equals(GlobalVariables.parse_results, "INST;") || Equals(GlobalVariables.parse_results, "KEY;") || Equals(GlobalVariables.parse_results, "REG;") || Equals(GlobalVariables.parse_results.Split(';')[0], "ET") || Equals(GlobalVariables.parse_results.Split(';')[0], "EP"))
              {
                 GlobalVariables.parse_results = "";
                 loop = "";
                 break;
              }
              else  //initiates the extraction BARR just have 4 field in txt
              {
                 //save the number of the line to array for later reference
                 count_barr++;
                 line_barr[count_barr] = count_lines;
                 break;
              }
              case "TRECH;": /*repeat all in BARR loop for TRECH's 20 fields*/ break;
              case "CAP;": /*same repeat for different n fields*/ break;
              case "INST;": /*same repeat for different n fields*/ break;
              case "KEY;": /*same repeat for different n fields*/ break;
              case "REG;": /*same repeat for different n fields*/ break;
        } //end of switch
     } //end for each lines

     //Now the TAKING TIME: saving to database - take the line number reference stored

     for (int i = 1; i < (count_barr+1); i++)
     {
        double p0 = Convert.ToDouble(FLines[line_barr[i]].Split(';')[0].Trim());
        string p1 = FLines[line_barr[i]].Split(';')[1].Trim().ToString();
        double p2 = Convert.ToDouble(FLines[line_barr[i]].Split(';')[2].Trim());
        double p3 = Convert.ToDouble(FLines[line_barr[i]].Split(';')[3].Trim());
        barr_MTTableAdapter1.GRAVA(p0, p1, p2 , p3, GlobalVariables.SEstr, GlobalVariables.codCktAL, GlobalVariables.nomeCktAL, GlobalVariables.nomeArqv);
     } 
argumentTest._shouldstop = false;
e.Result = argumentTest;
}
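Note that the busy-wait `while (test._shouldstop)` loop above runs on the UI thread, which is the likely cause of the hanging GUI. A minimal sketch of the usual BackgroundWorker pattern (not the poster's code; the control names are reused from the snippet above) would update the label from `RunWorkerCompleted` instead of spinning:

```csharp
private void Button_extract_element_Click(object sender, EventArgs e)
{
    // Just start the worker and return; do NOT loop on the UI thread.
    Button_extract_element.Enabled = false;
    Label_extract_element.Text = "wait processing....";
    backgroundWorker1.RunWorkerAsync(new TestObject());
}

private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    // This event fires back on the UI thread after DoWork finishes,
    // so it is safe to touch the controls here.
    Label_extract_element.Text = " OK, done!";
    Button_extract_element.Enabled = true;
}
```

The UI stays responsive because the click handler returns immediately; `worker.ReportProgress` can be used for the "passes" counter if intermediate feedback is wanted.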

Solution 2

As stated in the question, this answer only applies to an MS Access database; if you use Oracle or SQL Server, just launch a bulk load.

Well, after a lot of contributions (see the comments above, especially from Voo) and a lot of Stack Overflow research, I was able to improve the performance from 7 days to 45 minutes for processing all 16 million lines, line by line.

The key, following the well-oriented tips from people in the comments, was to use DAO (with some caveats about deploying the database with ClickOnce; notice the dbName connection string).

A lot of useful information can be found here: StackOverflow-Writing-large-records

If you use an .accdb file, you need to switch from ADO to DAO: using DAO = Microsoft.Office.Interop.Access.Dao;
(It can be found in Visual Studio under Add Reference, COM type references; you need to add the Microsoft Office xx.x Access Database Engine Object Library. But remember that this imposes a significant limitation on your end users' machine specs.)

I noticed some improvements are still needed to store all iterations in DAO (BARR, TRECH, and so on), but that's code optimization, not the main issue in this post.

I don't know why .NET doesn't provide a bulk insert for MS Access.

With the code below, each file takes 0.3 seconds to pass the switch statements and 1.33 minutes for the DAO saving. Doing this for all 80 files takes 45 minutes.

 private void Button_extract_element_Click(object sender, EventArgs e)
 {
 TestObject test = new TestObject();
 test._shouldstop = true;
 backgroundWorker1.RunWorkerAsync(test);
 int passes = 0;

 Label_extract_element.Text = "wait processing....";
 Label_extract_element.Refresh();
 Label_extract_element.Update();

 //this should keep winform waiting for thread-return, showing passes
 while (test._shouldstop)
 {
    passes++;
    Label_extract_element.Text = "wait processing...." + passes;
    Label_extract_element.Refresh();
    Label_extract_element.Update();
 }
 Label_extract_element.Text = " OK, done!";
 Label_extract_element.Refresh();
 Label_extract_element.Update();
 } //End of Button_extract_element_Click

 class TestObject
 {
 public bool _shouldstop { get; set; }
 }   

 //backgroundWorker complete actions 
 private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
 {
    // Receive the result from DoWork, and display it.
    TestObject test = e.Result as TestObject;
 }

 private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
 {
 TestObject argumentTest = e.Argument as TestObject;
 argumentTest._shouldstop = true;
 string loop = "";
 string[] ListOfFilesinDir = Directory.GetFiles(GlobalVariables.folder, "*.txt").Select(Path.GetFileName).ToArray();

 foreach (string filename in ListOfFilesinDir)
 {
    int count_barr = 0;
    int count_lines = 0;
    //ReadAll seems to process really fast - not a gap
    string[] FLines = File.ReadAllLines(GlobalVariables.folder + "\\" + filename);

    int[] line_barr = new int[FLines.Count()];

    foreach (string Lines in FLines)
    {
    count_lines++;
    switch (Lines)
    {
       case "SE;":
       GlobalVariables.SEstr = FLines[count_lines].Split(';')[3].Trim();
       break;

       case "CKT;":
       GlobalVariables.codCktAL = FLines[count_lines].Split(';')[2].Trim();
       GlobalVariables.nomeCktAL = FLines[count_lines].Split(';')[10].Trim();
       GlobalVariables.nomeArqv = filename;
       break;

       case "BARR;": loop = "BARR;"; break;
       case "TRECH;": loop = "TRECH;"; break;
       case "CAP;": loop = "CAP;"; break;
       case "INST;": loop = "INST;"; break;
       case "KEY;": loop = "KEY;"; break;
       case "REG;": loop = "REG;"; break;
       case "DMD;": 
          loop = "DMD;"; 
          GlobalVariables.TRAFO = (FLines[count_lines-8].Split(';')[1].Trim());
          break;
    }

    switch (loop)
    {
       // I'll post just the first case, so I dont go soooo long in this post..
       //This part seems to process really fast

       case "BARR;":
          GlobalVariables.parse_results = "";

          //take next line and test if is one of the nexts TAGs, and break loop:
          GlobalVariables.parse_results += FLines[count_lines];

          if (Equals(GlobalVariables.parse_results, "TRECH;") || Equals(GlobalVariables.parse_results, "CAP;") || Equals(GlobalVariables.parse_results, "INST;") || Equals(GlobalVariables.parse_results, "KEY;") || Equals(GlobalVariables.parse_results, "REG;") || Equals(GlobalVariables.parse_results.Split(';')[0], "ET") || Equals(GlobalVariables.parse_results.Split(';')[0], "EP"))
          {
             GlobalVariables.parse_results = "";
             loop = "";
             break;
          }
          else  
          {
             //store the number of the line to array for later reference
             count_barr++;
             line_barr[count_barr] = count_lines;
             break;
          }
          case "TRECH;": /*repeat all in BARR loop for TRECH's 20 fields*/ break;
          case "CAP;": /*same repeat for different n fields*/ break;
          case "INST;": /*same repeat for different n fields*/ break;
          case "KEY;": /*same repeat for different n fields*/ break;
          case "REG;": /*same repeat for different n fields*/ break;
    } //end of switch
 } //end for each lines

string dbName = Application.StartupPath + "\\Resources";
DAO.DBEngine dbEngine = new DAO.DBEngine();
DAO.Database db = dbEngine.OpenDatabase(dbName+"\\DataBase.accdb");

// From here, could work more to store different Tables with different fields, dynamically, improving code

DAO.Recordset rs = db.OpenRecordset("BARRA_MT");   

for (int i = 1; i < (count_barr+1); i++)
{
   rs.AddNew();
   double b0 = Convert.ToDouble(FLines[line_barr[i]].Split(';')[0].Trim());
   string b1 = FLines[line_barr[i]].Split(';')[1].Trim().ToString();
   double b2 = Convert.ToDouble(FLines[line_barr[i]].Split(';')[2].Trim());
   double b3 = Convert.ToDouble(FLines[line_barr[i]].Split(';')[3].Trim());
   rs.Fields["BARR_MT"].Value = b0;
   rs.Fields["COD"].Value = b1;
   rs.Fields["X"].Value = b2;
   rs.Fields["Y"].Value = b3;
   rs.Update();
}
rs.Close();     
db.Close();

argumentTest._shouldstop = false;
e.Result = argumentTest;
} //end
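One further DAO speedup worth trying (an assumption here, not something measured in this post) is wrapping the AddNew/Update loop in a single transaction, so the Jet engine flushes once per file instead of once per record, and splitting each line only once:

```csharp
// Sketch: the same recordset loop as above, wrapped in a transaction.
// dbEngine and rs are the DAO.DBEngine and DAO.Recordset opened earlier.
dbEngine.BeginTrans();
try
{
    for (int i = 1; i < (count_barr + 1); i++)
    {
        // Split the line once instead of four times.
        string[] fields = FLines[line_barr[i]].Split(';');
        rs.AddNew();
        rs.Fields["BARR_MT"].Value = Convert.ToDouble(fields[0].Trim());
        rs.Fields["COD"].Value = fields[1].Trim();
        rs.Fields["X"].Value = Convert.ToDouble(fields[2].Trim());
        rs.Fields["Y"].Value = Convert.ToDouble(fields[3].Trim());
        rs.Update();
    }
    dbEngine.CommitTrans();
}
catch
{
    dbEngine.Rollback(); // discard the partial batch on failure
    throw;
}
```

BeginTrans/CommitTrans/Rollback on the default workspace are part of the DAO object model; whether they help here would need to be measured against the 1.33 minutes per file above.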

OTHER TIPS

Your answer is still problematic. Use the following example to avoid reading all lines into memory:

string line;
int counter = 0;
using (System.IO.StreamReader file =
   new System.IO.StreamReader("c:\\test.txt"))
{
   while ((line = file.ReadLine()) != null)
   {
      Console.WriteLine(line);
      counter++;
   }
}

There is no need for a tiny file like yours to take that long. I process files with about half a billion events (granted, those are binary coded, but way more than your 200,000 lines) in minutes. You're wasting a lot of time by doing things like allocating an array of all the lines instead of reading the files line by line.
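As a variation on the StreamReader example, `File.ReadLines` (unlike `File.ReadAllLines` used in the question) enumerates the file lazily, so only one line is in memory at a time; a minimal sketch, with "c:\\test.txt" as a placeholder path:

```csharp
using System;
using System.IO;

class Program
{
    static void Main()
    {
        int counter = 0;
        // File.ReadLines streams lazily; File.ReadAllLines would first
        // allocate a string[] holding the entire file.
        foreach (string line in File.ReadLines("c:\\test.txt"))
        {
            Console.WriteLine(line);
            counter++;
        }
        Console.WriteLine(counter);
    }
}
```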

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow