Question

I have been given a very large Excel workbook that contains 500+ sheets. Each sheet represents a store, and its rows contain that store's transactions. Every sheet has an identical layout. I have been asked to write a program that loops through each sheet, pulls specific transaction data, and writes everything to one gigantic CSV file.

I know that this kind of functionality is far better suited to a relational database, but I have been asked to work with the workbook as-is.

I have written a program that successfully parses the data and writes it. The problem is that it takes almost half an hour to complete the file write when reading and writing the data synchronously.

I would like to accomplish this task by reading and writing the data from each sheet asynchronously. In C#, I would prefer to use the Task Parallel Library for this, but I am open to other options.

I am thinking about spinning off the worker threads from a foreach loop, like so:

foreach (Worksheet ws in _excelApp.Worksheets)
{
    Parallel.Invoke(() => ExportWorksheet(ws));
}

And then in the method (shortened for brevity):

private void ExportWorksheet(Worksheet ws)
{
    // fi (defined elsewhere) points to the shared output file
    using (FileStream fs = new FileStream(fi.FullName, FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
    {
        for (int row = 1; row < 300; row++)
        {
            for (int column = 1; column < 20; column++)
            {
                byte[] bytes = Encoding.ASCII.GetBytes(ws.Cells[row, column].Value.ToString() + ",");
                fs.Write(bytes, 0, bytes.Length);
            }

            byte[] newline = Encoding.ASCII.GetBytes("\n");
            fs.Write(newline, 0, newline.Length);
        }
    }
}

This gives me strange results, of course.

Am I on the right track? Should I be using a different encoding? Is there a cleaner way to accomplish the async write? Are there any threading rules being broken here?

All suggestions are welcome. Thanks for the help.


Solution

Instead of looping through the rows and columns, you are better off using the Value property of a range (for example the UsedRange of a Worksheet). It comes back as a two-dimensional array containing all the data, so the whole sheet is read in a single call. This improves reading performance by a factor of 1000.
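
Roughly, the bulk read looks like this (assuming the Microsoft.Office.Interop.Excel namespace and a ws variable for the current worksheet, as in your loop; the variable names are only illustrative):

object[,] values = (object[,])ws.UsedRange.Value2;  // one COM call for the whole sheet

// The array is 1-based in both dimensions. Note that for a single-cell
// range Value2 returns a scalar instead of an array, so guard for that.
int rows = values.GetLength(0);
int columns = values.GetLength(1);
string firstCell = Convert.ToString(values[1, 1]);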

For the writing part, I rewrote it in two versions, removing the Excel references:

        DateTime start = DateTime.Now;

        //using (FileStream fs = new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
        //{
        //    for (int row = 1; row < 3 * 1000; row++)
        //    {
        //        for (int column = 1; column < 3 * 1000; column++)
        //        {
        //            byte[] bytes = Encoding.ASCII.GetBytes(1.ToString() + ",");
        //            fs.Write(bytes, 0, bytes.Length);
        //        }

        //        byte[] bytes2 = Encoding.ASCII.GetBytes("\n");
        //        fs.Write(bytes2, 0, bytes2.Length);
        //    }
        //}

        using (TextWriter tw = new StreamWriter(new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true)))
        {
            for (int row = 1; row < 3 * 1000; row++)
            {
                for (int column = 1; column < 3 * 1000; column++)
                {
                    tw.Write(1.ToString());
                    tw.Write(',');
                }

                tw.WriteLine();
            }
        }

        DateTime end = DateTime.Now;

        MessageBox.Show(string.Format("Time spent: {0:N0} ms.", (end - start).TotalMilliseconds));

The first part (which is almost identical to your code, and is now commented out) takes 3,670 seconds (yes, over three thousand), largely because the FileStream was created with a 1-byte buffer, so every tiny Write goes straight to the file. The second part (not commented out) takes 12 seconds; the StreamWriter buffers the characters internally and flushes them to the underlying stream in much larger chunks.

OTHER TIPS

My experience with reading Excel from C# is, generally, nasty. All your computing time is spent trafficking with Excel; writing out the CSV takes no time at all, so it isn't worth bothering with separate threads.
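
To illustrate, here is a rough single-threaded sketch that combines the bulk range read with one buffered writer. _excelApp is the Application object from the question, the output path is made up, and quoting of fields that contain commas or quotes is left out:

using (var writer = new StreamWriter(@"C:\temp\all_stores.csv"))
{
    foreach (Worksheet ws in _excelApp.Worksheets)
    {
        // Bulk read: one COM call per sheet instead of one per cell.
        object[,] values = (object[,])ws.UsedRange.Value2;

        for (int row = 1; row <= values.GetLength(0); row++)
        {
            var fields = new string[values.GetLength(1)];
            for (int column = 1; column <= values.GetLength(1); column++)
            {
                fields[column - 1] = Convert.ToString(values[row, column]);
            }

            writer.WriteLine(string.Join(",", fields));
        }
    }
}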

In some cases I have simply saved the spreadsheet as .csv and then parsed it from there. How this works with multiple sheets I don't know, but you might be able to page through the sheets, saving them to .csv files one by one, and then read those files back in as long strings and clean them up.
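
Something along these lines might do it, though I have not verified how Excel behaves when worksheets are repeatedly saved in CSV format (each SaveAs renames the open workbook, and you will want to close it afterwards without saving); the folder and file names are made up:

int sheetNumber = 1;
foreach (Worksheet ws in _excelApp.Worksheets)
{
    // XlFileFormat.xlCSV writes only the data of the sheet being saved.
    ws.SaveAs(@"C:\temp\store" + sheetNumber + ".csv", XlFileFormat.xlCSV);
    sheetNumber++;
}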

Licensed under: CC-BY-SA with attribution