Processing Excel files with a data reader: ExecuteReader() buffers the entire file

https://stackoverflow.com/questions/20243111

05-08-2022

Question

I'm running into a peculiar issue when trying to process large Excel files (300 MB+) using a data reader. The following code shows how I open the Excel file and iterate over the rows in sheet 'largesheet$':

// Requires: using System.Data; using System.Data.OleDb;
const string inputFilePath = @"C:\largefile.xlsx";
const string connectionString =
    "Provider=Microsoft.ACE.OLEDB.12.0;Extended Properties=\"Excel 12.0;IMEX=1;HDR=YES;\";Data Source=" +
    inputFilePath;

// Initialize connection (disposed, and therefore closed, when the using block ends)
using (var connection = new OleDbConnection(connectionString))
{
    // Open connection
    connection.Open();

    // Configure command
    using (var command = new OleDbCommand("largesheet$", connection) {CommandType = CommandType.TableDirect})
    // Execute reader
    using (var reader = command.ExecuteReader()) // <-- Completely loads file/sheet into memory
    {
        // Iterate results: Read() advances one row and returns false after the last one.
        // (Looping on reader.HasRows never terminates, because HasRows stays true.)
        while (reader.Read())
        {
            // ...
        }
    }
}

As I understand it, this should open the Excel file and load each row only when reader.Read() asks for it.
However, ExecuteReader() appears to do more than return an OleDbDataReader instance. Stepping through with breakpoints, I found that this one statement takes 30+ seconds, and the Windows Resource Monitor shows allocated memory rising steadily while it executes.
Passing a CommandBehavior argument (e.g. SequentialAccess) to ExecuteReader() has no effect.

What am I doing wrong here? Are there alternative ways of processing large (Excel) files?

Note: the IMEX & HDR extended properties of the connection string are intentional.

Edit: After some rational thinking I assume it is not possible to process an Excel file without buffering it one way or another. Since Excel files are basically a glorified collection of compressed XML files, a worksheet cannot be processed without decompressing it (and either keeping it in RAM or temporarily saving it to disk).
The only alternative I can think of is using Microsoft.Office.Interop.Excel. I am not sure how OpenXML handles it, though.
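
Since an .xlsx file is a ZIP archive of XML parts, the decompression does not necessarily have to happen all at once. The following is a minimal sketch, not taken from the question or the answer, that reads one worksheet part directly with ZipArchive and XmlReader and counts its rows while the data is inflated on the fly. The class name XlsxRowCounter and the method CountRows are hypothetical, and the default part name "xl/worksheets/sheet1.xml" is an assumption: it is only the conventional name of the first sheet, and a real workbook may use a different one.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class XlsxRowCounter
{
    // Main SpreadsheetML namespace used by worksheet parts.
    private const string SheetNs =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Counts <row> elements in one worksheet part without building a DOM.
    // entryName is an assumption (conventional name of the first sheet).
    public static long CountRows(Stream xlsx, string entryName = "xl/worksheets/sheet1.xml")
    {
        using (var archive = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null)
                throw new FileNotFoundException("Worksheet part not found: " + entryName);

            long rows = 0;
            // entry.Open() returns a stream that decompresses as it is read.
            using (Stream sheet = entry.Open())
            using (XmlReader reader = XmlReader.Create(sheet))
            {
                // Forward-only pass over the XML, one node at a time.
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == "row" &&
                        reader.NamespaceURI == SheetNs)
                    {
                        rows++;
                    }
                }
            }
            return rows;
        }
    }
}
```

It could be called with a seekable stream such as File.OpenRead(@"C:\largefile.xlsx"). The sketch only counts rows; string cells in a worksheet usually hold indexes into xl/sharedStrings.xml, which this code does not resolve.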

Solution

From MSDN: "All rows and columns of the named table or tables will be returned when you call one of the Execute methods of a Command object." (under the Remarks section). That remark describes commands whose CommandType is TableDirect, as in the question, so returning the whole sheet would appear to be the expected behavior of ExecuteReader() here.

ExecuteReader(CommandBehavior) may give you more options, particularly when CommandBehavior is set to SequentialAccess, though you would then need to handle reading at the byte level (for example with GetBytes or GetChars).
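
To illustrate what reading at that level might look like: with CommandBehavior.SequentialAccess, columns are read in ordinal order and a large value can be pulled in chunks through GetChars (or GetBytes for binary data). The helper below is a hypothetical sketch, not code from MSDN or from the question; the names SequentialReadSketch and CopyText are made up for illustration. Whether this reduces the up-front cost of ExecuteReader() with the ACE provider is not established, and the question reports that SequentialAccess made no difference there.

```csharp
using System;
using System.Data.Common;
using System.IO;

public static class SequentialReadSketch
{
    // Copies one text column of the current row to 'output' in fixed-size
    // chunks, so the full value is never held in a single string.
    // Returns the number of characters copied.
    public static long CopyText(DbDataReader reader, int ordinal, TextWriter output, int chunkSize = 4096)
    {
        var buffer = new char[chunkSize];
        long offset = 0;
        long read;

        // GetChars returns how many characters it copied; 0 means the
        // end of the value has been reached.
        while ((read = reader.GetChars(ordinal, offset, buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, (int)read);
            offset += read;
        }
        return offset;
    }
}
```

Inside the read loop it could be used as SequentialReadSketch.CopyText(reader, 0, Console.Out), with the reader obtained from command.ExecuteReader(CommandBehavior.SequentialAccess).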

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow