Question

I have a .NET application written in C# (.NET 4.0). In this application, we have to read a large dataset from a file and display its contents in a grid-like structure. To accomplish this, I placed a DataGridView on the form. It has three columns, and all of the column data comes from the file. Initially, the file had about 600,000 records, corresponding to 600,000 lines in the DataGridView.

I quickly found out that the DataGridView collapses under such a large data set, so I had to switch to Virtual Mode. To accomplish this, I first read the file completely into three different arrays (corresponding to the three columns), and then, as the CellValueNeeded event fires, I supply the correct values from the arrays.

However, as we quickly found out, there can be a huge (HUGE!) number of records in this file. When the record count is very large, reading all the data into an array or a List<> is simply not feasible: we quickly run into memory allocation errors (OutOfMemoryException).

We got stuck there, but then realized: why read all the data into arrays first, why not read the file on demand as the CellValueNeeded event fires? So that's what we do now: we open the file but do not read anything, and as CellValueNeeded events fire, we first Seek() to the correct position in the file and then read the corresponding data.
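In sketch form, the current handler is essentially the following (a minimal sketch, assuming fixed-length records; RecordSize, the three 16-byte ASCII fields, and ParseField are placeholders for illustration, not our actual format):

```csharp
using System.IO;
using System.Text;
using System.Windows.Forms;

// Inside the form class:
private FileStream _file;                    // opened once at startup, kept open
private const int RecordSize = 48;           // hypothetical fixed record length
private readonly byte[] _record = new byte[RecordSize];

private void dataGridView1_CellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
{
    // One Seek() + Read() per cell request: simple, but one disk hit per cell.
    _file.Seek((long)e.RowIndex * RecordSize, SeekOrigin.Begin);
    _file.Read(_record, 0, RecordSize);
    e.Value = ParseField(_record, e.ColumnIndex);
}

private static string ParseField(byte[] record, int column)
{
    // Hypothetical layout: three fixed-width 16-byte ASCII fields per record.
    return Encoding.ASCII.GetString(record, column * 16, 16).TrimEnd();
}
```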

This is the best we could come up with, but, first of all, it is quite slow, which makes the application sluggish and not user-friendly. Second, we can't help but think that there must be a better way to accomplish this. For example, some binary editors (like HxD) are blindingly fast for any file size, so I'd like to know how that can be achieved.

Oh, and to add to our problems: in the DataGridView's virtual mode, when we set the RowCount to the number of rows available in the file (say 16,000,000), it takes a while for the DataGridView even to initialize itself. Any comments on this 'problem' would be appreciated as well.

Thanks


Solution

If you can't fit your entire data set in memory, then you need a buffering scheme. Rather than reading just the amount of data needed to fill the DataGridView in response to CellValueNeeded, your application should anticipate the user's actions and read ahead. So, for example, when the program first starts up, it should read the first 10,000 records (or maybe only 1,000, or perhaps 100,000; whatever is reasonable in your case). Then, CellValueNeeded requests can be filled immediately from memory.
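A minimal sketch of that buffering idea, assuming fixed-length records so a page's file offset is computable (PageSize, RecordSize, and the three 16-byte ASCII fields are illustrative assumptions, not part of this answer):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Inside the form class:
private FileStream _file;
private const int RecordSize = 48;           // hypothetical fixed record length
private const int PageSize = 10000;          // records per buffered page (tune to taste)
private readonly Dictionary<int, string[][]> _pages = new Dictionary<int, string[][]>();

// Called from CellValueNeeded; after the first hit, a page is served from memory.
private string GetCell(int rowIndex, int columnIndex)
{
    int pageIndex = rowIndex / PageSize;
    string[][] page;
    if (!_pages.TryGetValue(pageIndex, out page))
    {
        page = ReadPage(pageIndex);          // one Seek() + one large Read()
        _pages[pageIndex] = page;            // a real cache would also evict old pages
    }
    return page[rowIndex % PageSize][columnIndex];
}

private string[][] ReadPage(int pageIndex)
{
    var bytes = new byte[PageSize * RecordSize];
    _file.Seek((long)pageIndex * PageSize * RecordSize, SeekOrigin.Begin);
    int got = _file.Read(bytes, 0, bytes.Length);        // the last page may be short
    var page = new string[PageSize][];
    for (int i = 0; (i + 1) * RecordSize <= got; i++)
        page[i] = new[]
        {
            Encoding.ASCII.GetString(bytes, i * RecordSize,      16).TrimEnd(),
            Encoding.ASCII.GetString(bytes, i * RecordSize + 16, 16).TrimEnd(),
            Encoding.ASCII.GetString(bytes, i * RecordSize + 32, 16).TrimEnd()
        };
    return page;
}
```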

As the user moves through the grid, your program stays, as much as possible, one step ahead of the user. There might be short pauses if the user jumps ahead of you (say, jumps to the end from the front) and you have to go out to disk to fulfill the request.

That buffering is usually best accomplished by a separate thread, although synchronization can sometimes be an issue if the thread is reading ahead in anticipation of the user's next action and the user then does something completely unexpected, like jumping to the start of the list.
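One way to sketch the read-ahead on top of the page cache above (ThreadPool and a simple lock are my assumptions, not a prescribed design):

```csharp
using System.Threading;

private readonly object _sync = new object();

// Call this after serving page N, e.g. PrefetchPage(pageIndex + 1), so the
// next page is usually already cached by the time the user scrolls into it.
private void PrefetchPage(int pageIndex)
{
    ThreadPool.QueueUserWorkItem(state =>
    {
        string[][] page = ReadPage(pageIndex);   // ReadPage from the sketch above
        lock (_sync)                             // the cache is shared with the UI thread
        {
            if (!_pages.ContainsKey(pageIndex))
                _pages[pageIndex] = page;
        }
    });
}
```

Note that once a prefetch thread is in play, GetCell would need the same lock around the cache, and a real implementation would either lock around the FileStream or give each thread its own stream, since FileStream itself is not thread-safe.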

16 million records isn't really all that many to keep in memory, unless the records are very large, or you don't have much memory on your server. Certainly, 16 million is nowhere near the maximum size of a List<T>, although if T is a large value type (structure), the backing array can run into the runtime's 2 GB per-object limit. How many gigabytes of data are you talking about here?

OTHER TIPS

Well, here's a solution that appears to work much better:

Step 0: Set dataGridView.RowCount to a low value, say 25 (or the actual number that fits in your form/screen)

Step 1: Disable the scrollbar of the dataGridView.

Step 2: Add your own scrollbar.

Step 3: In your CellValueNeeded routine, respond with the data for row e.RowIndex + scrollBar.Value.

Step 4: As for the data store, I currently open a Stream, and in the CellValueNeeded routine, first do a Seek() and then Read() the required data, as sketched below.
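Put together, a minimal sketch of steps 0-4 (vScrollBar1 is a VScrollBar dropped next to the grid; VisibleRows, TotalRows, and ReadCellFromFile are illustrative names, not from the original code):

```csharp
using System;
using System.Windows.Forms;

// Inside the form class:
private const int VisibleRows = 25;          // step 0: one screenful of rows
private long TotalRows;                      // total record count, e.g. fileSize / recordSize

private void Form1_Load(object sender, EventArgs e)
{
    dataGridView1.VirtualMode = true;
    dataGridView1.RowCount = VisibleRows;                 // the grid never holds more
    dataGridView1.ScrollBars = ScrollBars.None;           // step 1

    vScrollBar1.Minimum = 0;                              // step 2
    vScrollBar1.Maximum = (int)(TotalRows - VisibleRows); // caveat: int range; the
                                                          // reachable max also depends
                                                          // on LargeChange
    vScrollBar1.ValueChanged += delegate { dataGridView1.Invalidate(); };
}

private void dataGridView1_CellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
{
    long fileRow = e.RowIndex + vScrollBar1.Value;        // step 3: offset by scrollbar
    e.Value = ReadCellFromFile(fileRow, e.ColumnIndex);   // step 4: Seek() + Read()
}

// Hypothetical: Seek() to fileRow * recordSize, Read() and parse the requested field.
private string ReadCellFromFile(long fileRow, int column) { throw new NotImplementedException(); }
```

Invalidating the grid when the scrollbar moves forces virtual mode to re-request the visible cells through CellValueNeeded, which is what makes the 25-row grid act like a window over the whole file.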

With these steps, I get very reasonable performance scrolling through the dataGridView for very large files (tested up to 0.8 GB).

So, in conclusion, it appears that the actual cause of the slowdown wasn't all the Seek()ing and Read()ing, but the DataGridView itself.

Managing rows and columns that can be rolled up, sub-totalled, used in multi-column calculations, etc., presents a unique set of challenges; it's not really fair to compare the problem to the ones an editor would encounter. Third-party data grid controls have been addressing the problem of displaying and manipulating large datasets client-side since the VB6 days. It's not a trivial task to get really snappy performance using either load-on-demand or self-contained, gargantuan client-side datasets. Load-on-demand can suffer from server-side latency; manipulating the entire dataset on the client can run into memory and CPU limits. Some third-party controls that support just-in-time loading supply both client-side and server-side logic, while others try to solve the problem 100% client-side.

Because .NET is layered on top of the native OS, runtime loading and management of data from disk to memory needs a different approach. See why and how: http://www.codeproject.com/Articles/38069/Memory-Management-in-NET

To deal with this issue, I would suggest not loading all the data at once. Instead, load the data in chunks and display the most relevant data when needed. I just did a quick test and found that setting the DataSource property of a DataGridView is a good approach, but with a large number of rows it also takes time. So use the Merge function of DataTable to load the data in chunks and show the user the most relevant data, as sketched below.
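A rough sketch of that idea (the original example isn't reproduced here; LoadChunk and TotalChunks are hypothetical placeholders): bind a small first chunk so the grid appears quickly, then Merge() the remaining chunks in.

```csharp
using System;
using System.Data;
using System.Threading;

// Inside the form class:
private DataTable _master;
private int TotalChunks;                       // hypothetical: chunk count of the source

private void LoadInChunks()
{
    _master = LoadChunk(0);                    // first chunk: bind and show right away
    dataGridView1.DataSource = _master;

    ThreadPool.QueueUserWorkItem(state =>
    {
        for (int chunk = 1; chunk < TotalChunks; chunk++)
        {
            DataTable next = LoadChunk(chunk); // same schema as _master
            // Merge on the UI thread, because the grid is bound to _master.
            Invoke(new Action(() => _master.Merge(next)));
        }
    });
}

// Hypothetical: reads chunk N of the source into a DataTable with the shared schema.
private DataTable LoadChunk(int chunkIndex) { throw new NotImplementedException(); }
```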

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow