Question

I have a vast quantity of data (>800 MB) that takes an age to load into MATLAB, mainly because it's split up into tiny files, each under 20 kB. They are all in a proprietary format which I can read and load into MATLAB; it's just that it takes so long.

I am thinking of reading the data in and writing it out to some sort of binary file, which should make subsequent reads (of which there may be many, hence the need for a speed-up) much quicker.

So, my question is: what would be the best format to write them to disk in, to make reading them back as quick as possible?

I guess I have the option of writing with fwrite, or just saving the variables from MATLAB. I think I'd prefer the fwrite option so that, if needed, I could read them from another package/language...

Solution

Look into the HDF5 data format, which recent versions of MATLAB use as the underlying format for .mat files (the -v7.3 format). You can manually create your own HDF5 files using the hdf5write function, and such a file can be accessed from any language that has HDF5 bindings (most common languages do, or at least offer a way to integrate C code that can call the HDF5 library).
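
For example, a minimal sketch (the file name mydata.h5 and the dataset path /dataset1 are made up here):

% Write a numeric array to an HDF5 file. Note that hdf5write is the
% older interface; newer MATLAB releases recommend h5create/h5write.
A = rand(100, 200);
hdf5write('mydata.h5', '/dataset1', A);

% Read it back (other languages can do the same via their HDF5 bindings).
B = hdf5read('mydata.h5', '/dataset1');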

If your data is numeric (and of the same datatype), you might find it hard to beat the performance of plain binary (fwrite).
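
As a rough sketch of that round trip (the file name mydata.bin and the variables X, nRows, nCols are placeholders):

% Write the matrix as raw doubles (column-major order).
fid = fopen('mydata.bin', 'w');
fwrite(fid, X, 'double');
fclose(fid);

% Read it back. fread returns a column vector, so the matrix
% dimensions must be known (or stored in the file) to restore the shape.
fid = fopen('mydata.bin', 'r');
X2 = fread(fid, inf, 'double');
fclose(fid);
X2 = reshape(X2, nRows, nCols);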

OTHER TIPS

Binary MAT-files are the fastest. Just use

save myfile.mat <var_a> <var_b> ...
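
...and read them back with a matching load call, e.g.:

% Load everything from the file, or name specific variables to load.
load myfile.mat
% load('myfile.mat', 'var_a')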

I achieved an amazing speed-up in loading when I used the '-v6' option to save the .mat files, like so:

save(matlabTrainingFile, 'Xtrain', 'ytrain', '-v6'); 

Here are the sizes of the matrices that I used in my test ...

Attr Name                   Size                     Bytes  Class
==== ====                   ====                     =====  ===== 
  g  Xtest               1430x4000                45760000  double
  g  Xtrain              3411x4000               109152000  double
  g  Xval                1370x4000                43840000  double
  g  ytest               1430x1                      11440  double
  g  ytrain              3411x1                      27288  double
  g  yval                1370x1                      10960  double

... and the performance improvements that I achieved:

Before the change:

time to load the training data: 78 SECONDS!!! 
time to load validation data:   32
time to load the test data:     35

After the change:

time to load the training data: 0 SECONDS!!!
time to load validation data:   0
time to load the test data:     0

Apparently the reason this works so well is that the old version 6 format uses less compression than the newer versions. So your file sizes will be bigger, but they will load WAY faster.
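
If you want to reproduce the comparison, a minimal timing sketch (the file names training_default.mat and training_v6.mat are hypothetical) is:

% Compare load times for the same variables saved in the two formats.
tic; load('training_default.mat'); toc   % default (compressed) format
tic; load('training_v6.mat');      toc   % -v6 (less compression)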

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow