MATLAB: Saving several variables to “-v7.3” (HDF5) .mat-files seems to be faster when using the “-append” flag. How come?

StackOverflow https://stackoverflow.com/questions/4949939

Question

NOTE: This question deals with an issue observed back in 2011 with an old MATLAB version (R2009b). As per the update from July 2016 at the end of the question, the issue/bug in MATLAB seems to no longer exist (tested with R2016a).

I am using MATLAB R2009b and need to write a large script that converts the contents of a large set of .zip files to v7.3 mat-files (with an underlying HDF5 data model). Reading is OK. The issue is with saving. Strictly speaking there is no problem: my files save nicely using the save command.

My question is more in the sense: Why am I observing the following surprising (for me) behavior in MATLAB?

Let's look at my issue in general. In the current test scenario I will be generating one output: a -v7.3 mat-file. This .mat-file will contain 40 blocks as individual variables, each named block_NNN (numbered 001 to 040) and containing a struct with fields frames and blockNo. The field frames contains a 480x240x65 sequence of uint8 image data (here just random data generated using randi); the field blockNo contains the block number.

Remark: In the real script (that I have yet to finish) I will be doing the above a total of 370 times, converting a total of 108 GB of raw data. Which is why I am concerned with the following.

Anyway, first I define some general variables:

% some sizes for dummy data and loops:
num_blockCount = 40;
num_blockLength = 65;
num_frameHeight = 480;
num_frameWidth = 240;

I then generate some dummy data that is identical in shape and size to the actual raw data:

% generate empty struct:
stu_data2disk = struct();

% loop over blocks:
for num_k = 1:num_blockCount

   % generate block-name:
   temp_str_blockName = sprintf('block_%03u', num_k);

   % generate temp struct for current block:
   temp_stu_value = struct();
   temp_stu_value.frames = randi( ...
      [0 255], ...
      [num_frameHeight num_frameWidth num_blockLength], ...
      'uint8' ...
   );
   temp_stu_value.blockNo = num_k;

   % using dynamic field names:
   stu_data2disk.(temp_str_blockName) = temp_stu_value;

end

I now have all my random test-data in a struct stu_data2disk. Now I would like to save the data using one of two possible methods.

Let's try the simple one first:

% save data (simple):
disp('Save data the simple way:')
tic;
save converted.mat -struct stu_data2disk -v7.3;
toc;

The file is written without problems (286MB). The output is:

Save data the simple way:
Elapsed time is 14.004449 seconds.

OK, then I remembered that I would like to follow the save procedure's progress over the 40 blocks. So instead of the above I loop over the blocks and append them in sequence:

% save to file, using append:
disp('Save data using -append:')
tic;
for num_k = 1:num_blockCount

   % generate block-name:
   temp_str_blockName = sprintf('block_%03u', num_k);

   temp_str_appendToggle = '';
   if (num_k > 1)
      temp_str_appendToggle = '-append';
   end

   % generate save command:
   temp_str_saveCommand = [ ...
      'save ', ...
      'converted_append.mat ', ...
      '-struct stu_data2disk ', temp_str_blockName, ' '...
      temp_str_appendToggle, ' ', ...
      '-v7.3', ...
      ';' ...
   ];

   % evaluate save command:
   eval(temp_str_saveCommand);

end
toc;

And again the file saves nicely (286MB). The output is:

Save data using -append:
Elapsed time is 0.956968 seconds.
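
As an aside, the same loop can also be written with the functional form of save, which avoids building a command string for eval (a sketch; it should behave identically to the eval version above):

% save to file, using append (functional form of save, no eval):
tic;
for num_k = 1:num_blockCount

   % generate block-name:
   temp_str_blockName = sprintf('block_%03u', num_k);

   % first block creates the file, the rest are appended:
   if (num_k == 1)
      save('converted_append.mat', '-struct', 'stu_data2disk', ...
         temp_str_blockName, '-v7.3');
   else
      save('converted_append.mat', '-struct', 'stu_data2disk', ...
         temp_str_blockName, '-append');
   end

end
toc;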

Interestingly, the append method is much faster. My question is: why?

Output from dir converted*.mat:

09-02-2011  20:38       300,236,392 converted.mat
09-02-2011  20:37       300,264,316 converted_append.mat
               2 File(s)    600,500,708 bytes

The files are not identical in size, and a test with fc in Windows 7 revealed... well, many binary differences. Perhaps the data was shifted a bit, so this alone tells us nothing.

Does someone have an idea what is going on here? Is the appended file perhaps using a much more optimized data structure? Or has Windows cached the file, making access to it much faster?
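
If anyone wants to check whether the two files differ in their HDF5 layout (chunking, compression filters), a sketch like the following should do; it assumes a MATLAB with the high-level h5info function (R2011a and newer) and relies on each variable being stored as a group /block_NNN with a frames dataset:

% compare HDF5 layout of the frames dataset of the first block:
for temp_c = {'converted.mat', 'converted_append.mat'}
   temp_ds = h5info(temp_c{1}, '/block_001/frames');
   fprintf('%s: chunk size [%s], %d filter(s)\n', ...
      temp_c{1}, num2str(temp_ds.ChunkSize), numel(temp_ds.Filters));
end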

I made the effort of test-reading from the two files as well. Without presenting the numbers here: the appended version was a little faster (which could mean something in the long run).
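
For reference, the read test was essentially just (a sketch):

% compare read times of the two files:
tic; temp_data_simple = load('converted.mat'); toc;
tic; temp_data_append = load('converted_append.mat'); toc;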

[EDIT]: I just tried using no format flag (which defaults to -v7 on my system), and there is not much of a difference anymore:

Save data the simple way (-v7):
Elapsed time is 13.092084 seconds.
Save data using -append (-v7):
Elapsed time is 14.345314 seconds.

[EDIT]: I corrected the above mistake. Previously I stated that the stats were for -v6, but I was mistaken: I had just removed the format flag and assumed the default was -v6, when it is actually -v7.

I have created new test stats for all formats on my system using Andrew's fine framework (all formats use the same random test data, now read from file):

15:15:51.422: Testing speed, format=-v6, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
15:16:00.829: Save the simple way:            0.358 sec
15:16:01.188: Save using multiple append:     7.432 sec
15:16:08.614: Save using one big append:      1.161 sec

15:16:24.659: Testing speed, format=-v7, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
15:16:33.442: Save the simple way:           12.884 sec
15:16:46.329: Save using multiple append:    14.442 sec
15:17:00.775: Save using one big append:     13.390 sec

15:17:31.579: Testing speed, format=-v7.3, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
15:17:40.690: Save the simple way:           13.751 sec
15:17:54.434: Save using multiple append:     3.970 sec
15:17:58.412: Save using one big append:      6.138 sec

And the sizes of the files:

10-02-2011  15:16       299,528,768 converted_format-v6.mat
10-02-2011  15:16       299,528,768 converted_append_format-v6.mat
10-02-2011  15:16       299,528,832 converted_append_batch_format-v6.mat
10-02-2011  15:16       299,894,027 converted_format-v7.mat
10-02-2011  15:17       299,894,027 converted_append_format-v7.mat
10-02-2011  15:17       299,894,075 converted_append_batch_format-v7.mat
10-02-2011  15:17       300,236,392 converted_format-v7.3.mat
10-02-2011  15:17       300,264,316 converted_append_format-v7.3.mat
10-02-2011  15:18       300,101,800 converted_append_batch_format-v7.3.mat
               9 File(s)  2,698,871,005 bytes

Thus -v6 seems to be the fastest for writing. There are also no large differences in file sizes. HDF5 does have a basic deflate (compression) method built in, as far as I know.

Hmm, probably some optimization in the underlying HDF5-write functions?

Currently I still think that some fundamental underlying HDF5 write function is optimized for adding datasets to an HDF5 file (which is what happens when new variables are added to a -v7.3 file). I believe I have read somewhere that HDF5 is optimized in this very way... though I cannot be sure.
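
One way to probe that hypothesis outside of save would be to write the same 40 blocks as individual datasets through MATLAB's HDF5 functions directly. This is only a sketch: h5create/h5write require R2011a or newer, the file name raw_hdf5_test.h5 is arbitrary, and only the frames arrays are written:

% write 40 blocks as separate HDF5 datasets, bypassing save entirely:
tic;
for num_k = 1:num_blockCount
   temp_str_dsName = sprintf('/block_%03u/frames', num_k);
   h5create('raw_hdf5_test.h5', temp_str_dsName, ...
      [num_frameHeight num_frameWidth num_blockLength], 'Datatype', 'uint8');
   h5write('raw_hdf5_test.h5', temp_str_dsName, ...
      stu_data2disk.(sprintf('block_%03u', num_k)).frames);
end
toc;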

Other details to note:

The behavior is very systematic, as we see in Andrew's answer below. It also seems to matter quite a bit whether you run these things in the local scope of a function or in the "global" scope of an m-script. My first results were from an m-script where files were written to the current directory. I can still only reproduce the 1-second write for -v7.3 in the m-script; the function calls apparently add some overhead.
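
To check the scope effect yourself, it is enough to wrap the test in a function, e.g. (a compacted sketch with a hypothetical name; the logic is the same as in the m-script above):

function testAppendScope()
% same data generation and -v7.3 append loop as above, but in the
% local scope of a function instead of an m-script:
stu_data2disk = struct();
for num_k = 1:40
   stu_data2disk.(sprintf('block_%03u', num_k)) = struct( ...
      'frames', randi([0 255], [480 240 65], 'uint8'), ...
      'blockNo', num_k);
end
tic;
save('converted_fn.mat', '-struct', 'stu_data2disk', 'block_001', '-v7.3');
for num_k = 2:40
   save('converted_fn.mat', '-struct', 'stu_data2disk', ...
      sprintf('block_%03u', num_k), '-append');
end
toc;
end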

Update July 2016:

I found this again and thought I might test it with the newest MATLAB available to me at the moment. With MATLAB R2016a on Windows 7 x64 the problem seems to have been fixed:

14:04:06.277: Testing speed, imax=255, R2016a on PCWIN64, arch=AMD64, 16 GB, os=Microsoft Windows 7 Enterprise  Version 6.1 (Build 7601: Service Pack 1)
14:04:10.600: basic -v7.3:                    7.599 sec      5.261 GB used
14:04:18.229: basic -v7.3:                    7.894 sec      5.383 GB used
14:04:26.154: basic -v7.3:                    7.909 sec      5.457 GB used
14:04:34.096: basic -v7.3:                    7.919 sec      5.498 GB used
14:04:42.048: basic -v7.3:                    7.886 sec      5.516 GB used     286 MB file   7.841 sec mean
14:04:50.581: multiappend -v7.3:              7.928 sec      5.819 GB used
14:04:58.544: multiappend -v7.3:              7.905 sec      5.834 GB used
14:05:06.485: multiappend -v7.3:              8.013 sec      5.844 GB used
14:05:14.542: multiappend -v7.3:              8.591 sec      5.860 GB used
14:05:23.168: multiappend -v7.3:              8.059 sec      5.868 GB used     286 MB file   8.099 sec mean
14:05:31.913: bigappend -v7.3:                7.727 sec      5.837 GB used
14:05:39.676: bigappend -v7.3:                7.740 sec      5.879 GB used
14:05:47.453: bigappend -v7.3:                7.645 sec      5.884 GB used
14:05:55.133: bigappend -v7.3:                7.656 sec      5.877 GB used
14:06:02.824: bigappend -v7.3:                7.963 sec      5.871 GB used     286 MB file   7.746 sec mean

This was tested with Andrew Janke's reproMatfileAppendSpeedup function from the accepted answer below (5 passes with format -v7.3). Now, -append is equally slow as, or slower than, a single save, as it should be. Perhaps it was a problem with an early build of the HDF5 driver used in R2009b.
