Question

Given a .tar archive, Matlab allows one to extract the contained files to disk via the UNTAR command. One can then manipulate the extracted files in the ordinary way.

Issue: When several files are stored in a tarball, they are stored contiguously on disk and, in principle, they can be accessed serially. When such files are extracted, this contiguity no longer holds, and file access can become random, hence slow and inefficient.

This is especially critical when the considered files are many (thousands) and small.

My question: is there any way to access the archived files while avoiding the preliminary extraction (in a sort of HDF5 fashion)?

In other words, would it be possible to cache the .tar so as to access the contained files from memory rather than from disk?


(In general, direct .tar manipulation is possible, e.g. with tar-cs in C#, or with the tarfile module in Python.)
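As a point of comparison outside Matlab, the command-line tar can already list members and stream a single member's contents without unpacking the archive to disk. A minimal sketch (the archive and file names below are placeholders created just for the demonstration):

```shell
# Build a throwaway archive to demonstrate on
mkdir -p demo1 && echo "hello" > demo1/a.txt
tar -cf demo1.tar demo1

# List members without extracting anything
tar -tf demo1.tar

# Stream one member to stdout; no extracted file is written to disk
tar -xOf demo1.tar demo1/a.txt
```

From Matlab this kind of call could be issued via system(), at the cost of one process launch per access.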


Solution 2

After some time I finally worked out a solution which gave me unbelievable speedups (like 10x or so).

In a word: a ramdisk (tested on Linux: Ubuntu & CentOS).


Recap:

Since the problem has some generality, let me state it again in a more complete fashion.

Say that I have many small files stored on disk (txt, pict; on the order of millions) which I want to manipulate (e.g. via matlab).

Working on such files (i.e. loading them / transmitting them over the network) when they are stored on disk is tremendously slow, since the disk access is mostly random.

Hence, tarballing the files in archives (e.g. of fixed size) looked to me like a good way to keep the disk access sequential.

Problem:

In case the manipulation of the .tar requires a preliminary extraction to disk (as happens with matlab's UNTAR), the speedup given by sequential disk access is mostly lost.

Workaround:

The tarball (provided it is reasonably small) can be extracted to memory and then processed from there. In matlab, as I stated in the question, .tar manipulation in memory is not possible, though.

What can be done (equivalently) is untarring to ramdisk.

In Linux, e.g. Ubuntu, a default ramdisk drive is mounted at /run/shm (tmpfs). Files can be untarred there via matlab, giving extremely fast access afterwards.
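Before relying on this, it is worth confirming that the target path really is RAM-backed. The exact path is distro-dependent (/run/shm on Ubuntu, /dev/shm on most other distros; the latter is used here as an assumption):

```shell
# Show the filesystem type of the shared-memory mount;
# a RAM-backed mount reports tmpfs (or shm on older kernels)
df -T /dev/shm
```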

In other words, a possible workcycle is:

  1. untar to /run/shm/mytemp
  2. manipulate in memory
  3. possibly tar the output back to disk

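The steps above can be sketched in shell form; the ramdisk path and the archive names are assumptions (a tiny stand-in archive is built first so the sketch is self-contained):

```shell
# Stand-in for the real tarball
mkdir -p src && echo "data" > src/f.txt
tar -cf archive.tar src

RAMDISK=/dev/shm/mytemp            # /run/shm/mytemp on Ubuntu
mkdir -p "$RAMDISK"

tar -xf archive.tar -C "$RAMDISK"  # 1. untar to the ramdisk
cat "$RAMDISK/src/f.txt"           # 2. manipulate in memory (matlab would read from here)
tar -cf output.tar -C "$RAMDISK" src  # 3. tar the output back to disk

rm -rf "$RAMDISK"                  # free the RAM when done
```

Cleaning up the ramdisk afterwards matters: files left in tmpfs keep consuming RAM until removed or the machine reboots.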
This allowed me to cut my processing time from 8 hrs to 40 min, with full CPU load.

Other tips

No, as far as I know.

If you're using Matlab on Linux, try extracting to tempname. On many systems the temporary directory lives on tmpfs, which should be faster to access (a bad idea if we are talking about several GB).

Otherwise you can use system('tar xf file.tar only/needed/file'), or Python, to get more flexible untar behavior.
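Selective extraction as suggested above can be sketched like this; archive and member names are placeholders built on the spot:

```shell
# Stand-in archive containing a needed file and an unneeded one
mkdir -p proj && echo "x" > proj/needed.txt && echo "y" > proj/other.txt
tar -cf file.tar proj

# Extract only the member you need into a scratch directory
DEST=$(mktemp -d)
tar -xf file.tar -C "$DEST" proj/needed.txt

ls "$DEST/proj"   # only needed.txt was extracted
```

From Matlab the same command would be issued via system(), avoiding the full extraction that UNTAR performs.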

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow