After some time I finally worked out a solution that gave me an unbelievable speedup (around 10x).
In a word: ramdisk (tested on Linux, both Ubuntu and CentOS).
Recap:
Since the problem is fairly general, let me state it again more completely.
Say I have many small files stored on disk (txt, pict; on the order of millions) that I want to manipulate (e.g. via MATLAB).
Working on such files (i.e. loading them or transmitting them over the network) while they are stored on disk is tremendously slow, since the disk access is mostly random.
Hence, tarballing the files into archives (e.g. of fixed size) looked to me like a good way to keep the disk access sequential.
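For instance, batching the files into fixed-size tarballs could look like this (a minimal sketch; the folder name, file pattern and batch size are assumptions, not my actual setup):

```matlab
% Minimal sketch: pack many small files into fixed-size tarballs.
% 'mydata', '*.txt' and batchSize are hypothetical placeholders.
srcDir    = 'mydata';
batchSize = 1000;

listing = dir(fullfile(srcDir, '*.txt'));
names   = {listing.name};
nBatch  = ceil(numel(names) / batchSize);
for k = 1:nBatch
    idx = (k-1)*batchSize+1 : min(k*batchSize, numel(names));
    % file names are given relative to srcDir (the rootfolder argument)
    tar(sprintf('archive_%04d.tar', k), names(idx), srcDir);
end
```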
Problem:
If the manipulation of the .tar requires a preliminary extraction to disk (as happens with MATLAB's UNTAR), the speedup given by sequential disk access is mostly lost.
Workaround:
The tarball (provided it is reasonably small) can be extracted to memory and then processed from there. As I stated in the question, though, in-memory .tar manipulation is not possible in MATLAB. What can be done (equivalently) is untarring to a ramdisk.
In Linux, e.g. Ubuntu, a default ramdisk is mounted at /run/shm (a tmpfs). Files can be untarred there via MATLAB and then accessed extremely fast.
In other words, a possible workcycle is (see the sketch after the list):

- untar to /run/shm/mytemp
- manipulate in memory
- possibly tar the output back to disk
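A minimal sketch of that cycle, assuming /run/shm is available (the default on Ubuntu); the archive names and the processing step are hypothetical placeholders:

```matlab
% Minimal sketch of the workcycle: untar to ramdisk, process, tar back.
ramDir = '/run/shm/mytemp';
if ~exist(ramDir, 'dir'), mkdir(ramDir); end

% untar to the ramdisk: the extraction writes to RAM, not to the slow disk
extracted = untar('archive_0001.tar', ramDir);

% manipulate "in memory": every read now hits the tmpfs
for k = 1:numel(extracted)
    data = fileread(extracted{k});
    % result = processData(data);   % hypothetical processing step
end

% possibly tar the output back to disk, then free the ramdisk space
tar('results_0001.tar', 'mytemp', '/run/shm');
rmdir(ramDir, 's');
```

Since tmpfs space counts against RAM, cleaning up with rmdir after each archive keeps memory use bounded.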
This allowed me to cut the processing time from 8 hrs to 40 min, with the CPUs fully loaded.