Question

I'm looking at using the amazon cloud for all my simulation needs. The resulting sim files are quite large, and I would like to move them over to my local drive for ease of analysis, ect. You have to pay per data you move over, so I want to compress all my sim soutions as small as possible. They are simply numpy arrays saved in the form of .mat files, using:

import scipy.io as sio
sio.savemat(filepath, do_compression = True) 

So my question is, what is the best way to compress numpy arrays (they are currently stored in .mat files, but I could store them using any python method), by using python compression saving, linux compression, or both?

I am in the linux environment, and I am open to any kind of file compression.

Was it helpful?

Solution

Unless you know something special about the arrays (e.g. sparseness, or some pattern) you aren't going to do much better than the default compression, and maybe gzip on top of that. In fact you may not even need to gzip the files if you're using HTTP for downloads and your server is configured to do compression. Good lossless compression algorithms rarely vary by more than 10%.

If savemat works as advertized you should be able to get gzip compression all in python with:

import scipy.io as sio
import gzip

f_out = gzip.open(filepath_dot_gz, 'wb')
sio.savemat(f_out, do_compression = True)

OTHER TIPS

Also LZMA (AKA xz) gives very good compression on fairly sparse numpy arrays, albeit it is pretty slow when compressing (and may require more memory as well).

In Ubuntu it is installed with sudo apt-get install python-lzma

It is used as any other file-object wrapper, something like that (to load pickled data):

from lzma import LZMAFile
import cPickle as pickle

if fileName.endswith('.xz'):
   dataFile = LZMAFile(fileName,'r')
else:
   dataFile = file(fileName, 'ro')     
data = pickle.load(dataFile)

Though it won't necessarily give you the highest compression ratios, I've had good experiences saving compressed numpy arrays to disk with python-blosc. It is very fast and integrates well with numpy.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top