more efficient way to pickle a string

https://stackoverflow.com/questions/695794

22-08-2019
|

Question

The pickle module seems to use string escape characters when pickling; this becomes inefficient e.g. on numpy arrays. Consider the following

z = numpy.zeros(1000, numpy.uint8)
len(z.dumps())
len(cPickle.dumps(z.dumps()))

The lengths are 1133 characters and 4249 characters respectively.

z.dumps() reveals something like "\x00\x00" (actual zeros in string), but pickle seems to be using the string's repr() function, yielding "'\x00\x00'" (zeros being ascii zeros).

i.e. ("0" in z.dumps() == False) and ("0" in cPickle.dumps(z.dumps()) == True)

Solution

Try using a later version of the pickle protocol with the protocol parameter to pickle.dumps(). The default is 0 and is an ASCII text format. Ones greater than 1 (I suggest you use pickle.HIGHEST_PROTOCOL). Protocol formats 1 and 2 (and 3 but that's for py3k) are binary and should be more space conservative.

OTHER TIPS

Solution:

import zlib, cPickle

def zdumps(obj):
  return zlib.compress(cPickle.dumps(obj,cPickle.HIGHEST_PROTOCOL),9)

def zloads(zstr):
  return cPickle.loads(zlib.decompress(zstr))  

>>> len(zdumps(z))
128

z.dumps() is already pickled string i.e., it can be unpickled using pickle.loads():

>>> z = numpy.zeros(1000, numpy.uint8)
>>> s = z.dumps()
>>> a = pickle.loads(s)
>>> all(a == z)
True

An improvement to vartec's answer, that seems a bit more memory efficient (since it doesn't force everything into a string):

def pickle(fname, obj):
    import cPickle, gzip
    cPickle.dump(obj=obj, file=gzip.open(fname, "wb", compresslevel=3), protocol=2)

def unpickle(fname):
    import cPickle, gzip
    return cPickle.load(gzip.open(fname, "rb"))

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow