Do you know of any Python libraries that are good for storing large unstructured lists (for instance, a = [0, numpy.asarray([1,2,3])])?

From what I've seen so far, h5py doesn't support this kind of data, and pickle seems to be slow. Are there any other alternatives?

For my purposes, we can assume that I am dealing with data of the same type (numpy arrays of int type) but of different shapes.

Solution

Actually, you can store and retrieve this kind of data in an HDF5 file with just a little bit of custom logic:

import tables
import numpy as np

def store(filename, name, data):
    # Create one array node per list item, all grouped under /<name>;
    # the node name encodes the item's position in the list.
    with tables.open_file(filename, 'w') as f:
        f.create_group('/', name)
        for i, item in enumerate(data):
            f.create_array('/%s' % name, 'item_%s' % i, item)

def read(filename, name):
    with tables.open_file(filename, 'r') as f:
        nodes = f.list_nodes('/%s' % name)
        data = [None] * len(nodes)
        for node in nodes:
            # Recover each item's list position from its node name.
            pos = int(node.name.split('_')[-1])
            data[pos] = node.read()
        return data

Usage:

>>> a = [0, np.array([4,5,6])]
>>> store('my_data.h5', 'a', a)
>>> print(read('my_data.h5', 'a'))
[0, array([4, 5, 6])]

This is just the first thing that came to mind; I'm sure there is a more efficient pattern for storing lists in HDF5 files (a sketch of one such pattern follows the timings below). But let's time it and see whether even this naive implementation is faster than cPickle:

In [7]: a = []
        for i in range(1, 500):
            if i % 10 == 0:
                a.append(i)
            else:
                a.append(np.random.randn(i, i))
In [8]: %%timeit
        store('my_data.h5', 'a', a)
        read_data = read('my_data.h5', 'a')
1 loops, best of 3: 1.32 s per loop
In [9]: %%timeit
        with open('test.pickle', 'wb') as f:
            cPickle.dump(a, f)
        with open('test.pickle', 'rb') as f:
            read_data = cPickle.load(f)
1 loops, best of 3: 1min 58s per loop

Depending on the data, the difference can be even bigger or somewhat smaller. But even this naive implementation is at least 10x faster than cPickle for any data that contains numpy arrays.
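
As one example of a more compact pattern than one-node-per-item, all items can be concatenated into a single flat dataset alongside an offsets index and a ragged shapes dataset. This is just a sketch, assuming an h5py version that provides h5py.vlen_dtype and relying on the question's premise that all items share a common integer dtype; store_flat and read_flat are hypothetical names:

import h5py
import numpy as np

def store_flat(filename, name, data):
    # Concatenate all items into one 1-D dataset and remember where each
    # item starts (offsets) and which shape it had (shapes).
    arrays = [np.asarray(item) for item in data]
    flat = np.concatenate([arr.ravel() for arr in arrays])
    offsets = np.cumsum([0] + [arr.size for arr in arrays])
    with h5py.File(filename, 'w') as f:
        grp = f.create_group(name)
        grp.create_dataset('flat', data=flat)
        grp.create_dataset('offsets', data=offsets)
        # Shapes have different lengths, so store them as a
        # variable-length integer dataset, one row per item.
        dt = h5py.vlen_dtype(np.dtype('int64'))
        shapes = grp.create_dataset('shapes', (len(arrays),), dtype=dt)
        for i, arr in enumerate(arrays):
            shapes[i] = np.array(arr.shape, dtype=np.int64)

def read_flat(filename, name):
    with h5py.File(filename, 'r') as f:
        grp = f[name]
        flat = grp['flat'][...]
        offsets = grp['offsets'][...]
        shapes = grp['shapes'][...]
    # Slice each item back out and restore its original shape; note that
    # scalar items come back as 0-d arrays rather than plain Python ints.
    return [flat[offsets[i]:offsets[i + 1]].reshape(tuple(shapes[i]))
            for i in range(len(shapes))]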

Other tips

If you think that pickle and cPickle are too slow, you should look into either marshal or shelve, as they are the two other major off-the-shelf serialization libraries. If those don't work for you, you're going to want to start using a legitimate database.
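
For what it's worth, marshal only handles core Python types (so it won't take numpy arrays directly), which makes shelve the more likely fit of the two here. A minimal sketch of the shelve route, with 'my_data.db' as a hypothetical filename:

import shelve
import numpy as np

a = [0, np.array([1, 2, 3])]

# shelve gives dict-like persistent storage keyed by strings;
# values are pickled behind the scenes.
with shelve.open('my_data.db') as db:
    db['a'] = a

with shelve.open('my_data.db') as db:
    restored = db['a']

print(restored)   # [0, array([1, 2, 3])]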

After all, the ability to store and retrieve large amounts of data quickly is basically what a database is for, and these serialization modules will only get you so far towards that. If they were perfect, you wouldn't need databases.

If you don't want to use either of those, there are actually tools out there built specifically for this purpose, but I suspect it would be a one-off. You can look here for one such tool, but there are a couple more.
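
One widely used example of such a tool (not necessarily the one that was linked) is joblib, whose dump and load are optimized for objects containing large numpy arrays. A minimal sketch:

import joblib
import numpy as np

a = [0, np.array([1, 2, 3])]

# joblib special-cases large numpy arrays inside arbitrary Python objects;
# compress=3 (zlib level 3) is optional and trades speed for file size.
joblib.dump(a, 'my_data.joblib', compress=3)
restored = joblib.load('my_data.joblib')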
