Python particles simulator: out-of-core processing

Question 1

PyTable Solution

Since functionality provided by Pandas is not needed, and the processing is much slower (see notebook below), the best approach is using PyTables or h5py directly. I've tried only the pytables approach so far.

All tests were performed in this notebook:

Python particles simulator: numpy out-of-core processing

Introduction to pytables data-structures

Reference: Official PyTables Docs

Pytables allows store data in HDF5 files in 2 types of formats: arrays and tables.

Arrays

There are 3 types of arrays Array, CArray and EArray. They all allow to store and retrieve (multidimensional) slices with a notation similar to numpy slicing.

# Write data to store (broadcasting works)
array1[:]  = 3

# Read data from store
in_ram_array = array1[:]

For optimization in some use cases, CArray is saved in "chunks", whose size can be chosen with chunk_shape at creation time.

Array and CArray size is fixed at creation time. You can fill/write the array chunk-by-chunk after creation though. Conversely EArray can be extended with the .append() method.

Tables

The table is a quite different beast. It's basically a "table". You have only 1D index and each element is a row. Inside each row there are the "columns" data types, each columns can have a different type. It you are familiar with numpy record-arrays, a table is basically an 1D record-array, with each element having many fields as the columns.

1D or 2D numpy arrays can be stored in tables but it's a bit more tricky: we need to create a row data type. For example to store an 1D uint8 numpy array we need to do:

table_uint8 = np.dtype([('field1', 'u1')])
table_1d = data_file.create_table('/', 'array_1d', description=table_uint8)

So why using tables? Because, differently from arrays, tables can be efficiently queried. For example, if we want to search for elements > 3 in a huge disk-based table we can do:

index = table_1d.get_where_list('field1 > 3')

Not only it is simple (compared with arrays where we need to scan the whole file in chunks and build index in a loop) but it is also very extremely fast.

How to store simulation parameters

The best way to store simulation parameters is to use a group (i.e. /parameters), convert each scalar to numpy array and store it as CArray.

Array for "`emission`"

emission is the biggest array that is generated and read sequentially. For this usage pattern A good data structure is EArray. On "simulated" data with ~50% of zeros elements blosc compression (level=5) achieves 2.2x compression ratio. I found that a chunk-size of 2^18 (256k) has the minimum processing time.

Storing "`counts`"

Storing also "counts" will increase the file size by 10% and will take 40% more time to compute timestamps. Having counts stored is not an advantage per-se because only the timestamps are needed in the end.

The advantage is that recostructing the index (timestamps) is simpler because we query the full time axis in a single command (.get_where_list('counts >= 1')). Conversely, with chunked processing, we need to perform some index arithmetics that is a bit tricky, and maybe a burden to maintain.

However the the code complexity may be small compared to all the other operations (sorting and merging) that are needed in both cases.

Storing "`timestamps`"

Timestamps can be accumulated in RAM. However, we don't know the arrays size before starting and a final hstack() call is needed to "merge" the different chunks stored in a list. This doubles the memory requirements so the RAM may be insufficient.

We can store as-we-go timestamps to a table using .append(). At the end we can load the table in memory with .read(). This is only 10% slower than all-in-memory computation but avoids the "double-RAM" requirement. Moreover we can avoid the final full-load and have minimal RAM usage.

H5Py

H5py is a much simpler library than pytables. For this use-case of (mainly) sequential processing seems a better fit than pytables. The only missing feature is the lack of 'blosc' compression. If this results in a big performance penalty remains to be tested.

Question 2

Dask.array can perform chunked operations like max, cumsum, etc. on an on-disk array like PyTables or h5py.

import h5py
d = h5py.File('myfile.hdf5')['/data']
import dask.array as da
x = da.from_array(d, chunks=(1000, 1000))

X looks and feels like a numpy array and copies much of the API. Operations on x will create a DAG of in-memory operations which will execute efficiently using multiple cores streaming from disk as necessary

da.exp(x).mean(axis=0).compute()

http://dask.pydata.org/en/latest/

conda install dask
or 
pip install dask

Question 3

See here for how to store your parameters in the HDF5 file (it pickles, so you can store them how you have them; their is a 64kb limit on the size of the pickle).

import pandas as pd                                                                                                                                                                                                                                                                                               
import numpy as np                                                                                                                                                                                                                                                                                                

n_particles = 10                                                                                                                                                                                                                                                                                                  
chunk_size = 1000                                                                                                                                                                                                                                                                                                 

# 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   

# generate simulated data                                                                                                                                                                                                                                                                                         
for i in range(10):                                                                                                                                                                                                                                                                                               

    df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            

    # create a globally unique index (time)                                                                                                                                                                                                                                                                       
    # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-

    data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397                                                                                                                                                              
        try:                                                                                                                                                                                                                                                                                                          
            nrows = emission.get_storer('df').nrows                                                                                                                                                                                                                                                                   
        except:                                                                                                                                                                                                                                                                                                       
            nrows = 0                                                                                                                                                                                                                                                                                                 

        df.index = pd.Series(df.index) + nrows                                                                                                                                                                                                                                                                        
        emission.append('df',df)                                                                                                                                                                                                                                                                                      

    emission.close()                                                                                                                                                                                                                                                                                                  

    # 2) create counts                                                                                                                                                                                                                                                                                                
    cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                           

    # this is an iterator, can be any size                                                                                                                                                                                                                                                                            
    for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         

        counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             

        # set the index as the same                                                                                                                                                                                                                                                                                   
        counts.index = df.index                                                                                                                                                                                                                                                                                       

        # store the sum across all particles (as most are zero this will be a 
        # nice sub-selector                                                                                                                                                                                                                       
        # better maybe to have multiple of these sums that divide the particle space                                                                                                                                                                                                                                  
        # you don't have to do this but prob more efficient                                                                                                                                                                                                                                                           
        # you can do this in another file if you want/need                                                                                                                                                                                                                                                               
        counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)                                                                                                                                                                                                                                                           
        counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)                                                                                                                                                                                                                                                           

        # make the non_zero column indexable                                                                                                                                                                                                                                                                          
        cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         

    cs.close()                                                                                                                                                                                                                                                                                                        

    # 3) find interesting counts                                                                                                                                                                                                                                                                                      
    print pd.read_hdf('counts.hdf','df',where='particles_0_4>0')                                                                                                                                                                                                                                                      
    print pd.read_hdf('counts.hdf','df',where='particles_5_9>0')

You can alternatively, make each particle a data_column and select on them individually.

and some output (pretty active emission in this case :)

    0  1  2  3  4  5  6  7  8  9  particles_0_4  particles_5_9
0   2  2  2  3  2  1  0  2  1  0              9              4
1   1  0  0  0  1  0  1  0  3  0              1              4
2   0  2  0  0  2  0  0  1  2  0              2              3
3   0  0  0  1  1  0  0  2  0  3              1              2
4   3  1  0  2  1  0  0  1  0  0              6              1
5   1  0  0  1  0  0  0  3  0  0              2              3
6   0  0  0  1  1  0  1  0  0  0              1              1
7   0  2  0  2  0  0  0  0  2  0              4              2
8   0  0  0  1  3  0  0  0  0  1              1              0
10  1  0  0  0  0  0  0  0  0  1              1              0
11  0  0  1  1  0  2  0  1  2  1              2              5
12  0  2  2  4  0  0  1  1  0  1              8              2
13  0  2  1  0  0  0  0  1  1  0              3              2
14  1  0  0  0  0  3  0  0  0  0              1              3
15  0  0  0  1  1  0  0  0  0  0              1              0
16  0  0  0  4  3  0  3  0  1  0              4              4
17  0  2  2  3  0  0  2  2  0  2              7              4
18  0  1  2  1  0  0  3  2  1  2              4              6
19  1  1  0  0  0  0  1  2  1  1              2              4
20  0  0  2  1  2  2  1  0  0  1              3              3
22  0  1  2  2  0  0  0  0  1  0              5              1
23  0  2  4  1  0  1  2  0  0  2              7              3
24  1  1  1  0  1  0  0  1  2  0              3              3
26  1  3  0  4  1  0  0  0  2  1              8              2
27  0  1  1  4  0  1  2  0  0  0              6              3
28  0  1  0  0  0  0  0  0  0  0              1              0
29  0  2  0  0  1  0  1  0  0  0              2              1
30  0  1  0  2  1  2  0  2  1  1              3              5
31  0  0  1  1  1  1  1  0  1  1              2              3
32  3  0  2  1  0  0  1  0  1  0              6              2
33  1  3  1  0  4  1  1  0  1  4              5              3
34  1  1  0  0  0  0  0  3  0  1              2              3
35  0  1  0  0  1  1  2  0  1  0              1              4
36  1  0  1  0  1  2  1  2  0  1              2              5
37  0  0  0  1  0  0  0  0  3  0              1              3
38  2  5  0  0  0  3  0  1  0  0              7              4
39  1  0  0  2  1  1  3  0  0  1              3              4
40  0  1  0  0  1  0  0  4  2  2              1              6
41  0  3  3  1  1  2  0  0  2  0              7              4
42  0  1  0  2  0  0  0  0  0  1              3              0
43  0  0  2  0  5  0  3  2  1  1              2              6
44  0  2  0  1  0  0  1  0  0  0              3              1
45  1  0  0  2  0  0  0  1  4  0              3              5
46  0  2  0  0  0  0  0  1  1  0              2              2
48  3  0  0  0  0  1  1  0  0  0              3              2
50  0  1  0  1  0  1  0  0  2  1              2              3
51  0  0  2  0  0  0  2  3  1  1              2              6
52  0  0  2  3  2  3  1  0  1  5              5              5
53  0  0  0  2  1  1  0  0  1  1              2              2
54  0  1  2  2  2  0  1  0  2  0              5              3
55  0  2  1  0  0  0  0  0  3  2              3              3
56  0  1  0  0  0  2  2  0  1  1              1              5
57  0  0  0  1  1  0  0  1  0  0              1              1
58  6  1  2  0  2  2  0  0  0  0              9              2
59  0  1  1  0  0  0  0  0  2  0              2              2
60  2  0  0  0  1  0  0  1  0  1              2              1
61  0  0  3  1  1  2  0  0  1  0              4              3
62  2  0  1  0  0  0  0  1  2  1              3              3
63  2  0  1  0  1  0  1  0  0  0              3              1
65  0  0  1  0  0  0  1  5  0  1              1              6
   .. .. .. .. .. .. .. .. .. ..            ...            ...

[9269 rows x 12 columns]

Question 4

Use OpenMM to simulate particles (https://github.com/SimTk/openmm) and MDTraj (https://github.com/rmcgibbo/mdtraj) to handle trajectory IO.

Question 5

The pytables vs pandas.HDFStore tests in the accepted answer is completely misleading:

The first critical error is the timing did not apply os.fsync after flush, which make the speed test unstable. So sometime, the pytables function is accidentally much faster.
The 2nd problem is the pytables and pandas versions have completely different shapes due to misunderstanding the pytables.EArray's shape arg. The author try to append column into pytables version but append row into pandas version.
The 3rd problem is the author used different chunkshape during comparison.
The author also forgot to disable the table index generation during store.append() which is a time consuming process.

The follow table showed the performance results from his version and my fixes. tbold is his pytables version, pdold is his pandas version. tbsync and pdsync are his version with fsync() after flush() and also disable the table index generation during append. the tbopt and pdopt are my optimized version, with blosc:lz4 and complevel 9.

| name   |   dt |   data size [MB] |   comp ratio % | chunkshape   | shape         | clib            | indexed   |
|:-------|-----:|-----------------:|---------------:|:-------------|:--------------|:----------------|:----------|
| tbold  | 5.11 |           300.00 |          84.63 | (15, 262144) | (15, 5242880) | blosc[5][1]     | False     |
| pdold  | 8.39 |           340.00 |          39.26 | (1927,)      | (5242880,)    | blosc[5][1]     | True      |
| tbsync | 7.47 |           300.00 |          84.63 | (15, 262144) | (15, 5242880) | blosc[5][1]     | False     |
| pdsync | 6.97 |           340.00 |          39.27 | (1927,)      | (5242880,)    | blosc[5][1]     | False     |
| tbopt  | 4.78 |           300.00 |          43.07 | (4369, 15)   | (5242880, 15) | blosc:lz4[9][1] | False     |
| pdopt  | 5.73 |           340.00 |          38.53 | (3855,)      | (5242880,)    | blosc:lz4[9][1] | False     |

The pandas.HDFStore uses pytables under the hood. Thus if we use them correctly, they should have no difference at all.

We can see the pandas version has larger data size. This is because the pandas use pytables.Table instead of EArray. And the pandas.DataFrame always have an index column. The first column of the Table object is this DataFrame index which require some extra space to save. This only affect IO performance a little but provide more features such as out-of-core query. So I still recommend pandas here. @MRocklin also mentioned a nicer out-of-core package dask, if most features you used are just array operations instead of table-like query. But the IO performance won't have distinguishable difference.

h5f.root.emission._v_attrs
Out[82]: 
/emission._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := 'UTF-8',
    index_cols := [(0, 'index')],
    info := {1: {'names': [None], 'type': 'RangeIndex'}, 'index': {}},
    levels := 1,
    metadata := [],
    nan_rep := 'nan',
    non_index_axes := [(1, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])],
    pandas_type := 'frame_table',
    pandas_version := '0.15.2',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0']]

Here is the functions:

def generate_emission(shape):
    """Generate fake emission."""
    emission = np.random.randn(*shape).astype('float32') - 1
    emission.clip(0, 1e6, out=emission)
    assert (emission >=0).all()
    return emission



def test_puretb_earray(outpath,
                        n_particles = 15,
                        time_chunk_size = 2**18,
                        n_iter = 20,
                       sync = True,
                       clib = 'blosc',
                       clevel = 5,
                       ):

    time_size = n_iter * time_chunk_size
    data_file = pytb.open_file(outpath, mode="w")
    comp_filter = pytb.Filters(complib=clib, complevel=clevel)
    emission = data_file.create_earray('/', 'emission', atom=pytb.Float32Atom(),
                                       shape=(n_particles, 0),
                                       chunkshape=(n_particles, time_chunk_size),
                                       expectedrows=n_iter * time_chunk_size,
                                       filters=comp_filter)

    # generate simulated emission data
    t0 =time()
    for i in range(n_iter):
        emission_chunk = generate_emission((n_particles, time_chunk_size))
        emission.append(emission_chunk)

    emission.flush()
    if sync:
        os.fsync(data_file.fileno())
    data_file.close()
    t1 = time()
    return t1-t0


def test_puretb_earray2(outpath,
                        n_particles = 15,
                        time_chunk_size = 2**18,
                        n_iter = 20,
                       sync = True,
                       clib = 'blosc',
                       clevel = 5,
                       ):

    time_size = n_iter * time_chunk_size
    data_file = pytb.open_file(outpath, mode="w")
    comp_filter = pytb.Filters(complib=clib, complevel=clevel)
    emission = data_file.create_earray('/', 'emission', atom=pytb.Float32Atom(),
                                       shape=(0, n_particles),
                                       expectedrows=time_size,
                                       filters=comp_filter)

    # generate simulated emission data
    t0 =time()
    for i in range(n_iter):
        emission_chunk = generate_emission((time_chunk_size, n_particles))
        emission.append(emission_chunk)

    emission.flush()

    if sync:
        os.fsync(data_file.fileno())
    data_file.close()
    t1 = time()
    return t1-t0


def test_purepd_df(outpath,
                   n_particles = 15,
                   time_chunk_size = 2**18,
                   n_iter = 20,
                   sync = True,
                   clib='blosc',
                   clevel=5,
                   autocshape=False,
                   oldversion=False,
                   ):
    time_size = n_iter * time_chunk_size
    emission = pd.HDFStore(outpath, mode='w', complib=clib, complevel=clevel)
    # generate simulated data
    t0 =time()
    for i in range(n_iter):

        # Generate fake emission
        emission_chunk = generate_emission((time_chunk_size, n_particles))

        df = pd.DataFrame(emission_chunk, dtype='float32')

        # create a globally unique index (time)
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-
        # amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('emission').nrows
        except:
            nrows = 0

        df.index = pd.Series(df.index) + nrows

        if autocshape:
            emission.append('emission', df, index=False,
                            expectedrows=time_size
                            )
        else:
            if oldversion:
                emission.append('emission', df)
            else:
                emission.append('emission', df, index=False)

    emission.flush(fsync=sync)
    emission.close()
    t1 = time()
    return t1-t0


def _test_puretb_earray_nosync(outpath):
    return test_puretb_earray(outpath, sync=False)


def _test_purepd_df_nosync(outpath):
    return test_purepd_df(outpath, sync=False,
                          oldversion=True
                          )


def _test_puretb_earray_opt(outpath):
    return test_puretb_earray2(outpath,
                               sync=False,
                               clib='blosc:lz4',
                               clevel=9
                               )


def _test_purepd_df_opt(outpath):
    return test_purepd_df(outpath,
                          sync=False,
                          clib='blosc:lz4',
                          clevel=9,
                          autocshape=True
                          )


testfns = {
    'tbold':_test_puretb_earray_nosync,
    'pdold':_test_purepd_df_nosync,
    'tbsync':test_puretb_earray,
    'pdsync':test_purepd_df,
    'tbopt': _test_puretb_earray_opt,
    'pdopt': _test_purepd_df_opt,
}

Python particles simulator: out-of-core processing

Problem description

Question

What I would like it exist

What are the practical options

Tentative solution (pyTables)

PyTable Solution

Introduction to pytables data-structures

Arrays

Tables

How to store simulation parameters

Array for "`emission`"

Storing "`counts`"

Storing "`timestamps`"

H5Py

Python particles simulator: out-of-core processing

Problem description

Question

What I would like it exist

What are the practical options

Tentative solution (pyTables)

PyTable Solution

Introduction to pytables data-structures

Arrays

Tables

How to store simulation parameters

Array for "emission"

Storing "counts"

Storing "timestamps"

H5Py

Array for "`emission`"

Storing "`counts`"

Storing "`timestamps`"