Question

TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype.

Hello SO! I currently have two large HDFStores, each containing one node; neither node fits in memory. The nodes don't contain NaN values. Now I would like to merge these two nodes using this approach. I first tested it on a small store where all the data fits in one chunk, and that worked fine. But in the case where it has to merge chunk by chunk, it gives me the following error: TypeError: Cannot serialize the column [date], because its data contents are [empty] object dtype.

This is the code I'm running.

>>> import pandas as pd
>>> from pandas import HDFStore
>>> print pd.__version__
0.12.0rc1

>>> h5_1 ='I:/Data/output/test8\\var1.h5'
>>> h5_3 ='I:/Data/output/test8\\var3.h5'
>>> h5_1temp = h5_1.replace('.h5','temp.h5')

>>> A = HDFStore(h5_1)
>>> B = HDFStore(h5_3)
>>> Atemp = HDFStore(h5_1temp)

>>> print A
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1.h5
/var1            frame_table  (shape->12626172)
>>> print B
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var3.h5
/var3            frame_table  (shape->6313086)

>>> nrows_a = A.get_storer('var1').nrows
>>> nrows_b = B.get_storer('var3').nrows
>>> a_chunk_size = 500000
>>> b_chunk_size = 500000
>>> for a in xrange(int(nrows_a / a_chunk_size) + 1):
...     a_start_i = a * a_chunk_size
...     a_stop_i  = min((a + 1) * a_chunk_size, nrows_a)
...     a = A.select('var1', start = a_start_i, stop = a_stop_i)
...     for b in xrange(int(nrows_b / b_chunk_size) + 1):
...         b_start_i = b * b_chunk_size
...         b_stop_i = min((b + 1) * b_chunk_size, nrows_b)
...         b = B.select('var3', start = b_start_i, stop = b_stop_i)
...         Atemp.append('mergev13', pd.merge(a, b, left_index=True, right_index=True, how='inner'))

... 
Traceback (most recent call last):
  File "<interactive input>", line 9, in <module>
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write
    return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write
    **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes
    raise e
TypeError: Cannot serialize the column [date] because
its data contents are [empty] object dtype

One thing I noticed: the attributes mention pandas_version := '0.10.1', although my pandas version is 0.12.0rc1. Furthermore, here is some more specific information about the nodes:

>>> A.select_column('var1','date').unique()
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049,
       2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105,
       2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161,
       2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217,
       2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273,
       2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329,
       2006337, 2006345, 2006353, 2006361], dtype=int64)

>>> B.select_column('var3','date').unique()
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097,
       2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209,
       2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321,
       2006337, 2006353], dtype=int64)

>>> A.get_storer('var1').levels
['x', 'y', 'date']

>>> A.get_storer('var1').attrs
/var1._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var1'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> A.get_storer('var1').table
/var1/table (Table(12626172,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> B.get_storer('var3').levels
['x', 'y', 'date']

>>> B.get_storer('var3').attrs
/var3._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var3'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> B.get_storer('var3').table
/var3/table (Table(6313086,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> print Atemp
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1temp.h5
/mergev13            frame_table  (shape->823446)

Since the chunksize is 500000 and the shape of the node in Atemp is 823446, at least one chunk was merged successfully. But I cannot figure out where the error is coming from, and I have run out of clues as to where exactly it's going wrong. Any help is very much appreciated.

EDIT

By reducing the chunksize of my test store, I get the same error. Not good, of course, but it now gives me the possibility to share. Click here for the code + HDFStores.


Solution

The merged frame possibly has no rows. Appending a zero-length frame is an error (though the message should be more informative).

Check the length before appending:

df = pd.merge(a, b, left_index=True, right_index=True, how='inner')

if len(df):
    Atemp.append('mergev46', df)
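To see why the guard matters, here is a minimal sketch with made-up tiny frames (not your actual stores): when two chunks' indices happen not to overlap, the inner merge yields a zero-row frame, which is exactly the case that blows up on append.

```python
import pandas as pd

# Hypothetical tiny frames standing in for two chunks whose
# index ranges do not overlap.
a = pd.DataFrame({'var1': [1.0, 2.0]}, index=[10, 11])
b = pd.DataFrame({'var3': [3.0, 4.0]}, index=[20, 21])

# Inner merge on the index: no common labels, so the result is empty.
df = pd.merge(a, b, left_index=True, right_index=True, how='inner')
print(len(df))  # 0 -> skip the append for this chunk pair
```

With the chunked double loop, many (a, b) pairs will cover disjoint index ranges, so empty merges are the common case rather than the exception; the `if len(df):` guard simply skips them.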

Results with your provided dataset

<class 'pandas.io.pytables.HDFStore'>
File path: var4.h5
/var4            frame_table  (shape->1334)
<class 'pandas.io.pytables.HDFStore'>
File path: var6.h5
/var6            frame_table  (shape->667)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1334 entries, (928, 310, 2006001) to (1000, 238, 2006361)
Data columns (total 1 columns):
var4    1334  non-null values
dtypes: float64(1)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 667 entries, (928, 310, 2006001) to (1000, 238, 2006353)
Data columns (total 1 columns):
var6    667  non-null values
dtypes: float64(1)
<class 'pandas.io.pytables.HDFStore'>
File path: var4temp.h5
/mergev46            frame_table  (shape->977)

FYI, you should close the files when you are done with them:

Closing remaining open files: var6.h5... done var4.h5... done var4temp.h5... done
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow