PyTables dtype alignment issue

https://stackoverflow.com/questions/21926238

14-10-2022
Question

Consider the following code:

import os
import numpy as np
import tables as tb

# Pass the field-names and their respective datatypes as
# a description to the table
dt = np.dtype([('doc_id', 'u4'), ('word', 'u4'), 
    ('tfidf', 'f4')], align=True)

# Open a h5 file and create a table
f = tb.openFile('corpus.h5', 'w')
t = f.createTable(f.root, 'table', dt, 'train set',
    filters=tb.Filters(5, 'blosc'))

r = t.row
for i in xrange(20):
    r['doc_id'] = i
    r['word'] = np.random.randint(1000000)
    r['tfidf'] = np.random.rand()
    r.append()
t.flush()

# structured array from table
sa = t[:]

f.close()
os.remove('corpus.h5')

I have passed in an aligned dtype object, but when I observe sa, I get the following:

print dt
print "aligned?", dt.isalignedstruct
print
print sa.dtype
print "aligned?", sa.dtype.isalignedstruct

>>> 

    {'names':['doc_id','word','tfidf'], 'formats':['<u4','<u4','<f4'], 'offsets':[0,4,8], 'itemsize':12, 'aligned':True}
    aligned? True

    [('doc_id', '<u4'), ('word', '<u4'), ('tfidf', '<f4')]
    aligned? False

The structured array is not aligned. Is there no current way to enforce alignment in PyTables, or what am I doing wrong?

Edit: I've noticed my question is similar to this one. I copied and tried the answer provided there, but it doesn't work either.

Edit2: (See Joel Vroom's answer below)

I've replicated Joel's answer and tested through Cython whether the array is truly unpacked. It turns out it is:

In [1]: %load_ext cythonmagic

In [2]: %%cython -f -c=-O3
   ...: import numpy as np
   ...: cimport numpy as np
   ...: import tables as tb
   ...: f = tb.openFile("corpus.h5", "r")
   ...: t = f.root.table
   ...: cdef struct Word: # notice how this is not packed
   ...:     np.uint32_t doc_id, word
   ...:     np.float32_t tfidf
   ...: def main(): # <-- np arrays in Cython have to be locally declared, so put array in a function
   ...:     cdef np.ndarray[Word] sa = t[:3]
   ...:     print sa
   ...:     print "aligned?", sa.dtype.isalignedstruct
   ...: main()
   ...: f.close()
   ...: 
[(0L, 232880L, 0.2658001184463501) (1L, 605285L, 0.9921777248382568) (2L, 86609L, 0.5266860723495483)]
aligned? False

Solution

Currently there is no way to align data in PyTables :(
In practice I have done one of two things to get around this:

1. I perform one extra step: np.require(sa, dtype=dt, requirements='ACO'), or
2. I arrange the fields in my dtype description such that they are aligned.
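
The first workaround can be sketched as follows. This is a minimal example rather than code from the original answer: it assumes Python 3 and a recent NumPy, and the array sa here is a hypothetical stand-in for what t[:] returns, built with a packed dtype whose layout differs from the aligned one.

```python
import numpy as np

# Hypothetical field list for illustration only.
fields = [('f1', np.bool_), ('f2', '<i4'), ('f3', '<f8')]
packed_dt = np.dtype(fields)               # no padding between fields
aligned_dt = np.dtype(fields, align=True)  # padding inserted by NumPy

# Stand-in for the structured array read back from the table.
sa = np.zeros(3, dtype=packed_dt)
sa['f1'] = [True, False, True]
sa['f2'] = [1, 2, 3]
sa['f3'] = [0.5, 1.5, 2.5]

# 'A' = aligned, 'C' = C-contiguous, 'O' = owns its data.
aligned_sa = np.require(sa, dtype=aligned_dt, requirements='ACO')

print(packed_dt.itemsize, aligned_dt.itemsize)
print(aligned_sa.dtype.isalignedstruct)
```

Because the two layouts differ, this step copies the data, so it costs one extra pass over the array.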

As an example for the 2nd option, suppose I have the following dtype:
dt = np.dtype([('f1', np.bool_), ('f2', '<i4'), ('f3', '<f8')], align=True)

If you print dt.descr you will see that a void space has been added to align the data:
dt.descr >>> [('f1', '|b1'), ('', '|V3'), ('f2', '<i4'), ('f3', '<f8')]

But if I order my dtype like this (largest to smallest bytes):
dt = np.dtype([('f3', '<f8'), ('f2', '<i4'), ('f1', np.bool_)])
the fields now sit at aligned offsets regardless of whether I specify align=True or align=False.

Someone please correct me if I am wrong, but even though dt.isalignedstruct is False, a dtype ordered as shown above is technically aligned. This has worked for me in applications where I need to send aligned data to C.
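
One way to check this claim is to compare the field offsets and the item size of the reordered dtype with and without align=True. The sketch below assumes Python 3 and a recent NumPy; the helper offsets is a hypothetical name introduced for illustration.

```python
import numpy as np

fields = [('f3', '<f8'), ('f2', '<i4'), ('f1', np.bool_)]
packed = np.dtype(fields)
aligned = np.dtype(fields, align=True)

def offsets(dtype):
    """Return the byte offset of each field, in declaration order."""
    return [dtype.fields[name][1] for name in dtype.names]

print(offsets(packed), packed.itemsize)
print(offsets(aligned), aligned.itemsize)
```

The field offsets come out identical, but the item sizes do not: align=True also pads the end of each record (16 bytes instead of 13 here), which matters when the C side expects an array of structs with trailing padding.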

In the example you provided, sa.dtype.isalignedstruct is False, but given that
dt.descr = [('doc_id', '<u4'), ('word', '<u4'), ('tfidf', '<f4')] and
sa.dtype.descr = [('doc_id', '<u4'), ('word', '<u4'), ('tfidf', '<f4')],
the sa array is aligned (no void spaces were added to the descr).
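
When the two layouts really are byte-for-byte identical, as in the question's dtype, the array can be relabelled with the aligned dtype through a view instead of a copy. This is only a sketch, assuming Python 3 and a recent NumPy; sa is again a hypothetical stand-in for t[:], and same_layout is a name introduced here.

```python
import numpy as np

fields = [('doc_id', 'u4'), ('word', 'u4'), ('tfidf', 'f4')]
dt = np.dtype(fields, align=True)

# Stand-in for the array PyTables returns: same fields, aligned flag not set.
sa = np.zeros(3, dtype=np.dtype(fields))
sa['doc_id'] = [0, 1, 2]

# The layouts match if every field offset and the item size agree.
same_layout = (
    sa.dtype.itemsize == dt.itemsize
    and all(sa.dtype.fields[n][1] == dt.fields[n][1] for n in dt.names)
)

if same_layout:
    viewed = sa.view(dt)  # reinterprets the same buffer, no copy
    print(viewed.dtype.isalignedstruct)
    print(np.shares_memory(sa, viewed))
```

If same_layout is False, a view would misinterpret the bytes, so a converting copy such as np.require is the safer choice.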

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow