Building on Anthony Scopatz's approach, here is a working solution.
def recordStringInHDF5(h5file, group, nodename, s, complevel=5, complib='zlib'):
    """Store a unicode string in an HDF5 file as a compressed CArray of UTF-8 bytes.

    Parameters
    ----------
    h5file : tables.File
        An open PyTables file handle.
    group : tables.Group
        Parent group under which the node is created.
    nodename : str
        Name of the new CArray node.
    s : str
        The (unicode) string to store; it is encoded as UTF-8.
    complevel : int, optional
        zlib-style compression level passed to ``tables.Filters`` (default 5).
    complib : str, optional
        Compression library name passed to ``tables.Filters`` (default 'zlib').

    Returns
    -------
    tables.CArray
        The newly created array node holding the encoded bytes.
    """
    # np.frombuffer replaces the deprecated np.fromstring; a read-only view
    # over the encoded bytes is fine because we only copy it into the CArray.
    # (Renamed from `bytes`, which shadowed the builtin.)
    encoded = np.frombuffer(s.encode('utf-8'), np.uint8)
    atom = pt.UInt8Atom()
    filters = pt.Filters(complevel=complevel, complib=complib)
    ca = h5file.create_carray(group, nodename, atom, shape=(len(encoded),),
                              filters=filters)
    ca[:] = encoded
    return ca
def retrieveStringFromHDF5(node):
    """Read back a string stored by :func:`recordStringInHDF5`.

    Parameters
    ----------
    node : tables.CArray
        A node whose ``read()`` returns a uint8 ndarray of UTF-8 bytes.

    Returns
    -------
    str
        The decoded unicode string.
    """
    # Python 3 fix: `unicode` no longer exists and ndarray.tostring is
    # deprecated — tobytes() + str.decode is the modern equivalent.
    return node.read().tobytes().decode('utf-8')
If I run this:
>>> h5file = pt.openFile("test1.h5",'w')
>>> recordStringInHDF5(h5file, h5file.root, 'mrtamb',
u'\u266b Hey Mr. Tambourine Man \u266b')
/mrtamb (CArray(30,), shuffle, zlib(5)) ''
atom := UInt8Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := (65536,)
>>> h5file.flush()
>>> h5file.close()
>>> h5file = pt.openFile("test1.h5")
>>> print retrieveStringFromHDF5(h5file.root.mrtamb)
♫ Hey Mr. Tambourine Man ♫
I've been able to run this with strings in the 300kB range and gotten good compression ratios.