Question

Hello Oracles of StackOverflow,

This is the first time I've asked a question on Stack Overflow, so feel free to throw your cabbages at me (or correct the way I should be asking my question).

I have this problem. I'm using HDF5 to store massive quantities of cookie information.

My data is structured in the following way:

CookieID -> Event -> Key_value Pair

There are multiple events for each cookieID, but only one key_value pair per event.

I'd like to know the best way to store this in an HDF5 file.

Currently, I'm storing each cookie as a separate table within a group in the HDF5 file, using the cookieID as the name of the table. Unfortunately for me, with 10,000,000 cookies, HDF5 (or specifically PyTables) doesn't approve of this type of storage.

Specifically, it throws this error:

group ``/CookieData`` is exceeding the recommended maximum number of children (16384)

I'm wondering if you could recommend the best way of storing this information.

Should I create a flat table? Should I keep this method? Is there something else I can do?

Help is appreciated. Thanks for reading.


Solution

Several hours of research later, I've discovered that what I was attempting to do was categorically impossible.

The following link gives details as to the impossibility of using HDF5 with variable-length nested children.

I've decided to go with a flat file for the time being, and I hope it turns out to be more efficient than a database store. The drawback of a flat file is that I have to replicate values in it that would otherwise not need to be repeated.
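For anyone wondering what the flat layout might look like, here is a minimal sketch, not my production code. It assumes the data arrives as a dict mapping each cookieID to a list of (event, key, value) tuples; the function names, column names, column widths, and the table name "events" are all made up for illustration. The first function shows the replication: the cookieID is repeated on every row. The second function writes those rows to a single PyTables table and indexes the cookie_id column, so there is one table instead of 10,000,000 children.

```python
def flatten_cookies(cookies):
    """Turn {cookie_id: [(event, key, value), ...]} into flat rows.

    The cookie_id is repeated on every row; that is the replication
    cost of the flat layout.
    """
    rows = []
    for cookie_id, events in cookies.items():
        for event, key, value in events:
            rows.append((cookie_id, event, key, value))
    return rows


def write_flat_table(path, rows):
    """Write flat rows into one table (hypothetical schema)."""
    # Imported lazily so flatten_cookies stays usable without PyTables.
    import tables

    # Assumed column widths; size them to fit the real data.
    description = {
        "cookie_id": tables.StringCol(32, pos=0),
        "event": tables.StringCol(32, pos=1),
        "key": tables.StringCol(32, pos=2),
        "value": tables.StringCol(64, pos=3),
    }
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "events", description, "Cookie events")
        row = table.row
        for cookie_id, event, key, value in rows:
            row["cookie_id"] = cookie_id.encode()
            row["event"] = event.encode()
            row["key"] = key.encode()
            row["value"] = value.encode()
            row.append()
        table.flush()
        # Index the cookie column so per-cookie queries can avoid a full scan.
        table.cols.cookie_id.create_index()
```

With this layout, all events for one cookie should be retrievable with a condition on cookie_id (for example via the table's `where` or `read_where` methods) rather than by opening a table named after the cookie.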

If anyone else has any better ideas, they would be appreciated.

Licensed under: CC-BY-SA with attribution