Set type in pytables

Question

The fact that you have at most N=10 labels per row is great. This means that doing these kinds of comparisons is possible. What you should do is have 10 string columns where each column is a label. If you have fewer than 10 labels for a row, then you fill in the remaining columns with blank strings.

This will allow you to write efficient query expressions that you can use in Table.where() and Table.read_where() commands [1]. Suppose the columns have the silly names 'col0', 'col1', etc. Because string comparison is exact in numexpr and because there is no native set type, you have to explicitly unroll equality comparisons:

# Each comparison is parenthesized because | binds more tightly than ==.
cond = ("(col0 == 'blue') | (col1 == 'blue') | (col2 == 'blue') | "
        "(col3 == 'blue') | (col4 == 'blue') | (col5 == 'blue') | "
        "(col6 == 'blue') | (col7 == 'blue') | (col8 == 'blue') | "
        "(col9 == 'blue')")
rows = [row[:] for row in table.where(cond)]

Luckily, it is easy to programmatically construct the cond string:

cond = " | ".join(["(col{0} == 'blue')".format(i) for i in range(10)])
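If you need this for more than one label, the construction above can be wrapped in a small helper. This is only a sketch: the function name any_column_equals and its ncols and prefix parameters are made up for illustration and are not part of PyTables.

```python
def any_column_equals(value, ncols=10, prefix="col"):
    """Build a condition that is true when any label column equals `value`."""
    # repr() quotes strings and leaves integers bare. Each comparison is
    # parenthesized because `|` binds more tightly than `==`.
    return " | ".join(
        "({0}{1} == {2!r})".format(prefix, i, value) for i in range(ncols)
    )


# Three columns only, to keep the printed condition short.
cond = any_column_equals("blue", ncols=3)
print(cond)
```

The resulting string can be passed to Table.where() or Table.read_where() exactly like the hand-written one.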

However, there is even more that you can do. String comparison is bulky and slow. This is because all of your strings must have the same size, which means that your column size is determined by your longest label. This leads to a lot of wasted space. Instead, you should keep a mapping between your labels and integers. Then you can store the integers and compare on them very quickly. For example, using list indexes:

labels = ['', 'blue', 'red', 'yellow', ...]
labels_to_idx = dict(zip(labels, range(len(labels))))
# The index is an integer, so it must not be quoted in the condition.
cond = " | ".join(["(col{0} == {1})".format(i, labels_to_idx['blue'])
                   for i in range(10)])
rows = [[labels[x] for x in row[:]] for row in table.where(cond)]
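To make the mapping concrete, here is a rough sketch of how a row's labels could be turned into the ten integers you store, and back again. The short labels list and the encode and decode helpers are hypothetical. The sketch assumes index 0 is reserved for the blank padding label.

```python
MAX_LABELS = 10  # fixed number of label columns per row

labels = ["", "blue", "red", "yellow"]  # hypothetical; index 0 is the blank label
label_to_idx = {label: i for i, label in enumerate(labels)}


def encode(row_labels):
    """Turn a collection of labels into a fixed-width tuple of integer codes."""
    codes = sorted(label_to_idx[label] for label in set(row_labels))
    if len(codes) > MAX_LABELS:
        raise ValueError("too many labels for one row")
    # Pad with 0 (the blank label) up to the fixed width.
    return tuple(codes + [0] * (MAX_LABELS - len(codes)))


def decode(codes):
    """Recover the label set from a stored row, dropping the padding."""
    return {labels[c] for c in codes if c != 0}


stored = encode({"red", "blue"})
print(stored)
print(decode(stored))
```

Sorting the codes is not required for the query to work. It just makes rows with the same label set look identical on disk.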

You can even store the labels list in PyTables as an EArray so you are sure you always get the same index ordering while also being able to extend the list of allowed labels.

Furthermore, since labels will be reused, especially the empty string label, I highly recommend that you enable compression.
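Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes a PyTables 3.x install with NumPy. The file name, node names, int32 column type, 16-byte label size and zlib level 5 are arbitrary choices for illustration, not recommendations. The label index is passed through condvars rather than formatted into the string.

```python
import os
import tempfile

import numpy as np
import tables as tb

NCOLS = 10
labels = ["", "blue", "red", "yellow"]  # hypothetical label list
label_to_idx = {label: i for i, label in enumerate(labels)}

# One integer column per label slot (int32 is an assumption, chosen for safety).
row_dtype = np.dtype([("col{0}".format(i), np.int32) for i in range(NCOLS)])


def encode(row_labels):
    codes = [label_to_idx[label] for label in row_labels]
    return tuple(codes + [0] * (NCOLS - len(codes)))  # pad with the blank label


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels_demo.h5")
    filters = tb.Filters(complevel=5, complib="zlib")

    with tb.open_file(path, mode="w") as h5:
        # Store the label list so the index order travels with the data.
        stored = h5.create_earray(h5.root, "labels",
                                  atom=tb.StringAtom(itemsize=16), shape=(0,))
        stored.append(np.array(labels, dtype="S16"))

        table = h5.create_table(h5.root, "rows", description=row_dtype,
                                filters=filters)
        table.append([encode(["blue", "red"]),
                      encode(["yellow"]),
                      encode(["red"])])
        table.flush()

        # Read the labels back from the file instead of trusting the list above.
        labels_back = [s.decode("ascii") for s in stored.read()]

        cond = " | ".join("(col{0} == target)".format(i) for i in range(NCOLS))
        hits = table.read_where(cond, condvars={"target": label_to_idx["red"]})
        decoded = [[labels_back[x] for x in row if x != 0]
                   for row in hits.tolist()]

print(decoded)
```

New labels can later be added with another append() on the EArray, which leaves the existing indexes unchanged.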

Unfortunately, PyTables indexes individual columns rather than whole tables, so a single index cannot serve a query that spans all ten columns.

With compression and mapping to/from integers, this is probably the fastest and smallest you can get.

[1] http://pytables.github.io/usersguide/libref/structured_storage.html?highlight=read_where#tables.Table.read_where