Question

I have data in the following form:

"blue red"
"blue magenta cyan"
"yellow red"
"black" 

Each row has at most 10 elements, but there can be thousands of distinct labels/categories/colors. I would like to store this data in a PyTables column so that I can run queries of the form:

`label in row`

For example, return all documents containing the blue label (the result would be the first two rows). What would be the most efficient way to achieve this, given that PyTables doesn't have a set data type?


Solution

The fact that you have at most N=10 labels per row is great, because it makes these kinds of comparisons feasible. What you should do is have 10 string columns, where each column holds one label. If a row has fewer than 10 labels, fill the remaining columns with blank strings.

This will allow you to write efficient query expressions that you can use in Table.where() and Table.read_where() commands [1]. Suppose the columns have the silly names 'col0', 'col1', etc. Because string comparison is exact in numexpr and because there is no native set type, you have to explicitly unroll equality comparisons:

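# `|` binds more tightly than `==`, so each comparison needs its own parentheses.
# Under Python 3 the literal may need to be a bytes literal, e.g. b'blue'.
cond = ("(col0 == 'blue') | (col1 == 'blue') | (col2 == 'blue') | (col3 == 'blue') | "
        "(col4 == 'blue') | (col5 == 'blue') | (col6 == 'blue') | (col7 == 'blue') | "
        "(col8 == 'blue') | (col9 == 'blue')")
rows = [row[:] for row in table.where(cond)]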

Luckily, it is easy to programmatically construct the cond string:

cond = " | ".join(["(col{0} == 'blue')".format(i) for i in range(10)])
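If the label comes from user input, one way to avoid quoting it inside the condition string is the `condvars` argument that `Table.where()` and `Table.read_where()` accept. The following is a minimal sketch; the helper name `label_condition` and the variable name `wanted` are made up for illustration, and the bytes value in the usage comment assumes Python 3, where string columns hold bytes.

```python
def label_condition(n_cols=10, var="wanted"):
    """Build a condition that compares every label column to one variable.

    Each comparison is parenthesized because `|` binds more tightly
    than `==` in the expression syntax.
    """
    return " | ".join("(col{0} == {1})".format(i, var) for i in range(n_cols))


cond = label_condition(3)
print(cond)  # (col0 == wanted) | (col1 == wanted) | (col2 == wanted)

# Hypothetical usage against an open table with columns col0..col9:
#   rows = table.read_where(label_condition(), condvars={"wanted": b"blue"})
```

The condition string stays the same for every label; only the value passed through `condvars` changes.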

However, there is even more that you can do. String comparison is bulky and slow. This is because all of your strings have to have the same size, which means that your column size is determined by your longest label. This leads to a lot of wasted space. Instead, you should keep a mapping between your labels and integers. Then you can store the integers and compare on them very quickly. For example, using list indexes:

labels = ['', 'blue', 'red', 'yellow', ...]
labels_to_idx = dict(zip(labels, range(len(labels))))
# The columns now hold integers, so the index is inserted without quotes.
cond = " | ".join(["(col{0} == {1})".format(i, labels_to_idx['blue'])
                   for i in range(10)])
rows = [[labels[x] for x in row[:]] for row in table.where(cond)]
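The encoding step itself is not shown above, so here is a rough sketch of how raw rows could be turned into ten integer codes and back. It is plain Python with no PyTables involved; the function names `build_index`, `encode` and `decode` are hypothetical, and reserving index 0 for the empty (padding) label is an assumption that matches the list above.

```python
N_COLS = 10  # maximum number of labels per row


def build_index(docs):
    """Assign each distinct label an integer; 0 is reserved for padding."""
    labels = [""]
    label_to_idx = {"": 0}
    for doc in docs:
        for word in doc.split():
            if word not in label_to_idx:
                label_to_idx[word] = len(labels)
                labels.append(word)
    return labels, label_to_idx


def encode(doc, label_to_idx, n_cols=N_COLS):
    """Turn a space-separated label string into n_cols integer codes."""
    ids = [label_to_idx[word] for word in doc.split()]
    if len(ids) > n_cols:
        raise ValueError("row has more than {0} labels".format(n_cols))
    return ids + [0] * (n_cols - len(ids))


def decode(row, labels):
    """Map integer codes back to labels, dropping the padding entries."""
    return [labels[i] for i in row if i != 0]


docs = ["blue red", "blue magenta cyan", "yellow red", "black"]
labels, label_to_idx = build_index(docs)
encoded = [encode(doc, label_to_idx) for doc in docs]
```

Each list in `encoded` is one table row, with one value per column `col0` to `col9`.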

You can even store the labels list in PyTables as an EArray, so you are sure you always get the same index ordering while still being able to extend the list of allowed labels.

Furthermore, since labels will be reused, especially the empty string label, I highly recommend that you enable compression.

Unfortunately, because PyTables indexes individual columns rather than whole tables, a single index cannot cover these multi-column queries.

With compression and mapping to/from integers, this is probably the fastest and smallest you can get.
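Putting the pieces together might look roughly like the sketch below. The file name `labels_demo.h5`, the node names `docs` and `labels`, the `int32` column type, the 16-byte label width and the zlib level are all illustrative choices, not something the approach requires. The file is opened with the in-memory HDF5 driver only so that the example leaves nothing on disk.

```python
import numpy as np
import tables

N_COLS = 10
labels = ["", "blue", "red", "yellow", "magenta", "cyan", "black"]  # 0 = padding
label_to_idx = {label: i for i, label in enumerate(labels)}
docs = ["blue red", "blue magenta cyan", "yellow red", "black"]

# One record per document, ten integer columns, zero-padded.
dtype = np.dtype([("col{0}".format(i), np.int32) for i in range(N_COLS)])
records = np.zeros(len(docs), dtype=dtype)
for r, doc in enumerate(docs):
    for c, word in enumerate(doc.split()):
        records["col{0}".format(c)][r] = label_to_idx[word]

filters = tables.Filters(complevel=5, complib="zlib")  # compression settings

with tables.open_file("labels_demo.h5", mode="w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as h5:
    table = h5.create_table("/", "docs", description=dtype, filters=filters)
    table.append(records)

    # Extendable array of fixed-width byte strings holding the label list.
    label_store = h5.create_earray("/", "labels",
                                   atom=tables.StringAtom(itemsize=16),
                                   shape=(0,))
    label_store.append([label.encode() for label in labels])

    # Unrolled integer comparison; the value is supplied through condvars.
    cond = " | ".join("(col{0} == wanted)".format(i) for i in range(N_COLS))
    hit_rows = table.get_where_list(
        cond, condvars={"wanted": label_to_idx["blue"]})

    # Decode the matching rows using the label list read back from the file.
    stored = [s.decode() for s in label_store.read()]
    hits = [[stored[x] for x in table[r] if x != 0] for r in hit_rows]

print(hits)
```

New labels can later be added with another `label_store.append(...)` call; existing indexes stay valid because the array only grows at the end.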

  1. http://pytables.github.io/usersguide/libref/structured_storage.html?highlight=read_where#tables.Table.read_where
License: CC BY-SA with attribution
Not affiliated with StackOverflow