Cassandra (Pycassa/CQL) Return Partial Match

https://stackoverflow.com/questions/10048506

29-05-2021
|

Question

I'm trying to do a partial search through a column family in Cassandra similar to an SQL query like: SELECT * FROM columnfamily WHERE col = 'val*' where val* means any value matching at least the first three characters 'val'.

I've read datastax's documentation on the SELECT function, but can't seem to find any support for the partial WHERE criteria. Any ideas?

Solution

There is no wildcard support like this in Cassandra, but you can model your data in such a way that you could get the same end result.

You would take the column that you want to perform this query on and denormalize it into a second column family. This CF would have a single wide row with the column name as the value of the col you want to do the wild card query on. The column value for this CF could either be the row key for the original CF or some other representation of the original row.

Then you would use slicing to get out the values you care about. For example if this was the wide row to slice on:

 +---------+----------+--------+----------+---------+--------+----------+
 |  RowKey | aardvark | abacus | abacuses | abandon | accent | accident |
 |         +----------+--------+----------+---------+--------+----------+
 |         |          |        |          |         |        |          |
 |         |          |        |          |         |        |          |
 +---------+----------+-----------------------------+--------+----------+

Using CQL you could select out everything starting with 'aba*' using this query*:

  SELECT 'aba'..'abb' from some_cf where RowKey = some_row_key;

This would give you the columns for 'abacus', 'abacuses', and 'abandon'.

There are some things to be aware of with this strategy:

In the above example, if you have things with the same column_name you need to have some way to differentiate between them (otherwise inserting into the wide column family will clobber other valid values). One way that you could do this is by using a composite column of word:some_unique_value.
The above model only allows wild cards at the end of the string. Wild cards at the beginning of the string could also easily be handled with a few modifications. Wild cards in the middle of a string would be much more challenging.

Remember that Cassandra doesn't give you an easy way to do ad-hoc queries. Instead you need to figure out how you will be using the data and model your CFs accordingly. Take a look at this blog post from Ed Anuff on indexing data in Cassandra for more info on modeling data like this.

*Note that the CQL syntax for slicing columns is changing in an upcoming release of Cassandra.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow