Creating a custom index on a collection using CQL 3.0

https://stackoverflow.com/questions/20434036

30-08-2022

Question

I have been looking at the CQL 3.0 data modelling documentation which describes a column family of songs with tags, created like this:

CREATE TABLE songs (
    id uuid PRIMARY KEY,
    title text,
    tags set<text>
);

I would like to get a list of all songs which have a specific tag, so I need to add an appropriate index.

I can create an index on the title column easily enough, but if I try to index the tags column, which is a collection, like this:

CREATE INDEX ON songs ( tags );

I get the following error from the DataStax Java driver 1.0.4:

Exception in thread "main" com.datastax.driver.core.exceptions.InvalidQueryException: Indexes on collections are no yet supported
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:269)

It looks like this may be fixed in a later version of Cassandra (2.1), according to JIRA issue CASSANDRA-4511. However, I am currently using Apache Cassandra 1.2.11 and do not want to upgrade yet. According to issue CASSANDRA-5615, though, Cassandra 1.2.6 added support for custom indexes on collections.

The problem is, the only documentation available states:

Cassandra supports creating a custom index, which is for internal use and beyond the scope of this document.

But, it does suggest the following syntax:

CREATE CUSTOM INDEX ON songs ( tags ) USING 'class_name';

What is the class_name that is specified in this CQL statement?

Is there a better way of indexing the tags so that I can query the songs table for a list of songs that have a specific tag?

Solution

In my view, the way you are trying to do this isn't the best way to model it in Cassandra. You build models based on your queries, not your data. If you need to find songs by tag, then you make another table for that and duplicate the data. Something like ...

CREATE TABLE tagged_songs (
  tag varchar,
  song_id uuid,
  song_title varchar,
  -- ... anything else you might need with your songs here ...
  PRIMARY KEY ((tag), song_id)
);
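
With a table shaped like this, the lookup by tag becomes a read of a single partition. A minimal sketch, where the tag value 'blues' is only an assumed example:

-- Fetch every song stored under one tag (the partition key)
SELECT song_id, song_title
FROM tagged_songs
WHERE tag = 'blues';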

The premise in Cassandra is that storage is cheap. Duplicate your data to meet your queries. Writes are fast, and writing the same data 3, 4, or 10 times is normally fine.

You also want to store your song title and any other info you need in this table. You don't want to grab a load of IDs and try to join on them when reading. This isn't a relational DB.

When someone tags a song, you might want to insert the tag into the set as you do at present, AND add it to the tagged_songs table too. Querying for all songs with tag X is then basically O(1): a lookup of a single partition by its key.
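
The double write could be sketched roughly as follows. The UUID, song title, and tag are made-up illustration values, and wrapping the two statements in a batch is one possible choice rather than a requirement:

BEGIN BATCH
  -- Keep the tag in the song's own set, as before
  UPDATE songs
    SET tags = tags + {'blues'}
    WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
  -- Duplicate the song under its tag so it can be found by tag
  INSERT INTO tagged_songs (tag, song_id, song_title)
    VALUES ('blues', 62c36092-82a1-3a00-93d1-46196ee77204, 'La Grange');
APPLY BATCH;

Removing a tag would need the mirror image: remove it from the set in songs and delete the matching row from tagged_songs.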

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow