How to design tables in Cassandra where rows have to be looked up by a list<varchar>?

https://stackoverflow.com//questions/24017395

21-12-2019

Question

Given that I have the following objects to persist in Cassandra:

Array of Foo:

{
    "id":1,
    "name": "this is a name",
    "bundleFields" : [
        "bundleByMe",
        "me2",
        "me also"
    ]
},
{
    "id":2,
    "name": "anotherName",
    "bundleFields" : [
        "bundleByMe",
        "me2",
        "me also"
    ]
},
{
    "id":3,
    "name": "thirdName",
    "bundleFields" : [
        "differentBundleCriteria"
    ]
}

I want to run a query like SELECT * FROM Foo WHERE bundleFields = ["...", "..."].

This does not work, because bundleFields is a list<varchar> column and is not part of the PRIMARY KEY, so it cannot be used for this kind of lookup.

This is the schema I currently have:

CREATE TABLE IF NOT EXISTS Foo (
    id int,
    name varchar,
    bundleFields list<varchar>,
    PRIMARY KEY(id)
);

The only solution I can imagine is another table whose PRIMARY KEY contains the concatenated values of the bundleFields array, which would allow a lookup by the bundleString:

CREATE TABLE IF NOT EXISTS fooByBundleString (
    bundleString varchar,
    fooId int,
    PRIMARY KEY(bundleString)
);

Is this the recommended approach to this problem in Cassandra?

The idea of having to serialize/deserialize the bundleFields array does not feel "right" to me.

Thanks for any advice!

Edit: As @rs_atl suggested, the correct DDL for the table fooByBundleString should be (note the additional fooId in the PRIMARY KEY):

CREATE TABLE IF NOT EXISTS fooByBundleString (
    bundleString varchar,
    fooId int,
    PRIMARY KEY(bundleString, fooId)
);

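This makes the table usable as a covering index, since with bundleString alone as the PRIMARY KEY it would not be possible to store the same bundleString for different fooIds (each insert would overwrite the previous row).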

Solution

Creating an index table as you've described is the correct solution. However, it should be a covering index, meaning you'll want to duplicate into it any values you actually need returned by your query. Otherwise you'll end up doing a distributed join in your application (one extra lookup in Foo per matching fooId), which is very expensive. In general, prefer denormalized data models to normalized relational models in Cassandra. This is essentially the same thing you have to do in your favorite RDBMS to make queries fast. The difference is that you have to maintain the index in your application; Cassandra won't do it for you.
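
As a rough illustration of what "manage the index in your application" could look like, here is a minimal Python sketch. It assumes the index table also carries a copy of name (that extra column is what would make it covering for a query that returns names), and it builds the bundleString by JSON-encoding the list instead of plain concatenation so that element boundaries stay unambiguous. The helper names (bundle_key, write_statements, read_statement) and the %s placeholder style are illustrative assumptions, not part of the original question or of any particular driver setup; the sketch only builds statement/parameter pairs and does not talk to a cluster.

```python
import json

# Assumed variant of the index table: `name` is duplicated from Foo so that
# a lookup by bundle does not need a second read against Foo.
CREATE_INDEX_TABLE = """
CREATE TABLE IF NOT EXISTS fooByBundleString (
    bundleString varchar,
    fooId int,
    name varchar,
    PRIMARY KEY (bundleString, fooId)
)
"""

INSERT_FOO = "INSERT INTO Foo (id, name, bundleFields) VALUES (%s, %s, %s)"
INSERT_INDEX = (
    "INSERT INTO fooByBundleString (bundleString, fooId, name) "
    "VALUES (%s, %s, %s)"
)
SELECT_BY_BUNDLE = (
    "SELECT fooId, name FROM fooByBundleString WHERE bundleString = %s"
)


def bundle_key(bundle_fields):
    """Serialize the list into a deterministic lookup key (hypothetical helper).

    JSON encoding keeps ["a,b"] and ["a", "b"] distinct, which a plain
    ",".join() would not. Element order is preserved, so lists with the same
    values in a different order produce different keys.
    """
    return json.dumps(list(bundle_fields), separators=(",", ":"),
                      ensure_ascii=False)


def write_statements(foo_id, name, bundle_fields):
    """Return the two writes the application has to issue for one Foo."""
    fields = list(bundle_fields)
    return [
        (INSERT_FOO, (foo_id, name, fields)),
        (INSERT_INDEX, (bundle_key(fields), foo_id, name)),
    ]


def read_statement(bundle_fields):
    """Return the single-partition lookup by bundle."""
    return (SELECT_BY_BUNDLE, (bundle_key(bundle_fields),))
```

Each (statement, parameters) pair would be handed to whatever driver the application uses. Because the two inserts touch different tables, they can get out of sync if one of them fails; wrapping them in a logged batch is one way to have Cassandra apply both or neither, at some write cost.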

Licensed under: CC-BY-SA with attribution
