Question

I'm having a bit of an issue with my application functionality integrating with Cassandra. I'm trying to create a content feed for my users. Users can create posts which, in turn, have the field user_id. I'm using Redis for the entire social graph and using Cassandra columns solely for objects. In Redis, user 1 has a set named user:1:followers with all of his/her follower ids. These follower ids correspond with the Cassandra ids in the users table and user_ids in the posts table.

My goal was originally to simply plug all of the user_ids from this Redis set into a query that would use FROM posts WHERE user_id IN (user_ids here) and grab all of the posts from the secondary index user_id. The issue is that Cassandra purposely does not support the IN operator in secondary indexes because that index would force Cassandra to search ALL of its nodes for that value. I'm left with only two options I can see: Either create a Redis list of user:1:follow_feed for the post IDs then search Cassandra's primary index for those posts in a single query, or keep it the way I have it now and run an individual query for every user_id in the user:1:follower set.

I'm really leaning against the first option because I already have tons and tons of graph data in Redis, and this option would add a new list for every user. The second way is far worse. I would put a massive read load on Cassandra and it would take a long time to run individual queries for a set of ids. I'm kind of stuck between a rock and a hard place, as far as I see it. Is there any way to query the secondary indexes with multiple values? If not, is there a more efficient way to load these content feeds (RAM and speed wise) compared to the options of more Redis lists or multiple Cassandra queries? Thanks in advance.

Solution

Without knowing the schema of the posts table (and preferably the others, as well), it's really hard to make any useful suggestions.

It's unclear to me why you need to have user_id be a secondary index, as opposed to your primary key.

In general it's quite useful to key content like posts off of the user that created it, since it allows you to do things like retrieve all of a user's posts (optionally over a given range, assuming they are chronologically sorted) very efficiently.

With Cassandra, if you find that a table can effectively answer some of the queries that you want to perform but not others, you are usually best off denormalizing: create another table with a different structure, so that each query stays within a single CQL partition and node.

CREATE TABLE posts (
  user_id int,
  post_id int,
  post_text text,
  PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

This table can answer queries such as:

 select * from posts where user_id = 1234;

 select * from posts where user_id = 1 and post_id = 53;

 select * from posts where user_id = 1 and post_id > 5321 and post_id < 5400;

The reverse clustering order on post_id makes retrieving the most recent posts efficient, because it places them at the beginning of the partition, physically, within the SSTable.
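To connect this back to the feed in the question, here is a minimal Python sketch of how newest-first results from several user partitions could be combined. It assumes each inner list came from a separate single-partition query (for example one with a LIMIT per followed user) and that post_id values are comparable across users, such as time-based ids. Neither assumption is confirmed by the question, and the function name merge_feed and the row layout are hypothetical.

```python
import heapq
from itertools import islice

def merge_feed(per_user_posts, limit):
    """Merge per-user result lists into one newest-first feed.

    Each inner list holds (post_id, user_id, post_text) tuples and must
    already be sorted by post_id descending, which is the order the
    clustering clause would return them in.
    """
    merged = heapq.merge(*per_user_posts, key=lambda row: row[0], reverse=True)
    # Only the first `limit` rows are pulled from the lazy merge.
    return list(islice(merged, limit))

# Hypothetical results of two single-partition queries.
user_1_posts = [(5400, 1, "d"), (5322, 1, "b")]
user_2_posts = [(5390, 2, "c"), (100, 2, "a")]
feed = merge_feed([user_1_posts, user_2_posts], limit=3)
```

Because every input is already sorted, the merge only needs to look at the head of each list, so building a short feed does not require reading whole partitions.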

In that example, user_id being the partition column means that all CQL rows with this user_id will be hashed to the same partition, and hence the same physical nodes, and eventually the same SSTables. That's why it's possible to:

  1. retrieve all posts with that user_id, as they are stored contiguously
  2. retrieve a slice of them by doing a ranged query on post_id
  3. retrieve a single post by supplying both the partition column (user_id) and the clustering column (post_id)
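The three access patterns above can be mimicked with a toy in-memory model in Python. This is only an illustration of the lookup shape, not of how Cassandra stores data; the sample rows and the function names are made up for the example.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for the posts table: partition key -> {clustering key -> text}.
posts = {
    1: {53: "hello", 5322: "lunch", 5350: "dinner", 5400: "late"},
    1234: {7: "first post"},
}

def all_posts(user_id):
    """Like: select * from posts where user_id = ?  (newest first)."""
    partition = posts.get(user_id, {})
    return [(pid, partition[pid]) for pid in sorted(partition, reverse=True)]

def post_slice(user_id, low, high):
    """Like: ... where user_id = ? and post_id > low and post_id < high."""
    partition = posts.get(user_id, {})
    ids = sorted(partition)
    start = bisect_right(ids, low)  # first id strictly greater than low
    stop = bisect_left(ids, high)   # first id >= high, excluded from the slice
    return [(pid, partition[pid]) for pid in reversed(ids[start:stop])]

def single_post(user_id, post_id):
    """Like: ... where user_id = ? and post_id = ?"""
    return posts.get(user_id, {}).get(post_id)
```

Every function starts from user_id, which mirrors the rule that the partition column has to be supplied before anything else can be looked up.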

In effect, this becomes a hashmap-of-a-hashmap lookup. The one major caveat, though, is that when using partition and clustering columns, you always need to supply the columns from left to right in your query, without skipping any. So in this case, that means you can't retrieve an individual post without knowing the user_id that the post_id belongs to. That is addressable in user code (by storing a reverse mapping and doing the lookup when necessary, or by encoding the user_id into the post_id that is passed around your application), but is definitely something to take into consideration.
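As a rough sketch of the second workaround, the Python below packs user_id and post_id into one opaque reference that the application can pass around and later split back into both key columns. The 32/32 bit layout and the names encode_post_ref and decode_post_ref are assumptions chosen for illustration; they fit the int columns in the example schema, but any reversible encoding would do.

```python
# Hypothetical layout: high bits hold user_id, low 32 bits hold post_id.
POST_BITS = 32
POST_MASK = (1 << POST_BITS) - 1
MAX_CQL_INT = (1 << 31) - 1  # largest value of a signed 32-bit CQL int

def encode_post_ref(user_id, post_id):
    """Pack both key columns into a single non-negative integer."""
    if not (0 <= user_id <= MAX_CQL_INT and 0 <= post_id <= MAX_CQL_INT):
        raise ValueError("ids must be non-negative 32-bit integers")
    return (user_id << POST_BITS) | post_id

def decode_post_ref(ref):
    """Recover (user_id, post_id) so both columns can be supplied in a query."""
    return ref >> POST_BITS, ref & POST_MASK

ref = encode_post_ref(1, 53)
user_id, post_id = decode_post_ref(ref)
```

With a reference like this, a link or cache entry that carries only the combined value still has enough information to run the single-post query, at the cost of exposing the owner's id inside the reference.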

Licensed under: CC-BY-SA with attribution