Question

I'm looking at the data reading recipes and examples in the Astyanax documentation. Some of them (e.g. querying all rows with a callback) include

setRepeatLastToken(false)

Can someone explain what this is used for? When should I use it? It looks like it defaults to (true).

Link to javadoc: http://netflix.github.io/astyanax/javadoc/com/netflix/astyanax/query/AllRowsQuery.html#setRepeatLastToken(boolean)

The source code for com.netflix.astyanax.query.AllRowsQuery includes the following comment:

 * There are a few important implementation details that need to be considered.
 * This implementation assumes the random partitioner is used. Consequently the
 * KeyRange query is done using tokens and not row keys. This is done because
 * when using the random partitioner tokens are sorted while keys are not.
 * However, because multiple keys could potentially map to the same token each
 * incremental query to Cassandra will repeat the last token from the previous
 * response. This will ensure that no keys are skipped. This does however have
 * to very important implications. First, the last and potentially more (if they
 * have the same token) row keys from the previous response will repeat. Second,
 * if a range of repeating tokens is larger than the block size then the code
 * will enter an infinite loop. This can be mitigated by selecting a block size
 * that is large enough so that the likelyhood of this happening is very low.
 * Also, if your application can tolerate the potential for skipped row keys
 * then call setRepeatLastToken(false) to turn off this features.

I understand the query is done based on a token range instead of a key range. But why would rows potentially be skipped if the token wasn't repeated?


Solution

The source code comment pretty much explains the functionality of setRepeatLastToken(boolean). But here are the details:

According to this post, Cassandra generates tokens from row keys with either MD5 (RandomPartitioner) or MurmurHash3 (Murmur3Partitioner, the default since Cassandra 1.2). Both hashes are fast, but distinct keys can collide on the same token value. Because of that, multiple rows may be stored under the same token, which becomes more likely as the data set grows.

Cassandra stores the data on nodes based on the tokens. When using the random partitioner, the data retrieval is done in token order (not key order). This makes sense because the records are read from the same node(s) in sequence and generate less traffic than retrieving records from random nodes in the cluster.
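To make "token order, not key order" concrete, here is a minimal sketch in plain Java. It assumes the RandomPartitioner-style token is the absolute value of the key's MD5 digest read as a big integer; the class and method names (Md5TokenSketch, token, sortByToken) are made up for illustration and are not part of Cassandra or Astyanax.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class Md5TokenSketch {

    /** Assumed token derivation: |MD5(key)| interpreted as a big integer. */
    static BigInteger token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Returns a copy of the keys ordered by token, i.e. the order a token-range scan would visit them. */
    static List<String> sortByToken(List<String> rowKeys) {
        List<String> sorted = new ArrayList<>(rowKeys);
        sorted.sort(Comparator.comparing(Md5TokenSketch::token));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("user-1", "user-2", "user-3", "user-4");
        // The printed order follows the tokens and generally differs from the key order.
        for (String key : sortByToken(keys)) {
            System.out.println(token(key) + "  " + key);
        }
    }
}
```

Since the hash scatters keys across the token space, paging through all rows has to track the last token seen rather than the last key.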

When reading from Cassandra using Astyanax with paging, a page (block) boundary may fall in the middle of a set of rows with the same token. When the request for the next page comes, Astyanax needs to know whether to start with the next token (and possibly miss the rest of the rows with the same token that didn't fit into the last page) or repeat the last token to make sure all the rows with that token are read (but repeating one or possibly more rows from the previous page).

The code comment also warns that if the page size is so small that a page holds only rows with the same token, the code may enter an infinite loop when setRepeatLastToken is set to true, because every request for the next page returns the same rows again.
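The trade-off can be reproduced with a small in-memory simulation. This is only a sketch of the paging idea, not Astyanax's actual implementation: the Row class, fetchPage, scan, and the sample tokens are all hypothetical, and the stall is reported with an exception instead of looping forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TokenPagingSketch {

    /** One stored row: its token and its row key. */
    static final class Row {
        final long token;
        final String key;
        Row(long token, String key) { this.token = token; this.key = key; }
    }

    /** Sample data in token order; keys b, c and d collide on token 20. */
    static List<Row> sampleRows() {
        return Arrays.asList(new Row(10, "a"), new Row(20, "b"), new Row(20, "c"),
                new Row(20, "d"), new Row(30, "e"));
    }

    /** Returns up to pageSize rows at (inclusive) or strictly after (exclusive) startToken. */
    static List<Row> fetchPage(List<Row> sortedRows, Long startToken, boolean inclusive, int pageSize) {
        List<Row> page = new ArrayList<>();
        for (Row row : sortedRows) {
            boolean inRange = startToken == null
                    || (inclusive ? row.token >= startToken : row.token > startToken);
            if (inRange) {
                page.add(row);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    /** Pages through all rows and returns the distinct keys seen, in order of first appearance. */
    static List<String> scan(List<Row> sortedRows, int pageSize, boolean repeatLastToken) {
        Set<String> seen = new LinkedHashSet<>(); // also hides the rows repeated across pages
        Long lastToken = null;
        while (true) {
            List<Row> page = fetchPage(sortedRows, lastToken, repeatLastToken, pageSize);
            for (Row row : page) seen.add(row.key);
            if (page.size() < pageSize) break; // short page: end of data
            long newLast = page.get(page.size() - 1).token;
            if (repeatLastToken && lastToken != null && newLast == lastToken) {
                // A full page of one token: the next request would return the same rows.
                throw new IllegalStateException("stalled on token " + newLast);
            }
            lastToken = newLast;
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<Row> rows = sampleRows();
        System.out.println(scan(rows, 2, false)); // prints [a, b, e]: c and d are skipped
        System.out.println(scan(rows, 4, true));  // prints [a, b, c, d, e]
        try {
            scan(rows, 2, true);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());   // prints stalled on token 20
        }
    }
}
```

In this model, a run of rows sharing one token has to be shorter than the page size for the repeating scan to get past it, which is why the source comment recommends a generous block size.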

I hope this helps anyone else that might be wondering about this feature.

Licensed under: CC-BY-SA with attribution