Azure Table Storage - How fast can I table scan?

https://stackoverflow.com/questions/4831989

27-10-2019

Question

Everyone warns not to query against anything other than RowKey or PartitionKey in Azure Table Storage (ATS), lest you be forced to table scan. For a while, this has paralyzed me into trying to come up with exactly the right PK and RK and creating pseudo-secondary indexes in other tables when I needed to query something else.

However, it occurs to me that I would commonly table scan in SQL Server when I thought appropriate.

So the question becomes: how fast can I table scan an Azure Table? Is this a constant in entities/second, or does it depend on record size, etc.? Are there rules of thumb for how many records is too many to table scan if you want a responsive application?

Solution

The issue with a table scan has to do with crossing partition boundaries. The level of performance you are guaranteed is explicitly set at the partition level. Therefore, when you run a full table scan, it's a) not very efficient and b) not covered by any performance guarantee. This is because the partitions themselves are placed on separate storage nodes, and when you run a cross-partition scan, you're consuming potentially massive amounts of resources (tying up multiple nodes simultaneously).

I believe that crossing these boundaries also results in continuation tokens, which require additional round trips to storage to retrieve the results. This reduces performance and increases transaction counts (and subsequently cost).

If the number of partitions/nodes you're crossing is fairly small, you likely won't notice any issues.

But please don't quote me on this. I'm not an expert on Azure Storage. It's actually the area of Azure I'm least knowledgeable about. :P

OTHER TIPS

I think Brent is 100% on the money, but if you still want to try it, I can only suggest running some tests to find out for yourself. Try to include the PartitionKey in your queries to avoid crossing partitions, because at the end of the day that's the performance killer.
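To make the "include the PartitionKey" advice concrete, here is a minimal sketch of building a `$filter` string that is always pinned to one partition. The helper names (`odata_quote`, `partition_scoped_filter`) and the sample property `Status` are hypothetical, made up for illustration; the only convention assumed is that OData string literals escape a single quote by doubling it.

```python
def odata_quote(value: str) -> str:
    """Quote a string literal for a $filter expression (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def partition_scoped_filter(partition_key: str, extra: str = "") -> str:
    """Build a $filter that restricts the query to a single partition.

    `extra` is an optional, already-formed filter on other properties.
    """
    base = f"PartitionKey eq {odata_quote(partition_key)}"
    return f"{base} and ({extra})" if extra else base


# Hypothetical example: open items for one customer partition.
print(partition_scoped_filter("customer-42", "Status eq 'open'"))
```

The extra condition still has to be evaluated against every entity in that partition, but the scan stays inside one partition instead of fanning out across the table.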

Azure tables are not optimized for table scans. Scanning the table might be acceptable for a long-running background job, but I wouldn't do it when a quick response is needed. With a table of any reasonable size you will have to handle continuation tokens as the query reaches a partition boundary or obtains 1k results.

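The Azure storage team has a great post that explains the scalability targets. As stated there at the time, the throughput target for a single table partition is 500 entities/sec, and the overall target for a storage account is 5,000 transactions/sec. These published targets have been raised since, so check the current figures.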
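Taking the 500 entities/sec figure quoted above at face value, a rough back-of-the-envelope estimate for scanning one partition looks like the sketch below. The function name and defaults are my own; the 1,000-entity page size is the per-request cap mentioned elsewhere in this thread. A target is a ceiling, not a promise, and the estimate ignores network latency and entity size, so treat the result as a best case.

```python
import math


def scan_estimate(entity_count, entities_per_sec=500, page_size=1000):
    """Return (best-case seconds, minimum round trips) for scanning one partition.

    entities_per_sec: assumed throughput ceiling for a single partition.
    page_size: assumed maximum number of entities returned per request.
    """
    seconds = entity_count / entities_per_sec
    round_trips = math.ceil(entity_count / page_size)
    return seconds, round_trips


seconds, round_trips = scan_estimate(1_000_000)
print(f"{seconds:.0f} s best case, at least {round_trips} requests")
```

Under these assumptions a million entities in a single partition cannot be scanned in less than about half an hour, which is why a scan is reasonable for a background job but not for an interactive request.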

The answer is pagination. Use top_size (the maximum number of results, or records, in a result) in conjunction with next_partition_key and next_row_key, the continuation tokens. That makes a significant, even factorial, difference in performance. For one, your results are statistically more likely to come from a single partition. In plain testing, result sets are grouped by the partition continuation key and not the row continuation key.

In other words, you also need to think about your UI or system output. Don't bother returning more than 10 to 20 results, 50 at most. The user probably won't use or examine any more.

Sounds foolish? Do a Google search for "dog" and notice that the search returns only 10 items. No more. The next records are available if you bother to hit 'continue'. Research suggests that almost no user ventures beyond that first page.
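One way to wire up that 'continue' link is to hand the two continuation values back to the client as a single opaque token and decode it on the next request. This is only a sketch: `encode_page_token` and `decode_page_token` are hypothetical helpers, not part of any Azure SDK, and the token is neither signed nor encrypted, so it should be validated before use.

```python
import base64
import json
from typing import Optional, Tuple


def encode_page_token(next_partition_key: Optional[str],
                      next_row_key: Optional[str]) -> str:
    """Pack the two continuation values into one URL-safe string."""
    payload = json.dumps([next_partition_key, next_row_key]).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_page_token(token: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Unpack a token from encode_page_token; an empty token means 'first page'."""
    if not token:
        return None, None
    pk, rk = json.loads(base64.urlsafe_b64decode(token.encode("ascii")))
    return pk, rk


token = encode_page_token("customer-42", "00017")
print(token, decode_page_token(token))
```

The server stays stateless: each page request carries everything needed to resume the query where the previous page stopped.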

The select option (returning a subset of the key-values) may also make a difference; for example, use select = "PartitionKey,RowKey" or 'Name', whatever minimum you need.
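For anyone calling the service without an SDK, the filter, select, and top options end up as query-string parameters. The sketch below builds such a query string with the standard library only. `build_query_string` is a hypothetical helper; `$filter`, `$select`, and `$top` are the OData option names, and passing the continuation values as `NextPartitionKey` and `NextRowKey` parameters is my understanding of the REST interface, so verify it against the service documentation.

```python
from urllib.parse import parse_qs, quote, urlencode


def build_query_string(filter_expr=None, select=None, top=None,
                       next_partition_key=None, next_row_key=None):
    """Build the query string for an entity query, projecting only `select`."""
    params = {}
    if filter_expr:
        params["$filter"] = filter_expr
    if select:
        params["$select"] = ",".join(select)  # keep the projection minimal
    if top is not None:
        params["$top"] = str(top)
    if next_partition_key is not None:
        params["NextPartitionKey"] = next_partition_key
    if next_row_key is not None:
        params["NextRowKey"] = next_row_key
    # Leave "$" and "," readable; percent-encode everything else.
    return urlencode(params, safe="$,", quote_via=quote)


qs = build_query_string("PartitionKey eq 'a'", ["PartitionKey", "RowKey"], top=20)
print(qs)
print(parse_qs(qs))
```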

"I believe that crossing these boundaries also results in continuation tokens, which require additional round trips to storage to retrieve the results. This reduces performance and increases transaction counts (and subsequently cost)."

...is slightly incomplete. Crossing a partition boundary is not the only reason a continuation token is returned: Azure tables also return no more than 1000 results per request, so the two continuation tokens are used to fetch the next set. The default top_size is essentially 1000.

For your viewing pleasure, here's the description for querying entities from the Azure Python API. Others are much the same.

  '''
  Get entities in a table; includes the $filter and $select options. 

  table_name: Table to query.
  filter: 
     Optional. Filter as described at 
     http://msdn.microsoft.com/en-us/library/windowsazure/dd894031.aspx
  select: Optional. Property names to select from the entities.
  top: Optional. Maximum number of entities to return.
  next_partition_key: 
     Optional. When top is used, the next partition key is stored in
     result.x_ms_continuation['NextPartitionKey']
  next_row_key: 
     Optional. When top is used, the next row key is stored in
     result.x_ms_continuation['NextRowKey']
  '''
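Based only on the docstring above, a loop that follows the continuation values until the result set is exhausted might look like this sketch. It assumes the method is called `query_entities`, accepts the documented parameters as keyword arguments, and returns an iterable result carrying an `x_ms_continuation` dict; none of that is verified against a specific SDK version. `FakePage` and `FakeTableService` are in-memory stand-ins so the example runs without a storage account.

```python
def iter_all_entities(table_service, table_name, filter=None, select=None, top=None):
    """Yield every matching entity, issuing one request per page."""
    next_pk = next_rk = None
    while True:
        result = table_service.query_entities(
            table_name, filter=filter, select=select, top=top,
            next_partition_key=next_pk, next_row_key=next_rk)
        for entity in result:
            yield entity
        continuation = getattr(result, "x_ms_continuation", None) or {}
        next_pk = continuation.get("NextPartitionKey")
        next_rk = continuation.get("NextRowKey")
        if not next_pk and not next_rk:
            break  # no continuation values: the scan is complete


class FakePage(list):
    """Stand-in for an SDK result: a list of entities plus x_ms_continuation."""
    def __init__(self, entities, continuation=None):
        super().__init__(entities)
        self.x_ms_continuation = continuation or {}


class FakeTableService:
    """In-memory stand-in that pages through rows sorted by (PartitionKey, RowKey)."""
    def __init__(self, rows, server_cap=1000):
        self.rows = sorted(rows, key=lambda e: (e["PartitionKey"], e["RowKey"]))
        self.server_cap = server_cap
        self.calls = 0  # each call counts as one billable transaction

    def query_entities(self, table_name, filter=None, select=None, top=None,
                       next_partition_key=None, next_row_key=None):
        self.calls += 1
        start = 0
        if next_partition_key is not None:
            key = (next_partition_key, next_row_key or "")
            start = next((i for i, e in enumerate(self.rows)
                          if (e["PartitionKey"], e["RowKey"]) >= key),
                         len(self.rows))
        limit = min(top or self.server_cap, self.server_cap)
        end = start + limit
        continuation = None
        if end < len(self.rows):
            continuation = {"NextPartitionKey": self.rows[end]["PartitionKey"],
                            "NextRowKey": self.rows[end]["RowKey"]}
        return FakePage(self.rows[start:end], continuation)


rows = [{"PartitionKey": f"p{i % 3}", "RowKey": f"{i:05d}"} for i in range(2500)]
service = FakeTableService(rows)
total = sum(1 for _ in iter_all_entities(service, "demo"))
print(total, "entities in", service.calls, "requests")
```

The request count is the number to watch: every extra page is another round trip and another transaction on the bill.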

Licensed under: CC-BY-SA with attribution
