Question

I have a Azure table where customers post messages, there may be millions of messages in a single table. I want to find the fastest way of getting the messages posted within the last 10 minutes (which is how often I refresh the web page). Since only the partition key is indexed I have played with the idea of using the date & time the message was posted as a partition key, for example a string as a ISO8601 date format like "2009-06-15T13:45:30.0900000"

Example pseudo code:

var message = "Hello word!";
var messagePartitionKey = DateTime.Now.ToString("o");
var messageEntity = new MessageEntity(messagePartitionKey, message);
dataSource.Insert(messageEntity);

, and then query for the messages posted within the last 10 minutes like this (untested pseudo code again):

// Get the date and time 10 minutes ago
var tenMinutesAgo = DateTime.Now.Subtract(new TimeSpan(0, 10, 0)).ToString("o");

// Query for the latest messages
var latestMessages = (from t in
   context.Messages
   where t.PartitionKey.CompareTo(tenMinutesAgo) <= 0
   select t
   )

But will this be taken well by the index? Or will it cause a full table scan? Anyone have a better idea of doing this? I know there is a timestamp on each table item, but it is not indexed so it will be too slow for my purpose.

Was it helpful?

Solution

I think you've got the right basic idea. The query you've designed should be about as efficient as you could hope for. But there are some improvements I could offer.

Rather than using DateTime.Now, use Date.UtcNow. From what I understand instances are set to use Utc time as their base anyway, but this just makes sure you're comparing apples with apples and you can reliable convert the time back into whatever timezone you want when displaying them.

Rather than storing the time as .ToString("o") turn the time into ticks and store that, you'll end up with less formatting problems (sometimes you'll get the timezone specification at the end, sometimes not). Also if you always want to see these messages sorted from most recent to oldest you can subtract the number of ticks from the max number of ticks e.g.

var messagePartitionKey = (DateTime.MaxValue.Ticks - _contactDate.Ticks).ToString("d19");

It would also be a good idea to specify a row key. While it is highly unlikely that two messages will be posted with exactly the same time, it's not impossible. If you don't have an obvious row key, then just set it to be a Guid.

OTHER TIPS

The Primary key for Table is the combination of PartitionKey and RowKey(which forms a clustered index).

In your case, just go for RowKey instead of ParitionKey(provide a constant value for this).

You can also follow the Diagnostics approach, like for every ten minutes create a new Partition Key. But this approach is mainly for requirements like Archieving/Purging etc.,

I would suggest doing something similar to what Diagnostics API is doing with WADPerformanceCountersTable. There PartitionKey groups a number of timestamps into a single item. Ie: it rounds all timestamps into nearest few minutes (say, nearest 5 minutes). This way you do not have a limited amount of partition keys and yet are still able to do ranged queries on them.

So, for example, you can have a PartitionKey that maps to each timestamp that is rounded into 00:00, 00:05, 00:10, 00:15, etc.. and then converted to Ticks

  • From my understanding using partition key with exact equal "=" will be much faster than less than using "<" or "greater than ">.
  • Also make sure to put more efforts if we can get the unique combination of partition key and row key for your condition.
  • Also make sure that you do less unique combinations of partition keys values to avoid more partitions.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top