Azure Tables - Partition Key and Row Key - Correct Choice

Question 1

Your design has to be related to your query. You can filter your data based on 2 columns PartitionKey and RowKey. PartitionKey is your most important column since your queries will hit that column first.

In your case CustomerId should be your PartitionKey since most of the time you will try to reach your data based on the customer. (you may also need to keep another table for your client list)

Now, RowKey can be your tripId or time. if I were you I probably use rowKey as yyyyMMddHHmm|tripId format which will let you to query based on startWith and endWidth options.

Question 2

The thing about the Partition Key is that it represents a logical grouping; You cannot insert data spanning multiple partition keys, for example. Similarly, rows with the same partition are likely to be stored on the same server, making it quick to retrieve all the data for a given partition key.

As such, it is important to look at your domain and figure out what aggregate you are likely to work with.

If I understand your domain model correctly, I would actually be tempted to use the TripId as the Partition Key and the JourneyStep as the Row Key. You will need to, separately, have a table that lists all the Trip IDs that belongs to a given Customer - which sort of makes sense as you probably want to store some data, such as "trip name" etc in such a table anyway.

Question 3

Adding to @Frans answer:

One thing you could do is create a separate table for each customer. So you could have table named like Customer. That way each customer's data is nicely segregated into different tables. Then you could use TripId as PartitionKey and then JourneyStep as RowKey as suggested by @Frans. For storing some metadata about the trip, instead of going into a separate table, I would still use the same table but here I would keep the RowKey as empty and put other information about the trip there.

Question 4

I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:

PartitionKey: combination of CustomerId and TripId.

string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())

RowKey: combination of the DateTime.MaxValue.Ticks - Time.Ticks formatted to a large 0-padded string with the JourneyStep.

string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("00000000000000000"), JourneyStep.ToString())

Such combination will allow you to do the following queries "quickly".

Get data by CustomerId only. Example: context.Trips.Where(n=>string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id+"_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz") >=0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n=>n.PartitionKey == string.Format("{0}_{1}", customerId, tripId).AsTableServiceQuery(context);
Get last X amount of journey steps if you were to search by either CustomerId or CustomerId/TripId by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)

If you can guarantee uniqueness of Times of Steps within each Trip, you don't even have to put JourneyStep into the RowKey as it is somewhat inconvenient

The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.

HTH

Question 5

The design of table storage is a function to optimize two major capabilities of Azure Tables:

Scalability
Search performance

As @Frans user already pointed out, Azure tables uses the partitionkey to decide how to scale out your data on multiple storage server nodes. Because of this, I would advise against having unique partitionkeys, since in theory, you will have Azure spanning out storage nodes that will be able to serve one customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify if there are patterns in your partitionkeys and thus be able to group them (example, if your ids are consecutive numbers). You don't want to fall into this scenario because the scalability of your storage will be unpredictable and at the hands of obscure algorithms that will be making those decisions. See HERE for more information about scalability.

Regarding performance, the fastest way to search is to hit both partitionkey+rowkey in your search queries. Contrary to Amazon DynamoDB, Azure Tables does not support secondary column indexes. If you have your search queries search for attributes stored in columns apart from those two, Azure will need to do a full table scan.

I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition+row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our partition key is the user's country and the rowkey is a customer unique ID. This means our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, or by location, or by email address SMTP domain?