Question

With zero experience designing non-relational databases (Azure Storage Tables, to be specific), I'm having trouble coming up with a good design to store the data for my application.

The application is really simple. It is basically a multi-user To-Do list:

User selects a "Procedure".
User is presented with a webpage containing several checkboxes.
User starts checking checkboxes.
Each check/uncheck gets stored in the DB.

For example, let's say that we have a procedure to obtain Milk:

Procedure 1 - How to obtain Milk:
    [_] Step 1 - Open fridge
    [_] Step 2 - Get Milk
    [_] Step 3 - Close fridge

Alice decides to execute this procedure, so she creates a new execution and starts checking checkboxes:

Procedure 1, Execution 1:
    Executor(s): Alice
    [X] Step 1 - Open fridge
    [X] Step 2 - Get Milk
    [_] Step 3 - Close fridge

Bob also decides to execute this procedure, but not together with Alice, so he creates a new execution. Charlie, on the other hand, wants to help Bob, so instead of creating a new execution he joins Bob's execution:

Procedure 1, Execution 2:
    Executor(s): Bob, Charlie
    [_] Step 1 - Open fridge
    [X] Step 2 - Get Milk
    [_] Step 3 - Close fridge

In summary, we can have multiple procedures, and each procedure can have multiple executions:

[Figure: Procedure-Execution relationship]

So, we need a way to store procedures (a list of checkboxes), executions (who, when, checkbox states), and the history of checks/unchecks.

This is what I have come up with so far:

  • Create three tables: Procedures, Executions, Actions.
  • The Procedures table stores what checkboxes are there in each procedure.
  • The Executions table stores who initiated the execution of a Procedure and when, along with the checkbox states.
  • The Actions table stores every checkbox check and uncheck, including who and when.

I'm not too happy with this approach for a number of reasons. For instance, every time a user clicks on a checkbox we need to update the Executions table row and insert a new row into the Actions table at the same time. Also, I'm not sure if this design will scale for a really large number of Procedures, Executions, and Actions.

What would be a good way to store this data using Azure Storage Tables, or a similar NoSQL store? How would you go about designing this database? And, how would you partition the data (row keys, partition keys)?


Solution

First, you don't need to coerce Azure tables into a relational structure. They're very fast and very cheap, and designed so you can dump blocks of data in and worry about the structure when you retrieve it.

Second, correctly identifying and structuring your partition keys makes retrieval even faster.

Third, Azure tables don't have to have uniform structures. You can store different kinds of data within one table, even with the same partition keys. This opens up possibilities not available to an RDBMS.

So how are you planning to retrieve the data? What are the use cases?

Let's say your primary use case is to retrieve the data by time, like an audit log. In that case, I would suggest this approach:

  • Put your procedures, executions, and actions all within the same table.
  • Create a new table for each unit of time that gives you tens of thousands to hundreds of thousands of rows per table, or some other unit that makes sense. (For one project I've done recently, the application's event log uses one table per month, with each table growing to around 100,000 rows.)
  • Create a partition key that gives you hundreds to thousands of rows per partition. (We use hours remaining until DateTimeOffset.MaxValue. When you query an Azure table without using a partition key, you see the lowest partitions first. This descending-hourly scheme means the most recent hour's entries are at the top of the results pane in our Azure tool.)
  • Structure your row keys to be human-readable. Remember that a row key needs to be unique within its partition. A row key like Procedure_Bob_ID12345_20140514-134630Z_unique, where unique is a counter or hash, would work.
  • When you query for data, pull back the entire partition--remember, it's just a few hundred rows--and filter the results in memory, where it's faster.
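
The key scheme in the list above can be sketched roughly as follows. This is a minimal Python illustration, not the answerer's actual code: `MAX_TIME` stands in for .NET's `DateTimeOffset.MaxValue`, and the `Log` table prefix, the 8-digit zero padding, and all helper names are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for .NET's DateTimeOffset.MaxValue (end of year 9999, UTC).
MAX_TIME = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)

def table_name(ts):
    """One table per month, e.g. 'Log201405' (hypothetical naming scheme)."""
    return f"Log{ts:%Y%m}"

def partition_key(ts):
    """Whole hours remaining until MAX_TIME.

    Zero-padded to a fixed width so that string order matches numeric
    order; more recent hours then sort first.
    """
    hours_left = (MAX_TIME - ts) // timedelta(hours=1)
    return f"{hours_left:08d}"

def row_key(kind, user, entity_id, ts, unique):
    """Human-readable row key in the style of the example above."""
    return f"{kind}_{user}_ID{entity_id}_{ts:%Y%m%d-%H%M%S}Z_{unique}"

def filter_rows(rows, row_key_prefix):
    """Filter an already-fetched partition in memory by row key prefix."""
    return [r for r in rows if r["RowKey"].startswith(row_key_prefix)]

ts = datetime(2014, 5, 14, 13, 46, 30, tzinfo=timezone.utc)
entity = {
    "PartitionKey": partition_key(ts),
    "RowKey": row_key("Procedure", "Bob", 12345, ts, "0001"),
}
```

The zero padding matters because partition keys are compared as strings, so without it an unpadded smaller number could sort after a larger one.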

Say you have a second use case where you need to retrieve data by user name. Simple: within the same table, add a second row containing the same data but with a partition key based on the user name (bob_execution_20140514).
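
As a rough sketch of that duplication, the helper below builds two entities from one event: one under a time-based partition key and one under a user-based partition key. The `fan_out` name and the shape of the `event` dictionary are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone

def fan_out(event, time_partition):
    """Return two entities with the same data under different partition keys."""
    ts = event["timestamp"]
    # User-based partition key in the style of 'bob_execution_20140514'.
    user_partition = f"{event['user'].lower()}_{event['kind'].lower()}_{ts:%Y%m%d}"
    by_time = {"PartitionKey": time_partition, "RowKey": event["row_key"], **event["data"]}
    by_user = {"PartitionKey": user_partition, "RowKey": event["row_key"], **event["data"]}
    return [by_time, by_user]

event = {
    "user": "Bob",
    "kind": "Execution",
    "timestamp": datetime(2014, 5, 14, 13, 46, 30, tzinfo=timezone.utc),
    "row_key": "Execution_Bob_ID2_20140514-134630Z_0001",
    "data": {"ProcedureId": 1, "Step": 2, "Checked": True},
}
entities = fan_out(event, time_partition="time-partition-placeholder")
```

One caveat: Azure Table batch transactions only cover entities that share a partition key, so the two copies cannot be written atomically and the writer has to tolerate one insert succeeding while the other fails.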

Another thing to consider is storing the entire object graph for a procedure, execution, and so on in the table. Getting back to our logging example, a log entry might have detailed information, so we just plop an entire block of JSON right in the table. (We're usually retrieving it in an Azure cloud service, so network throughput isn't a meaningful constraint, as Azure-to-Azure speeds within the same region are gigabits per second.)
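
A minimal sketch of the JSON-in-a-property idea, using plain dictionaries to stand in for table entities. The `Payload` property name and both helper functions are assumptions for this example, not part of any Azure SDK.

```python
import json

def to_entity(partition_key, row_key, graph):
    """Serialize a whole object graph into a single string property."""
    return {
        "PartitionKey": partition_key,
        "RowKey": row_key,
        "Payload": json.dumps(graph),  # hypothetical property name
    }

def from_entity(entity):
    """Rebuild the object graph from the stored JSON string."""
    return json.loads(entity["Payload"])

execution = {
    "procedure": "How to obtain Milk",
    "executors": ["Bob", "Charlie"],
    "steps": [
        {"name": "Open fridge", "checked": False},
        {"name": "Get Milk", "checked": True},
        {"name": "Close fridge", "checked": False},
    ],
}
entity = to_entity("procedure-1", "Execution_2", execution)
restored = from_entity(entity)
```

Azure Table string properties and whole entities have size limits, so a very large graph may need to be split across several properties or moved to blob storage.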

Other tips

Depending on how you use the data, use either the ProcedureID or a combination of ProcedureID-ExecutionID as the partition key. Don't worry about building a quasi-relational model; just choose the partition key based on how you are most likely to create or consume the data in the majority of cases. That is, will you care more about procedures, executions, assignees, or steps in the longer term, and how might you retrieve all items related to a single entity, such as a procedure, in a single query?
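
One possible layout along these lines, sketched with plain dictionaries: the procedure ID is the partition key, and executions and actions share the partition, told apart by a row key prefix. The prefixes, padding widths, and helper names are all hypothetical.

```python
from datetime import datetime, timezone

def execution_row_key(execution_id):
    return f"Execution_{execution_id:06d}"

def action_row_key(execution_id, ts, seq):
    return f"Action_{execution_id:06d}_{ts:%Y%m%d-%H%M%S}Z_{seq:04d}"

def rows_with_prefix(rows, prefix):
    """Select one kind of entity from a fetched partition by row key prefix."""
    return [r for r in rows if r["RowKey"].startswith(prefix)]

ts = datetime(2014, 5, 14, 13, 46, 30, tzinfo=timezone.utc)

# Everything for procedure 1 lives in one partition, so a single
# partition query returns the executions and their action history.
partition = [
    {"PartitionKey": "Procedure_1", "RowKey": execution_row_key(1), "Executors": "Alice"},
    {"PartitionKey": "Procedure_1", "RowKey": execution_row_key(2), "Executors": "Bob,Charlie"},
    {"PartitionKey": "Procedure_1", "RowKey": action_row_key(2, ts, 1),
     "User": "Bob", "Step": 2, "Checked": True},
]

executions = rows_with_prefix(partition, "Execution_")
bobs_actions = rows_with_prefix(partition, "Action_000002_")
```

With ProcedureID-ExecutionID as the partition key instead, each execution gets its own smaller partition, at the cost of needing several queries to see everything for a procedure.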

Depending on the number of steps in a procedure, you might not even need to care much about how step values are tracked: an integer or flags enum whose values are combined with bitwise operators might be enough. See "Most common C# bitwise operations on enums".
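
To illustrate the bitwise idea, here is a small sketch that packs the checkbox states of one execution into a single integer, with bit 0 for step 1, bit 1 for step 2, and so on. The function names are made up for the example, and it is shown in Python rather than C#.

```python
def check(state, step):
    """Return the state with the given 1-based step checked."""
    return state | (1 << (step - 1))

def uncheck(state, step):
    """Return the state with the given 1-based step unchecked."""
    return state & ~(1 << (step - 1))

def is_checked(state, step):
    return bool(state & (1 << (step - 1)))

# Alice's execution from the question: steps 1 and 2 checked, step 3 not.
alice = 0
alice = check(alice, 1)
alice = check(alice, 2)
```

The whole state then fits in one integer property on the execution row, although a 32-bit or 64-bit property caps the number of steps a procedure can have.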

The selection of PK, RK, and other table properties depends on how you are going to use the data: your dominant query and application behavior. The storage team blog (http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx) has guidance on this for common scenarios.

Licensed under: CC-BY-SA with attribution