Question

Let's say I have a table (let's call it BigTable) which could experience 5,000,000 INSERTs per day (with possibly just as many SELECTs). Each row inserted is about 50 KB.

These daily INSERTs are split across 5 clients equally (the table has a FK called ClientID). There is never a need to SELECT or JOIN data across multiple clients.

I am worried about the database performance as this table grows, so I have come up with two solutions.

SOLUTION 1:

  • Partition BigTable by ClientID
  • Store each partition on a separate hard disk on the server (using Azure blob storage).
  • Move all data which is more than 1 month old (archive data, which still needs to be queryable) into another set of READONLY partitions.

Essentially this means the following partitions on their own storage devices:

  • Primary (all data excluding BigTable)
  • ClientA's BigTable (5,000,000 rows per day / 5 clients x 30 days = 30,000,000 rows)
  • ClientB's BigTable (30,000,000 rows)
  • ClientC's BigTable (30,000,000 rows)
  • ClientD's BigTable (30,000,000 rows)
  • ClientE's BigTable (30,000,000 rows)
  • ClientA's BigTable archive
  • ClientB's BigTable archive
  • ClientC's BigTable archive
  • ClientD's BigTable archive
  • ClientE's BigTable archive

The number of rows in the archive tables will be (5,000,000) x (age of DB in days) - (30,000,000). This is still a huge table, but will only be used for drawing up the odd report.
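For reference, here is a minimal T-SQL sketch of what Solution 1 might look like (the filegroup names, file paths, and the assumption that ClientID is an int from 1 to 5 are all illustrative):

    -- One filegroup per client, each placed on its own data drive
    -- (names and paths are placeholders).
    ALTER DATABASE MyDb ADD FILEGROUP ClientA_FG;
    ALTER DATABASE MyDb ADD FILE
        (NAME = ClientA_Data, FILENAME = 'F:\Data\ClientA.ndf')
        TO FILEGROUP ClientA_FG;
    -- ...repeat for clients B through E, and for the archive filegroups...

    -- One partition per ClientID (assumed to be 1 through 5).
    CREATE PARTITION FUNCTION pfClient (int)
        AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5);

    -- Map each partition to its client's filegroup; the trailing
    -- filegroup catches any out-of-range ClientID.
    CREATE PARTITION SCHEME psClient
        AS PARTITION pfClient
        TO (ClientA_FG, ClientB_FG, ClientC_FG, ClientD_FG, ClientE_FG, [PRIMARY]);

    CREATE TABLE dbo.BigTable (
        BigTableID  bigint IDENTITY  NOT NULL,
        ClientID    int              NOT NULL,
        CreatedAt   datetime2        NOT NULL DEFAULT SYSUTCDATETIME(),
        Payload     varbinary(max)   NOT NULL,
        CONSTRAINT PK_BigTable PRIMARY KEY (ClientID, BigTableID)
    ) ON psClient (ClientID);

Note that with this layout, moving month-old rows into the READONLY archive cannot be a metadata-only ALTER TABLE ... SWITCH: switching requires the source and target partitions to sit on the same filegroup, and a ClientID-based function has no date boundary to switch out, so the archival step would involve physically moving data (or a composite date-based scheme).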

SQL Server will be hosted on a 14 GB, 8-core Azure VM.

SOLUTION 2:

The other option is to host a separate database for each client. This means each will have its own dedicated SQL Server machine. Partitioning will still happen for archive data.

This option is not optimal because of the physical separation of the data: having to manage updates to multiple databases could be very problematic, and maintaining separate database connections for each client is an added consideration for the developers.

Could anyone perhaps advise on these options?


Solution

Since you have tagged this with [azure] and [sql-server], I assume that you are trying to do this in Windows Azure. If that is the case then a) partitioning by client is not necessarily a good idea, and b) SQL may not be the best (complete) fit for your problem.

When building scalable architectures, the partitioning strategy shouldn't be based on something specific like 'client', but rather something more arbitrary. The reason is simple — unless clients have a reason to be separate, such as not wanting their data mixed with others, or different SLAs per client, then the choice of 'client' as a partition may not render the optimal result. If 80% of your business is generated by a single client, you have not solved your problem, and still have to maintain n separate databases for marginal load.

5 mil database inserts per day is not a big number, but may be a big number for SQL Server hosted in Azure IaaS, or Azure SQL Database — due to the performance of the underlying commodity hardware. Before determining how to partition SQL, ask yourself two questions. First, what are the usage and performance characteristics that you want from the data? (Does it have to be immediately consistent? Can you process data asynchronously?) Second, have you mapped those characteristics against other data store technologies? Have you considered Table Storage (or non-MS solutions like Redis)?

You may find, after trying out a few options, that:

  • SQL is a good store for some of the data, some of the time.
  • Much of the processing can be done asynchronously, so the peak performance of inserts is of little concern (and doing 5 mil inserts over a 24-hour period is not a problem).
  • SQL may not be suited to long-term storage.
  • Querying of older data can be done effectively using map-reduce, rather than SQL queries.

For example, I have an app that tracks vehicles at one-second intervals. It is targeted for 100,000 vehicles but architected in such a way as to be able to scale up to millions without changing any code or databases. In the medium term, it has to cope with 72 mil inserts per day. All of this runs on a single Windows Azure SQL database that is less than 10 GB, plus a whole bunch of table storage.

The reason this works is that although I want to archive all the data (72 mil rows), I don't need complex SQL query access to it, so it sits happily in table storage. What I store in SQL is a summary of the data. In my case, I am only interested in a vehicle's journey (start and end position, distance travelled, etc.), which means I only need two or three rows per vehicle per day in SQL — greatly reducing the load on the database.

Also, my bottleneck is in the collection of data, so I add the data immediately to a (Windows Azure) queue — and worry about the summarising of data in a separate workload.
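To make the summary-versus-raw split concrete, here is a hypothetical sketch of the kind of summary table that ends up in SQL (all names and columns are assumptions based on the description above; the raw one-second readings stay in table storage):

    -- Per-journey rollup only; the 72 mil/day raw position rows
    -- never touch SQL (names and types are illustrative).
    CREATE TABLE dbo.VehicleJourney (
        VehicleID    int           NOT NULL,
        JourneyStart datetime2     NOT NULL,
        JourneyEnd   datetime2     NOT NULL,
        StartLat     decimal(9,6)  NOT NULL,
        StartLon     decimal(9,6)  NOT NULL,
        EndLat       decimal(9,6)  NOT NULL,
        EndLon       decimal(9,6)  NOT NULL,
        DistanceKm   decimal(9,2)  NOT NULL,
        CONSTRAINT PK_VehicleJourney PRIMARY KEY (VehicleID, JourneyStart)
    );

Two or three of these rows per vehicle per day is a few hundred thousand small inserts, instead of tens of millions of raw ones, which is why the SQL database stays so small.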

This answer may be a bit long, but it is intended to get you thinking about your data model more carefully, rather than just about how to solve the problem with SQL. For more detail, have a look at the data model in CALM.

Other tips

A few ideas for you: 1) Use Azure Tables instead of SQL. Have PartitionKey = ClientID. Each table can be 200 TB and support 20k IOPS, and each partition supports 2k IOPS; since the clients would be logically separated, you would get a natural load balance (Azure load balances by Partition). This would also save you from having to run/manage an XL VM 24x7 (i.e. much cheaper). The storage costs would be the same, since the data drive for the VM is backed by Azure storage anyway. 5M inserts per day is only ~60/sec, so there would be a lot of headroom for growth. This is especially true because you are doing fairly simplistic Insert/Select and would not cross Client boundaries.

2) If you want a per-client DB, I would go with SQL Azure. The provisioning is much faster, and each DB is a separate scale unit (this would prevent one client from creating an issue for the others). You can also resize each DB dynamically as a client's needs change.

3) If you want a single monolithic DB, I would go with SQL Server on a VM. Create multiple data drives and mount them as a striped set. For an XL VM, you can have up to 16 drives mapped. This limits the maximum size of the DB to 16 TB, so you would need some mechanism for aging out/archiving/garbage collecting data.

You not only need to think about access performance but also about disaster recovery performance. At roughly 1.5 TB per client for the active month alone (1,000,000 rows/day x 50 KB x 30 days), I strongly recommend keeping the clients in separate databases.

If you have decent Continuous Integration and an automated deploy process, keeping the database schemas in sync should not be that big of a problem.
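For example, the deploy step could run idempotent change scripts against each client database, so re-running a deployment is harmless (the table and column names below are placeholders):

    -- Idempotent schema change: safe to run repeatedly against
    -- every client database during deployment.
    IF NOT EXISTS (
        SELECT 1
        FROM sys.columns
        WHERE object_id = OBJECT_ID('dbo.BigTable')
          AND name = 'ProcessedAt'
    )
    BEGIN
        ALTER TABLE dbo.BigTable ADD ProcessedAt datetime2 NULL;
    END;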
