Question

Say everything is optimized, queries are simple, indexes are good, etc.

Should a SQL (Azure) database where you spend about 150 to 300 euros per month easily handle reading about 250-500 MB per second from a table of ~100 million rows, or is that getting close to the limits and should I look at other solutions?

I understand this is a very broad and general question, but I'm looking for something like very rough estimates. The problem is that I have a very complicated legacy project and I am clearly hitting some limits.

I tried (after consulting Azure experts) various other data solutions in the past, only to hit other bottlenecks once I had migrated the project and real production traffic came back.

The site has extreme peak traffic in a very short period of time. It sits at about 5% of database (eDTU) consumption for most of the week, with extreme traffic once or twice a day for about half an hour: 50,000+ very active concurrent users, with a lot of dynamic data being written and, mostly, read. The data can't be stale.

So I'm just looking for a very rough estimated guess so I know if I need to move in a really different direction, e.g. memory-cached microservices.

If 500 MB/second should be no problem, then I will explore laying out my data differently or other SQL-based solutions.


Solution

Measuring whether a server can handle your specific workload by the amount of money it costs, when weighing different implementation options, is about as useful as comparing the prices of different cars to figure out whether they can drive you from point A to point B.

The analogy is meant to convey that we can't really provide a meaningful answer based on money spent: the same budget buys many different server configurations, and for equivalent problems the same spend will generally yield roughly the same performance regardless of which implementation you choose (RDBMS, NoSQL, memory-cached, alternative DB system, etc.). Cost alone is not specific enough for an accurate answer.

That being said, knowing things like how frequently the data changes, how frequently new data is added, the total size of the tables and the database, and how concurrently active your database will be helps paint the picture, and you've already provided some of that.

Other things that will be key in clarifying a good solution are what kind of data it is and how it is currently structured (e.g. is it highly relational, loosely defined, or mixed?), and what kind of queries are typical during your off-peak and peak time-frames. Concrete examples are very helpful here in communicating what your data and its querying look like.

For example, if your data isn't very well structured or concretely defined and it's typically accessed only by key, then a NoSQL solution is probably best. If the type of querying your users do is analytical (such as aggregative querying), then it might make sense to scale up a data warehouse or use features built into Azure SQL to improve analytical querying.
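
As a hedged sketch of the "built into Azure SQL" option: a nonclustered columnstore index can speed up aggregative queries over a large table while the rowstore copy keeps serving the transactional workload. The table and column names below are hypothetical stand-ins, not anything from your project:

    -- Hypothetical table; the real schema would come from your legacy project.
    -- A nonclustered columnstore index keeps the rowstore table for OLTP reads/writes
    -- while giving aggregate queries a compressed, column-oriented copy to scan.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Scores_Analytics
        ON dbo.Scores (UserId, EventId, Points, CreatedAt);

    -- An aggregate query like this can then be served from the columnstore index:
    SELECT EventId, COUNT(*) AS Entries, SUM(Points) AS TotalPoints
    FROM dbo.Scores
    WHERE CreatedAt >= DATEADD(DAY, -1, SYSUTCDATETIME())
    GROUP BY EventId;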

Finally, I'll add from my previous experience with a server of similar data size and concurrency that $300 a month was not enough to support the hardware behind it. Sure, we definitely could have made some design and implementation improvements, but I still think $300 just wouldn't cut it (for a cloud-based solution). I'm going to assume the same will be true for you, but that's a big assumption since my example is only anecdotal, and as David Browne mentioned, you're best off testing to see what works within your price range.

OTHER TIPS

Should a SQL (Azure) database where you spend about 150 to 300 euros per month easily handle reading about 250-500 MB per second from a table of ~100 million rows, or is that getting close to the limits and should I look at other solutions?

Yes. That should work, with an appropriate design.

A 2 vCore provisioned General Purpose database with a 1-year reserved-instance commitment costs about that and provides 10 GB of memory, more than half of which will be used for caching data, so it can support a significant volume of concurrent reads over the cached data.
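
One way to sanity-check whether a given tier actually has headroom during your peaks is to query sys.dm_db_resource_stats in Azure SQL Database, which records CPU, data IO and memory usage as percentages of the current tier's limits (roughly the last hour at 15-second granularity). Sustained values near 100% during your half-hour spikes would indicate the tier, or the design, is the bottleneck:

    -- Recent resource usage as a percentage of the current service tier's limits
    SELECT end_time,
           avg_cpu_percent,
           avg_data_io_percent,
           avg_memory_usage_percent
    FROM sys.dm_db_resource_stats
    ORDER BY end_time DESC;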

The site has extreme peak traffic in a very short period of time.

And so you can probably do this for less using the Serverless Model, where you can, for instance, run at 0.5 vCore for most of the time, scale to 4 vCores whenever you need to, and even auto-pause when not in use.

See https://docs.microsoft.com/en-us/azure/azure-sql/database/resource-limits-vcore-single-databases#gen5-compute-generation-part-1
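
As a rough sketch (the database name is a placeholder), moving a database to a serverless service objective can be done with plain T-SQL; 'GP_S_Gen5_4' is the serverless General Purpose objective with a 4 vCore maximum, and the database then scales between its configured min and max vCores on demand:

    -- Switch a database (hypothetical name) to the serverless General Purpose tier
    -- with a 4 vCore maximum; compute then auto-scales with load.
    ALTER DATABASE [MyLegacyDb]
        MODIFY (SERVICE_OBJECTIVE = 'GP_S_Gen5_4');

The min vCore setting and the auto-pause delay are configured through the Azure portal, CLI or REST API rather than T-SQL.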

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange