The client sends a query (a few hundred characters) to the web service. This query can be split into 20 to 150 subqueries with a simple regex. Those subqueries can then be computed independently, and each can take up to 5 seconds. We'd therefore like to run the subqueries in parallel so that the original query returns quickly.
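For concreteness, here is a rough single-machine sketch of the fan-out/fan-in we have in mind (`split_query` and `run_subquery` are placeholders for the regex split and the actual computation):

```python
import concurrent.futures
import re

# Placeholder: the real split is "a simple regex" over the query text.
def split_query(query: str) -> list[str]:
    return re.split(r";\s*", query)  # illustrative pattern only

# Placeholder: the CPU-bound computation, up to ~5 s per subquery.
def run_subquery(subquery: str) -> str:
    return subquery.upper()

def handle_query(query: str, budget_s: float = 5.0) -> list[str]:
    subqueries = split_query(query)
    # For CPU-bound work, Python threads only run in parallel if the
    # computation releases the GIL; otherwise use processes or remote workers.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(subqueries))
    futures = [pool.submit(run_subquery, s) for s in subqueries]
    done, _not_done = concurrent.futures.wait(futures, timeout=budget_s)
    pool.shutdown(wait=False, cancel_futures=True)  # cancel_futures: Python 3.9+
    return [f.result() for f in done]
```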

Does it make sense to set up an AWS Lambda function for the subqueries, put it behind an HTTP gateway, and then have a small application server that waits for all subqueries to finish, integrates the results, and sends them back to the client? Or are we better off running up to 150 threads on a large EC2 instance?

We want the service to scale easily, but we don't expect a lot of users in the beginning. For those users, however, the query should complete within ~5 seconds.

Note: AWS is not a requirement; I'm just using it as an example.


Solution

Since you are CPU-limited, you need to get your hands on 150 CPU cores, one for each thread. This rules out a single server, since a server of such proportions would be prohibitively expensive – and you don't really need it.

Your general architecture with a common frontend that distributes work to multiple workers and combines their results appears to be sensible. You'll have to do some calculations to find the most cost-effective solution to get that many CPUs. That tends to point towards AWS Lambda since you only require computations in bursts, but it may come with restrictions. How many Lambdas may execute simultaneously? 150 at once is a lot. Which languages can you use; can you reduce your cost by using optimized native code? Importantly, I don't think Amazon makes specific performance guarantees for that product, whereas you have more control over the physical CPU with more traditional instance types.
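If you do go the Lambda route, the frontend's fan-out could look roughly like this (a sketch using boto3; the function name `subquery-worker` and the payload shape are made up, and you would need to verify that your account's concurrent-execution limit covers 150 parallel invocations):

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_subquery(subquery: str) -> dict:
    # Synchronous invocation; the Lambda's own timeout should match the budget.
    response = lambda_client.invoke(
        FunctionName="subquery-worker",        # hypothetical function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"subquery": subquery}),
    )
    return json.loads(response["Payload"].read())

def fan_out(subqueries: list[str]) -> list[dict]:
    # The invocations are I/O-bound waits, so plain threads are fine here.
    with ThreadPoolExecutor(max_workers=len(subqueries)) as pool:
        return list(pool.map(invoke_subquery, subqueries))
```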

And the actual CPU performance is important for you. While you are willing to kill the computation after 5 seconds, the amount of computation performed until then may vary wildly. You could probably assemble 150 cores rather cheaply by running a Beowulf cluster of Raspberry Pi boards in your basement, but that is not remotely comparable to the computation power of five high-end Intel Xeon servers.

It is therefore important that you clearly define your performance goals and an SLA, and then test a proposed solution against them. You will also have to think about simultaneous requests. Given the large amount of computation per client request, it may be best to process client requests sequentially, if that is acceptable to the clients. But this also puts an upper limit on the number of clients you can support, since the probability that a client has to wait before their request can be processed grows rather quickly (related to the birthday paradox).
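To make the birthday-paradox analogy concrete, here is a rough estimate of the chance that at least two of n daily requests overlap, under the simplifying assumption that requests are independent and uniformly spread over 5-second slots:

```python
def collision_probability(n_requests: int, n_slots: int) -> float:
    # Same computation as the classic birthday problem:
    # P(overlap) = 1 - P(all requests land in distinct slots).
    p_distinct = 1.0
    for i in range(n_requests):
        p_distinct *= (n_slots - i) / n_slots
    return 1.0 - p_distinct

slots_per_day = 24 * 60 * 60 // 5  # 17,280 five-second slots
for n in (10, 50, 100, 200):
    print(n, round(collision_probability(n, slots_per_day), 3))
# Roughly 0.003, 0.07, 0.25, 0.68 -- the overlap risk grows fast.
```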

This is a scalability problem. You can either delay it by scheduling client requests so as to avoid simultaneous requests, or gain the ability to handle multiple requests in parallel. That in turn can be managed either by throwing more money/servers at the problem, or by performance-tuning the algorithm. For example, I've seen a case where a Python program could be made 3× faster by profile-guided optimizations such as hoisting an instance attribute access out of a very tight loop. The biggest wins always come from reducing algorithmic complexity, where that is possible.
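For illustration, the attribute-hoisting trick looks like this in Python; binding the bound method to a local avoids re-resolving `self.table.get` on every iteration (the 3× figure was specific to that one case):

```python
class Counter:
    def __init__(self, table: dict[str, int]):
        self.table = table

    def count_slow(self, tokens: list[str]) -> int:
        total = 0
        for t in tokens:
            total += self.table.get(t, 0)   # attribute lookup every iteration
        return total

    def count_fast(self, tokens: list[str]) -> int:
        get = self.table.get                # hoisted out of the tight loop
        total = 0
        for t in tokens:
            total += get(t, 0)
        return total
```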

Other tips

I would design my interface and task master under two assumptions:

  • Processing power is distributed
  • I don't always have enough processing power available

If the jobs are inherently long, and the client/end-user knows this, my preference is to make the task master respectful of "competing" tasks. I'll favor progress notifications and mechanisms for client-initiated cancellation over a "moody" service that just "gives up" when it feels like it. Time limits should be oriented around stopping requests that aren't actually making progress, not just lengthy requests that the client might be willing to wait for.
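A minimal sketch of that shape, with a cancellation flag the client can trip and a progress callback (all names are illustrative, and the loop is kept sequential for brevity):

```python
import threading
import time

def run_subquery(subquery: str) -> str:
    time.sleep(0.1)                          # stand-in for the real work
    return subquery.upper()

def process(subqueries: list[str], on_progress, cancel: threading.Event) -> list[str]:
    results = []
    for i, sq in enumerate(subqueries):
        if cancel.is_set():                  # client-initiated cancellation
            break
        results.append(run_subquery(sq))
        on_progress(i + 1, len(subqueries))  # progress notification hook
    return results

# The request handler owns the Event and exposes a "cancel" endpoint that
# simply calls cancel.set(); the loop notices it between subqueries.
cancel = threading.Event()
results = process(["a", "b"], lambda done, total: print(f"{done}/{total}"), cancel)
```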

Also consider that even if I have 150 CPUs (distributed or not), the occasional pair of simultaneous requests means requests will start failing. And my clients won't be happy if their jobs start failing "because someone else was also using the service."
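One way to degrade gracefully instead: gate admissions with a semaphore and return an explicit "busy, retry later" when capacity is taken, rather than letting two requests silently fight over the same cores. A sketch:

```python
import threading

def run_fan_out(request: str) -> dict:
    return {"status": 200, "body": f"done: {request}"}  # stand-in

capacity = threading.BoundedSemaphore(1)  # one full fan-out at a time

def handle(request: str) -> dict:
    # Wait briefly for capacity, then fail loudly and retryably instead of
    # letting two requests contend for the same cores.
    if not capacity.acquire(timeout=10):
        return {"status": 503, "body": "busy, please retry"}
    try:
        return run_fan_out(request)
    finally:
        capacity.release()
```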

And just from a pragmatic development perspective, I want to be pretty agnostic about the hosting environment. It's going to be easier to code for distributed work up front than it will be to later decide to distribute work that depends on threads, and possibly even shared state (or whatever).
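Concretely, that can be as little as coding against a narrow "run these subqueries" interface so the backend (threads, processes, Lambda, a work queue) stays swappable; a sketch:

```python
from typing import Protocol

class SubqueryRunner(Protocol):
    def run_all(self, subqueries: list[str]) -> list[str]: ...

class LocalRunner:
    """Thread-pool backend; a LambdaRunner etc. would satisfy the same shape."""
    def run_all(self, subqueries: list[str]) -> list[str]:
        from concurrent.futures import ThreadPoolExecutor
        with ThreadPoolExecutor() as pool:
            return list(pool.map(str.upper, subqueries))  # stand-in work

def handle_query(query: str, runner: SubqueryRunner) -> list[str]:
    # The handler never knows (or cares) where the work actually runs.
    return runner.run_all(query.split(";"))
```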

... On the other hand, if the client expects these requests to be quick, you may be focusing on the wrong thing here! You may need to look at optimizations, caching, and heuristics first. (Unless you really want to buy and manage ~150 CPUs per concurrent request.)

It's difficult to scale up a service with high CPU or I/O usage.

In my personal experience, to scale up you need to understand the problem better and try to solve it in the best possible way.
For example, you can try to cache some results so the server doesn't need to recompute everything. This is very simple to implement, and it usually saves some CPU cycles.
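A minimal version of such caching, assuming subqueries repeat across client queries and results are deterministic:

```python
from functools import lru_cache

def expensive_compute(subquery: str) -> str:
    return subquery[::-1]  # stand-in for the real regex work

@lru_cache(maxsize=4096)
def run_subquery(subquery: str) -> str:
    # Safe only if the result depends solely on the subquery text and
    # staleness is acceptable; otherwise add a TTL or explicit invalidation.
    return expensive_compute(subquery)
```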
Regexes are a very powerful tool, but they come at a significant CPU cost. Some databases and servers can be extended with native code, so you could port the regex to native code; however, that code then becomes platform-specific and more complicated to maintain.
You can also try to translate the regex into a simpler query and measure the performance.
Another possible solution is to pass through a temporary table, so you can reduce the number of rows before applying the regex.
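The same row-reduction idea expressed in application code, with a purely illustrative pattern: run a cheap containment test first and apply the expensive regex only to the survivors.

```python
import re

pattern = re.compile(r"\berror\s+\d{3}\b")  # illustrative expensive pattern

def match_rows(rows: list[str]) -> list[str]:
    # The cheap substring test plays the role of the temporary table:
    # the expensive regex then only runs over the surviving rows.
    candidates = [r for r in rows if "error" in r]
    return [r for r in candidates if pattern.search(r)]
```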

Licensed under: CC-BY-SA with attribution