Maximizing SQL Server Service Broker Throughput

https://stackoverflow.com/questions/16488162

21-04-2022

Question

We've implemented an SSB messaging solution for our application, but are running into scaling issues now. Can anyone with experience in the scaleup of SSB applications offer any suggestions as to what we might be doing wrong?

The setup is that we use a single initiator queue that feeds a single target queue with an activated procedure. The activated procedure processes received messages and selectively dispatches them to clients that have registered for messages of a relevant type.

This second-stage dispatch again uses a single initiator queue (different from the one used for initial message injection) and sends messages to however many client queues are determined to be appropriate.

Each client performs operations against the database that create messages that get sent to all other clients, so it's an N^2 scaling problem. For relatively small numbers of clients (10 or fewer) this isn't a problem for us, but when we scale up to the N=35 or N=40 range, at some point in the workflow we start to enqueue messages faster than we can process them, and we start suffering significant latency problems. The load we're failing at is still well under what has been reported as best-case performance for an SSB implementation, though, so I'm sure there's a flaw in our implementation.

Relevant diagnostics include:

Our server has plenty of CPU, I/O, and network bandwidth available even under the heaviest client loading we've seen, even while messages are backing up in the queues.
We've configured the system to activate anywhere from 5 to 512 copies of the activated procedure, with little discernible effect on throughput and end-user performance.
The activated procedure operates on multiple messages at a time, and processes them with some mild XML queries and SELECTS against some small database tables. We've tested this procedure under no-load conditions and its overhead is light.
We are showing high percentages of LCK_M_X, PAGELATCH_SH, PAGELATCH_EX, and WRITELOG waits (those are the top 4 offenders).
We're seeing approximately twice as many SENDs/sec as RECEIVEs/sec under our heaviest load.

If there are other diagnostics that would be helpful for anyone who might have an idea about what we can do to speed our configuration up, I can probably find them.

Solution

"We're seeing approximately twice as many SENDs/sec as RECEIVEs/sec under our heaviest load."

I think this is the crux of the problem. The counter measures the rate at which statements execute, not the number of messages. This means that each RECEIVE is probably returning only one or two messages per result set. Because of conversation group locking, RECEIVE returns messages from only one conversation group in each result set. Even if there are thousands of messages available in the queue, if they are all on separate conversations, RECEIVE will return only one. This usually results in poor performance, with symptoms just like the ones you describe.
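To make the effect concrete, here is a minimal toy model in Python (not T-SQL, and not how Service Broker is implemented). It assumes only what is stated above: one RECEIVE call returns messages from a single conversation group. The names `receive`, `drain`, and `build_queue` are hypothetical, and the "oldest group first" order is a simplification.

```python
from collections import OrderedDict


def build_queue(n_messages, n_groups):
    """Spread n_messages round-robin over n_groups conversation groups."""
    queue = OrderedDict()
    for i in range(n_messages):
        queue.setdefault(i % n_groups, []).append(f"msg-{i}")
    return queue


def receive(queue):
    """Toy RECEIVE: return all queued messages of ONE conversation group."""
    if not queue:
        return []
    _group, messages = queue.popitem(last=False)  # simplification: oldest group first
    return messages


def drain(queue):
    """Call receive() until the queue is empty; return (calls, messages)."""
    calls = 0
    total = 0
    while queue:
        total += len(receive(queue))
        calls += 1
    return calls, total


scattered = drain(build_queue(1000, 1000))  # every message on its own conversation
batched = drain(build_queue(1000, 10))      # the same messages on 10 conversations

print(scattered)  # (1000, 1000): 1000 calls for 1000 messages
print(batched)    # (10, 1000): 10 calls for 1000 messages
```

In this model the number of RECEIVE executions tracks the number of conversation groups rather than the number of messages, which is the pattern the SENDs/sec and RECEIVEs/sec counters would be hinting at.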

To achieve high throughput, you'll have to somehow get the messages to belong to a small number of conversations, so that RECEIVE can yield a significant result set on the queues that have the problem. How to achieve this depends on the specifics of your business workflow.
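One possible way to do this, offered only as a sketch, is for the dispatcher to keep one long-lived conversation per target and reuse its handle, instead of starting a new dialog for every message. The Python below models just the bookkeeping. `ConversationCache`, `dispatch`, and the client names are hypothetical; `uuid.uuid4()` stands in for the handle that `BEGIN DIALOG CONVERSATION` would return; and the 35 clients with 3 changes each are made-up numbers chosen to match the N=35 case in the question.

```python
import uuid


class ConversationCache:
    """Hypothetical sender-side cache: one long-lived conversation per target."""

    def __init__(self):
        self._handles = {}

    def handle_for(self, target):
        # Start a conversation only the first time a target is seen.
        if target not in self._handles:
            self._handles[target] = uuid.uuid4()  # stands in for BEGIN DIALOG
        return self._handles[target]


def dispatch(events, clients, cache=None):
    """Fan each event out to every client except its originator."""
    sent = []
    for origin, payload in events:
        for target in clients:
            if target == origin:
                continue
            # Without a cache, every message gets a brand-new conversation.
            handle = cache.handle_for(target) if cache is not None else uuid.uuid4()
            sent.append((target, handle, payload))
    return sent


clients = [f"client-{i}" for i in range(35)]
events = [(c, f"change-{n}") for n in range(3) for c in clients]  # 105 events

fresh = dispatch(events, clients)
reused = dispatch(events, clients, ConversationCache())

print(len(fresh), len({h for _, h, _ in fresh}))    # 3570 messages, 3570 conversations
print(len(reused), len({h for _, h, _ in reused}))  # 3570 messages, 35 conversations
```

The trade-off is that messages sharing a conversation are locked and processed together, so fewer conversations also means fewer readers can work on a queue in parallel. Whether one conversation per target, or a small pool of them, is the right grouping depends on the workflow.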

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow