How does the 95th percentile determine what the default timeout settings on the client should be?

https://stackoverflow.com//questions/24022588

21-12-2019

Question

I have a client library that calls my service using RestTemplate, gets the response back as a String, and returns that string to the customer who is using our client.

The overall SLA (95th percentile) for our client is ~15 ms, meaning that 95 percent of the time, calls should come back from our client within ~15 ms. The default timeout our client uses internally is ~500 ms.

After load and performance testing on our client, performance looks quite good: 95 percent of the time, the call comes back within ~10 ms.

Problem Statement:

My questions are:

As mentioned above, the default timeout is ~500 ms internally in our client, and the 95th percentile came out at ~10 ms in our load and performance testing. I read somewhere that you should always set a timeout higher than your current SLA, so that all calls are allowed to go through, and then measure the 95th percentile. Is this true? Or should I set a timeout of 60-70 ms internally on the client? I guess that in that case the majority of the calls would time out?

If I set the internal default timeout to ~50 ms on our client, then from my understanding the 95th percentile will not come within ~15 ms, because we are not allowing all the calls to go through. Right?

I am trying to understand how the 95th percentile is affected when the timeout is set much higher than our SLA versus when it is set fairly close to our SLA. For example, with an SLA of ~15 ms, what is the difference between an internal timeout of ~100 ms and one of ~500 ms?

Solution

You need to be more precise. It doesn't mean "95 percentage of time". It means either "95% of completed calls" or "95% of all calls, including complete failures". Probably the latter, but you need to check.
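To make the difference between the two definitions concrete, here is a minimal sketch in Java. It assumes a nearest-rank percentile and counts each failed call as an infinite latency; the class name, helper names, and sample latencies are all made up for illustration, not taken from your client.

```java
import java.util.Arrays;

public class PercentileDemo {
    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Appends `failures` entries of +infinity so a failed call can never satisfy a latency bound. */
    static double[] withFailures(double[] completed, int failures) {
        double[] all = Arrays.copyOf(completed, completed.length + failures);
        Arrays.fill(all, completed.length, all.length, Double.POSITIVE_INFINITY);
        return all;
    }

    public static void main(String[] args) {
        // Hypothetical latencies in ms for 18 completed calls; 2 further calls failed outright.
        double[] completed = {5, 6, 6, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 12};
        double slaMs = 15.0;

        double p95Completed = percentile(completed, 95);
        double p95All = percentile(withFailures(completed, 2), 95);

        System.out.println("p95 over completed calls: " + p95Completed + " ms, within SLA: " + (p95Completed <= slaMs));
        System.out.println("p95 over all calls: " + p95All + " ms, within SLA: " + (p95All <= slaMs));
    }
}
```

With this made-up data, the completed calls alone meet the 15 ms bound, but once the two failures are counted (10% of all calls) the 95th percentile falls on a failed call and the bound is missed. That is why you need to know which definition your SLA uses.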

Then you need to do two things.

For compliance testing, set the timeout = the SLA so you can measure whether you comply.
Then in production set a sensible timeout. 15 ms and 500 ms are both far too short for production timeouts. I would set it to a couple of seconds at least, maybe as high as 30 s. One rule of thumb is to set it to double the expected service time, but that's far too short in this case.
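The two settings can be sketched as a pair of timeout profiles. This example uses the JDK's `java.net.http.HttpClient` rather than RestTemplate, purely so that it has no Spring dependency. With RestTemplate the corresponding knobs are `setConnectTimeout` and `setReadTimeout` on `SimpleClientHttpRequestFactory`. The class name, the 2-second production value, and the localhost URL are assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutProfiles {
    // Assumed values: the 15 ms SLA from the question, and "a couple of seconds" for production.
    static final Duration SLA = Duration.ofMillis(15);
    static final Duration PRODUCTION_TIMEOUT = Duration.ofSeconds(2);

    /** Compliance tests use timeout = SLA; production uses the more generous value. */
    static Duration requestTimeout(boolean complianceTest) {
        return complianceTest ? SLA : PRODUCTION_TIMEOUT;
    }

    /** Builds a GET request whose per-request timeout depends on the chosen profile. */
    static HttpRequest buildRequest(URI uri, boolean complianceTest) {
        return HttpRequest.newBuilder(uri)
                .timeout(requestTimeout(complianceTest))
                .GET()
                .build();
    }

    /** The connect timeout is set on the client; here it reuses the production value. */
    static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(PRODUCTION_TIMEOUT)
                .build();
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://localhost:8080/service"); // hypothetical endpoint
        System.out.println("compliance timeout: " + buildRequest(uri, true).timeout().orElseThrow());
        System.out.println("production timeout: " + buildRequest(uri, false).timeout().orElseThrow());
    }
}
```

In a compliance run, every call that exceeds the SLA then shows up as a timeout, so the share of timed-out calls tells you directly whether you are within the 5% allowance.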

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow