Question

I am measuring "hiccups" using the jHiccup tool built by Azul Systems. It collects data to identify the frequency and duration of pauses (hiccups) that occur while a JVM runs a Java application. It measures at the JVM level and below it (OS, drivers, and so on).

Here are the jHiccup results. They were obtained on a machine with SUSE SLERT 11 (2.6.33 PREEMPT_RT kernel), an Intel i5, and 4 GB of memory. The process ran under CPU shielding (3 logical processors isolated) at priority 99 (SCHED_FIFO). I am wondering where this 57 µs latency comes from. The application is pretty straightforward: it is a network order-processing system, so it parses a TCP packet feed, does some simple business logic, and that's all. No GC, no synchronization, and it is single-threaded.

My guess is it might be a network issue, e.g. a blocking read? When I tried a non-blocking read with busy-waiting, I got similar results, but maybe I am doing it wrong. I have no clue where these hiccups could come from.
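Roughly what I mean by a non-blocking read with busy-waiting is the sketch below (not my actual code; the host, port, and buffer size are placeholders):

    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class BusyWaitReader {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint for the order feed.
            SocketChannel ch = SocketChannel.open(new InetSocketAddress("feed-host", 9000));
            ch.configureBlocking(false);            // non-blocking: read() returns 0 instead of parking the thread
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);

            while (true) {
                int n = ch.read(buf);               // spin: no selector, no blocking syscall
                if (n > 0) {
                    buf.flip();
                    // parse the TCP payload / run business logic here
                    buf.clear();
                } else if (n < 0) {
                    break;                          // peer closed the connection
                }
                // n == 0: nothing arrived yet; loop again (burns a CPU, avoids wake-up latency)
            }
            ch.close();
        }
    }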


Solution

IRQ balance distributes interrupt handling across your CPUs. You can turn it off or control its masks so you don't get interrupted as much. (Unfortunately there are two interrupts you cannot turn off.)
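One way to see which CPUs each interrupt is currently allowed to land on is to read the kernel's per-IRQ affinity masks; the read-only sketch below does that from Java (changing the masks is normally done as root by writing these same files or by configuring irqbalance, which is outside this sketch):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class IrqMasks {
        public static void main(String[] args) throws IOException {
            // Each /proc/irq/<n>/smp_affinity holds a hex CPU mask for that interrupt.
            try (DirectoryStream<Path> irqs = Files.newDirectoryStream(Paths.get("/proc/irq"))) {
                for (Path irq : irqs) {
                    Path mask = irq.resolve("smp_affinity");
                    if (Files.isReadable(mask)) {
                        System.out.printf("IRQ %-5s mask %s%n",
                                irq.getFileName(),
                                new String(Files.readAllBytes(mask)).trim());
                    }
                }
            }
        }
    }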

Logical processors on the same core can disturb each other. For best results, I would isolate a whole core and only use that.
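To check which logical CPUs are hyperthread siblings of each other (and so should not both be used for critical work), something like this sketch reads the kernel's topology files:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CoreSiblings {
        public static void main(String[] args) throws IOException {
            // thread_siblings_list names the logical CPUs sharing the same physical core.
            int cpu = 0;
            while (true) {
                Path p = Paths.get("/sys/devices/system/cpu/cpu" + cpu,
                                   "topology", "thread_siblings_list");
                if (!Files.exists(p)) break;
                System.out.println("cpu" + cpu + " shares a core with: "
                        + new String(Files.readAllBytes(p)).trim());
                cpu++;
            }
        }
    }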

Even though you have shielded the application, it has many threads. For best results I have used Linux to isolate a number of cores and assign only the critical threads to those cores, i.e. the other threads in the same application don't use those cores at all.

To control this, I have written the Java Thread Affinity library. Even with this library I see some jitter (though as much as 10x less), which could be caused by power management or the local timer interrupt.
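A minimal sketch of pinning the critical thread with that library is below, assuming the net.openhft.affinity package is on the classpath (the exact API differs between versions of the library):

    import net.openhft.affinity.AffinityLock;

    public class PinnedWorker {
        public static void main(String[] args) {
            // Reserve a whole core (both hyperthreads) for the critical loop.
            AffinityLock lock = AffinityLock.acquireCore();
            try {
                System.out.println("Critical thread pinned to CPU " + lock.cpuId());
                runCriticalLoop();
            } finally {
                lock.release();   // give the core back when done
            }
        }

        private static void runCriticalLoop() {
            // parse the feed, run business logic, etc.
        }
    }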

OTHER TIPS

This is a very interesting question, and it's also an unusual jHiccup profile. Working in a big bank, I typically see multimodal jHiccup curves for complex applications - you appear to have a single path that everything from your 20th to your 99.999th percentile transactions takes. That's extraordinary and something a lot of people would love to emulate (though they might want the 57 µs to be a lot smaller). There are so many things that could be causing this that it is probably most effective to bisect the problem by finding a variable that can shift the 57 µs number: CPU frequency, NIC latency, context-switch cost, synchronous write cost, thread-scheduling fairness.

There is a lot you can do to dig deeper:

  1. Analysis: I'm struck by how flat your "hiccups by percentile distribution" curve is, and that it extends below the 90th percentile, suggesting that there is a single, very common pause event of around 57 µs. What happens if you reduce the bucket size and shrink the horizontal axis (see the sketch after this list)? Do you see whether the pauses are uniformly, normally, binomially, or periodically distributed? Are you using 10GbE? Does your app show a very constant correlation between workload and context switching (above 0.85 r-squared)?

  2. You can try to tweak a couple of knobs and see whether the size of the 57 µs pause changes. Remember, this isn't tweaking to improve; it's tweaking to see the needle move in either direction. You say that disabling irq_balancer didn't help - did it cause the size of your pause to vary? I'd begin by testing whether CPU frequency affects it. If you are running on an E5-2690, do you see the same or different latencies on an E5-2650? If you don't have diverse hardware, you could try to get at this by changing the maximum C-state / turbo settings. I'd also try adjusting the interrupt-coalescing settings on the NIC, so as to change how the NIC batches network operations. If neither causes the needle to move, then you know it isn't a straightforward NIC-latency or CPU effect.

In the same vein, I'd also try running on an older kernel that has the RHEL 5 memory barrier bugs, the faster context switches, and different process-scheduling fairness behaviour. Tools like https://github.com/tsuna/contextswitch can characterize this. Once you identify a variable that shifts the 57 µs amplitude of your pauses, you are 75% of the way there.
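As a concrete way of shrinking the bucket size mentioned in point 1, a standalone hiccup meter along these lines (a sketch of the jHiccup idea, not jHiccup itself) records the overshoot of a 1 ms sleep in 1 µs buckets, so you can see whether the ~57 µs pauses are uniform, periodic, or clustered; all the constants here are placeholders to adjust:

    public class MicroHiccupMeter {
        public static void main(String[] args) throws InterruptedException {
            final int BUCKET_US = 1;                 // 1 microsecond buckets
            final int MAX_US    = 200;               // anything above 200 us lands in the last bucket
            long[] histogram = new long[MAX_US / BUCKET_US + 1];

            long samples = 60_000;                   // ~1 minute of 1 ms sleeps
            for (long i = 0; i < samples; i++) {
                long t0 = System.nanoTime();
                Thread.sleep(1);                     // expected pause: 1 ms
                long overshootUs = (System.nanoTime() - t0) / 1_000 - 1_000;
                int bucket = (int) Math.min(Math.max(overshootUs, 0) / BUCKET_US, MAX_US / BUCKET_US);
                histogram[bucket]++;
            }

            for (int b = 0; b < histogram.length; b++) {
                if (histogram[b] != 0) {
                    System.out.printf("%3d us: %d%n", b * BUCKET_US, histogram[b]);
                }
            }
        }
    }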

You might also try using Zing, if you're currently using the Oracle JVM, and see if that shifts anything.

Let us know what happens,

Peter

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow