How to use linux `perf` tool to generate "Off-CPU" profile

Question 1

The perf technique I published[1] was a high-overhead workaround, until perf has BPF support for doing this.

Right now, the lowest cost way of generating an off-CPU flame graph on Linux is on a 4.6+ kernel (which has BPF stack trace support), and with bcc/BPF. I wrote a tool for it, offcputime[2], which can be run with a -f option for "folded output", suitable for feeding into flamegraph.pl. This offcputime tool does the timing and stack counting all in kernel content, and dumps a report that is then printed with symbols.

One day, I expect that perf itself will be able to do this as well: run a BPF program that does the in-kernel counting, and dumping of a report.

In the meantime, we can use bcc/BPF. If for some reason you can't use bcc, you can, right now, take that offcputime program and write it in C. A more complicated version is available in the Linux source, as samples/bpf/offwaketime*. With the new BPF features on Linux, if there's a will, there's a way.

[1] http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html

[2] https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt

Question 2

Brendan Gregg published instruction about Off-cpu flame graph generating: http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html and https://github.com/brendangregg/FlameGraph/issues/47#

Off-CPU time flame graphs may solve (say) 60% of the issues, with the remainder requiring walking the thread wakeups to find root cause. I explained off-CPU time flame graphs, this wakeup issue, and additional work, in my LISA13 talk on flame graphs (slides, youtube).

Here I'll show one way to do off-CPU time flame graphs using Linux perf_events.

# perf record -e sched:sched_stat_sleep -e sched:sched_switch \
 -e sched:sched_process_exit -a -g -o perf.data.raw sleep 1
# perf inject -v -s -i perf.data.raw -o perf.data
# perf script -f comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
./stackcollapse.pl | \
./flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg

stackcollapse.pl and flamegraph.pl from Gregg are used to draw flamegraph.

There are perf options used from 3.17 kernels and newer...