Question

I am trying to create a sampling profiler that works on Linux. I am unsure how to send an interrupt, or how to get the program counter (PC) so I can find out where the program is when I interrupt it.

I have tried using signal(SIGUSR1, Foo*) and calling backtrace, but I get the stack for the thread I am in when I call raise(SIGUSR1), rather than for the thread the program is actually running on. I am not really sure this is even the right way to go about it...

Any advice?


Solution 2

You can send a signal to a specific thread using pthread_kill and the tid (gettid()) of the target thread.

The right way to create a simple profiler is to use setitimer, which can deliver a periodic signal (SIGALRM or SIGPROF), for example every 10 ms; or POSIX timers (timer_create, timer_settime, or timerfd), without needing a separate thread to send the profiling signals. Check the sources of google-perftools (gperftools): they use setitimer or POSIX timers and collect profiles with backtraces.

gprof also uses setitimer to implement CPU-time profiling (see 9.1 Implementation of Profiling: "Linux 2.0 ..arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer())").

For example, a code search for setitimer in gperftools's sources (https://code.google.com/p/gperftools/codesearch#search/&q=setitimer&sq=package:gperftools&type=cs) turns up:

void ProfileHandler::StartTimer() {
  if (!allowed_) {
    return;
  }
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 1000000 / frequency_;
  timer.it_value = timer.it_interval;
  setitimer(timer_type_, &timer, 0);
}

You should know that setitimer has problems with fork and clone, and historically it did not work well with multithreaded applications. There was an attempt at a helper wrapper: http://sam.zoy.org/writings/programming/gprof.html (a flawed one), but I don't remember whether it works correctly (setitimer usually sends a process-wide signal, not a thread-wide one). UPD: it seems that since Linux kernel 2.6.12, setitimer's signal is directed to the process as a whole (any thread may receive it).

To direct the signal from timer_create to a specific thread, you need gettid() (#include <sys/syscall.h>, syscall(__NR_gettid)) and the SIGEV_THREAD_ID flag. I haven't checked how to create a periodic POSIX timer with timer_create (probably with timer_settime and a non-zero it_interval).

PS: there is an overview of profiling on Wikibooks: http://en.wikibooks.org/wiki/Introduction_to_Software_Engineering/Tools/Profiling

Other tips

If you must write a profiler, let me suggest you use a good one (Zoom) as your model, not a bad one (gprof). These are its characteristics.

There are two phases. First is the data-gathering phase:

  • When it takes a sample, it reads the whole call stack, not just the program counter.

  • It can take samples even when the process is blocked due to I/O, sleep, or anything else.

  • You can turn sampling on/off, so as to only take samples during times you care about. For example, while waiting for the user to type something, it is pointless to be sampling.

Second is the data-presentation phase. What you have is a collection of stack samples, where a stack sample is a vector of memory addresses, which are almost all return addresses. Each return address indicates a line of code in a function, unless it's in some system routine you don't have symbolic information for.

The key piece of useful information is residency fraction (usually expressed as a percent). If there are a total of m stack samples, and line of code L is present anywhere on n of them, then its residency fraction is n/m. This holds even if L appears more than once on a sample; that is still just one sample it appears on. The importance of residency fraction is that it directly indicates what fraction of time statement L is responsible for. If you have taken m=1000 samples, and L appears on n=300 of them, then L's residency fraction is 300/1000 or 30%. This means that if L could be removed, total time would decrease by 30%. It is typically known as inclusive percent.

You can determine residency fraction not just for lines of code, but for anything else you can describe. For example, line of code L is inside some function F. So you can determine the residency fraction for functions, as opposed to lines of code. That would give you inclusive percent by function. You could look at function pairs, like on what fraction of samples do you see function F calling function G. That would give you the information that makes up call graphs.

There are all kinds of information you can get out of the stack samples. One that is often seen is a "butterfly view", where you have a "focus" on one line L or function F, and on one side you show all the lines or functions immediately above it in the stack samples, and on the other side all the lines or functions immediately below it. On each of these, you can show the residency fraction. You can click around in this to try to find lines of code with high residency fraction that you can find a way to eliminate or reduce. That's how you speed up the code.

Whatever you do for output, I think it is very important to allow the user to actually examine a small number of the samples themselves, randomly selected. They convey far more insight than can be gotten from any method that condenses the information.

As important as it is to know what the profiler should do, it is also important to know what not to do, even if lots of other profilers do them:

  • self time. A useless number. Look at some reasonable-size programs and you will see why.

  • invocation counts. Of no help in finding code with high residency fraction, and you can't get it with samples alone anyway.

  • high-frequency sampling. It's amazing how many people, certainly profiler builders, think it is important to get lots of samples. Suppose line L is on 30% of 1000 samples. Then its true inclusive percent is 30 +/- 1.4 percent. On the other hand, if it is on 30% of 10 samples, its inclusive percent is 30 +/- 14 percent. It's still pretty big - big enough to fix. What happens in most profilers is people think they need "numerical precision", so they take lots of samples and accumulate what they call "statistics", and then throw away the samples. That's like digging up diamonds, weighing them, and throwing them away. The real value is in the samples themselves, because they tell you what the problem is.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow