Count floating-point instructions
20-12-2019
Question
I am trying to count the number of floating-point operations in one of my programs, and I think perf could be the tool I am looking for (are there any alternatives?), but I have trouble limiting it to a certain function or block of code. Let's take the following example:
#include <complex>
#include <cstdlib>
#include <iostream>
#include <type_traits>
template <typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type myrand()
{
    return static_cast<T>(std::rand()) / static_cast<T>(RAND_MAX);
}
template <typename T>
typename std::enable_if<!std::is_floating_point<T>::value, std::complex<typename T::value_type>>::type myrand()
{
    typedef typename T::value_type S;
    return std::complex<S>(
        static_cast<S>(std::rand()) / static_cast<S>(RAND_MAX),
        static_cast<S>(std::rand()) / static_cast<S>(RAND_MAX)
    );
}
int main()
{
    auto const a = myrand<Type>();
    auto const b = myrand<Type>();
    // count here
    auto const c = a * b;
    // stop counting here
    // prevent compiler from optimizing away c
    std::cout << c << "\n";
    return 0;
}
The myrand() function simply returns a random number; if the type T is complex, it returns a random complex number. I did not hardcode doubles into the program because the compiler would optimize them away.
You can compile the file (let's call it bench.cpp) with c++ -std=c++0x -DType=double bench.cpp.
Now I would like to count the number of floating-point operations. On my processor (Nehalem architecture, x86_64, where floating point is done with scalar SSE) this can be done with the event r8010 (see Intel Manual 3B, Section 19.5):
perf stat -e r8010 ./a.out
This works as expected; however, it counts the overall number of uops for the whole program (is there a table telling how many uops an instruction such as movsd corresponds to?), and I am only interested in the number for the multiplication (see the example above).
How can this be done?
Solution
I finally found a way to do this, although not with the perf tool itself but with the corresponding perf API. One first has to define a perf_event_open function, which is just a thin wrapper around the syscall of the same name:
#include <cstdlib> // stdlib.h for C
#include <cstdio> // stdio.h for C
#include <cstring> // string.h for C
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>
long perf_event_open(
    perf_event_attr* hw_event,
    pid_t pid,
    int cpu,
    int group_fd,
    unsigned long flags
) {
    // syscall() returns long; return it directly instead of narrowing to int
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}
Next, one selects the events one wishes to count:
perf_event_attr attr;
// select what we want to count
std::memset(&attr, 0, sizeof(perf_event_attr));
attr.size = sizeof(perf_event_attr);
attr.type = PERF_TYPE_HARDWARE;
attr.config = PERF_COUNT_HW_INSTRUCTIONS;
attr.disabled = 1;
attr.exclude_kernel = 1; // do not count the instructions the kernel executes
attr.exclude_hv = 1; // do not count the instructions the hypervisor executes
// open a file descriptor
int fd = perf_event_open(&attr, 0, -1, -1, 0);
if (fd == -1)
{
    // handle error (errno holds the reason)
}
In this case I simply want to count the number of instructions. On my processor (Nehalem), floating-point instructions can be counted by replacing the corresponding lines with
attr.type = PERF_TYPE_RAW;
attr.config = 0x8010; // Event Number = 10H, Umask Value = 80H
Setting the type to PERF_TYPE_RAW lets one count essentially any event the processor offers; the number 0x8010 specifies which one. Note that this number is highly processor-dependent! The right numbers can be found in the Intel Manual 3B, Part 2, Chapter 19, in the subsection for the processor in question.
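To make the encoding explicit, here is a minimal sketch of how the raw value is put together, assuming the usual Intel layout in which the low byte of attr.config holds the event number and the next byte holds the unit mask. The helper name raw_config is made up for illustration and is not part of the perf API.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the perf API): packs an Intel event
// number and unit mask into a value for perf_event_attr::config when
// attr.type is PERF_TYPE_RAW. Assumed layout: bits 0-7 hold the event
// number, bits 8-15 hold the umask.
constexpr std::uint64_t raw_config(std::uint8_t event, std::uint8_t umask)
{
    return (static_cast<std::uint64_t>(umask) << 8) | event;
}

// Event Number = 10H, Umask Value = 80H gives the 0x8010 used above.
static_assert(raw_config(0x10, 0x80) == 0x8010, "unexpected raw encoding");
```

With this, attr.config = raw_config(0x10, 0x80); is equivalent to writing 0x8010 directly, and the two fields can be copied straight from the table in the manual.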
One can then measure the code by enclosing it in
// reset and enable the counter
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
// perform computation that should be measured here
// disable and read out the counter
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
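long long count = 0;
if (read(fd, &count, sizeof(count)) != static_cast<ssize_t>(sizeof(count)))
{
    // handle error: the counter value could not be read
}
// count now holds the (approximate) result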
// close the file descriptor
close(fd);
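For convenience, the pieces above can be folded into one function. The following is only a sketch under a few assumptions: the name count_instructions is hypothetical, it counts the generic PERF_COUNT_HW_INSTRUCTIONS event rather than the Nehalem-specific raw event, and it returns -1 whenever the counter cannot be opened or read (for instance when the system does not permit unprivileged use of perf events).

```cpp
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

// Hypothetical helper: runs `work` once and returns the number of
// user-space instructions counted while it ran, or -1 if the counter
// could not be opened or read.
template <typename F>
long long count_instructions(F&& work)
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        // start disabled, enable explicitly below
    attr.exclude_kernel = 1;  // user space only
    attr.exclude_hv = 1;

    // pid = 0 and cpu = -1: measure the calling thread on any CPU
    long const ret = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (ret == -1)
        return -1;
    int const fd = static_cast<int>(ret);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    bool const ok =
        read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count));
    close(fd);
    return ok ? count : -1;
}
```

Usage might look like long long n = count_instructions([&] { c = a * b; }); with c declared before the call. The result also includes the few user-space instructions spent returning from the enabling ioctl and entering the disabling one, so it is best read as an upper bound for very short blocks.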