Question

I am trying to count the number of floating-point operations in one of my programs. I think perf could be the tool I am looking for (are there any alternatives?), but I have trouble limiting it to a specific function or block of code. Take the following example:

#include <complex>
#include <cstdlib>
#include <iostream>
#include <type_traits>

// real floating-point types: one random number in [0, 1]
template <typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type myrand()
{
    return static_cast<T>(std::rand()) / static_cast<T>(RAND_MAX);
}

// std::complex types: random real and imaginary parts in [0, 1]
template <typename T>
typename std::enable_if<!std::is_floating_point<T>::value, std::complex<typename T::value_type>>::type myrand()
{
    typedef typename T::value_type S;

    return std::complex<S>(
        static_cast<S>(std::rand()) / static_cast<S>(RAND_MAX),
        static_cast<S>(std::rand()) / static_cast<S>(RAND_MAX)
    );
}

int main()
{
    auto const a = myrand<Type>();
    auto const b = myrand<Type>();

    // count here
    auto const c = a * b;
    // stop counting here

    // prevent compiler from optimizing away c
    std::cout << c << "\n";

    return 0;
}

The myrand() function returns a random number; if the type T is complex, it returns a random complex number. I did not hardcode doubles into the program because the compiler would optimize them away.

You can compile the file (let's call it bench.cpp) with c++ -std=c++0x -DType=double bench.cpp.

Now I would like to count the number of floating point operations, which can be done on my processor (Nehalem architecture, x86_64 where floating point is done with scalar SSE) with the event r8010 (see Intel Manual 3B, Section 19.5). This can be done with

perf stat -e r8010 ./a.out

and works as expected. However, it counts the uops for the whole program run (is there a table that lists how many uops an instruction such as movsd takes?), and I am only interested in the count for the multiplication (see the example above).

How can this be done?


Solution

I finally found a way to do this, not with the perf tool itself but with the underlying kernel API. First, one has to define a perf_event_open function, because glibc provides no wrapper for this syscall:

#include <cstdlib> // stdlib.h for C
#include <cstdio> // stdio.h for C
#include <cstring> // string.h for C
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

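long perf_event_open(
    perf_event_attr* hw_event,
    pid_t pid,
    int cpu,
    int group_fd,
    unsigned long flags
) {
    // returns a file descriptor, or -1 with errno set on failure;
    // syscall() already returns long, so no intermediate int is needed
    return syscall(__NR_perf_event_open, hw_event, pid, cpu, group_fd, flags);
}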

Next, one selects the events one wishes to count:

perf_event_attr attr;

// select what we want to count
std::memset(&attr, 0, sizeof(perf_event_attr));
attr.size = sizeof(perf_event_attr);
attr.type = PERF_TYPE_HARDWARE;
attr.config = PERF_COUNT_HW_INSTRUCTIONS;
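attr.disabled = 1; // start disabled; enabled explicitly around the measured code
attr.exclude_kernel = 1; // do not count the instructions the kernel executes
attr.exclude_hv = 1; // do not count the instructions the hypervisor executes

// open a file descriptor
// pid = 0 and cpu = -1: measure the calling thread on any CPU; group_fd = -1: no event group
int fd = perf_event_open(&attr, 0, -1, -1, 0);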

if (fd == -1)
{
    // handle error
}
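
The "handle error" branch above could be filled in along the following lines. This is only a sketch: the helper name describe_perf_open_error is hypothetical, and the hints cover just a few of the errno values that perf_event_open can set.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turn the errno left behind by a failed
// perf_event_open call into a readable message with a hint.
std::string describe_perf_open_error(int err)
{
    std::string msg = "perf_event_open failed: ";
    msg += std::strerror(err);

    switch (err)
    {
    case EACCES:
    case EPERM:
        // unprivileged access is governed by this sysctl
        msg += " (check /proc/sys/kernel/perf_event_paranoid)";
        break;
    case ENOENT:
        msg += " (event type or config not supported here)";
        break;
    case ENOSYS:
        msg += " (not supported by this kernel or hardware)";
        break;
    default:
        break;
    }
    return msg;
}
```

Inside the if (fd == -1) branch one would pass errno to this helper right after the failed call, before any other library call can overwrite it, and print or log the returned string.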

In this case I simply count the number of retired instructions. On my processor (Nehalem), floating-point operations can be counted instead by replacing the corresponding lines with

attr.type = PERF_TYPE_RAW;
attr.config = 0x8010; // Event Number = 10H, Umask Value = 80H

Setting the type to PERF_TYPE_RAW lets one count essentially any event the processor offers; the number 0x8010 specifies which one. Note that this number is highly processor-dependent! The right numbers are listed in the Intel Manual 3B, Part 2, Chapter 19, in the subsection for the processor in question.
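
To make the layout of that number explicit: on x86, the raw config holds the event number in bits 0-7 and the umask in bits 8-15, which is the same layout as the r8010 notation on the perf command line. A minimal sketch, where the helper name raw_config is hypothetical:

```cpp
#include <cstdint>

// Hypothetical helper: build an x86 raw perf config from the
// event number and umask listed in the Intel manual.
// Bits 0-7 hold the event number, bits 8-15 the umask.
constexpr std::uint64_t raw_config(std::uint8_t event, std::uint8_t umask)
{
    return (static_cast<std::uint64_t>(umask) << 8) | event;
}

// Event Number = 10H, Umask Value = 80H gives the 0x8010 used above
static_assert(raw_config(0x10, 0x80) == 0x8010, "unexpected raw encoding");
```

With this, the assignment could be written as attr.config = raw_config(0x10, 0x80);. The remaining config bits (edge detect, invert, counter mask) are left at zero in this sketch.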

One can then measure the code by enclosing it in

// reset and enable the counter
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

// perform computation that should be measured here

// disable and read out the counter
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
long long count = 0;
if (read(fd, &count, sizeof(count)) != sizeof(count))
{
    // handle error
}
// count now holds the (approximate) result

// close the file descriptor
close(fd);
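
Putting the pieces together, the open/enable/disable/read sequence can be wrapped in one function so that the measured region is just a callable. The following is a sketch under assumptions, not a definitive implementation: the names Measurement and measure are hypothetical, the syscall is invoked directly instead of through the wrapper above, and the workload is run even when the counter cannot be opened so that the program behaves the same either way.

```cpp
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// Result of one measurement; ok is false if the counter could not
// be opened or read (for example, because of missing permissions).
struct Measurement
{
    bool ok;
    long long count;
};

// Hypothetical helper: count user-space occurrences of one event
// while work() runs on the calling thread.
template <typename F>
Measurement measure(unsigned type, unsigned long long config, F work)
{
    Measurement result = {false, 0};

    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    // pid 0, cpu -1: this thread, on whichever CPU it runs
    int fd = static_cast<int>(
        syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL));

    if (fd != -1)
    {
        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    }

    work(); // the region being measured

    if (fd != -1)
    {
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long count = 0;
        if (read(fd, &count, sizeof(count)) == static_cast<ssize_t>(sizeof(count)))
        {
            result.ok = true;
            result.count = count;
        }
        close(fd);
    }
    return result;
}
```

For the example from the question, the multiplication would go into a lambda passed as work, with PERF_TYPE_RAW and 0x8010 as the first two arguments on Nehalem. When counting instructions, the result also includes the few user-space instructions executed around the two ioctl calls, so it should be treated as approximate.
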
Licensed under: CC-BY-SA with attribution