Question

I was playing around with high-precision timers, and one of my first tests was to use rdtsc to measure printf. Below is my test program, followed by its output. The thing I noticed is that the first call to printf consistently takes about 25 times longer than the subsequent calls. Why is that?

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>   /* PRIu64 for printing uint64_t portably */

// Sample code grabbed from wikipedia: cpuid serializes the pipeline so
// rdtsc isn't reordered past earlier instructions; rdtsc then returns
// the 64-bit time-stamp counter in edx:eax.
static __inline__ uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ (
            "xorl %%eax,%%eax \n        cpuid"
            ::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(int argc, const char *argv[])
{
    unsigned int i;
    uint64_t counter[10];
    uint64_t sum = 0;
    for (i = 0; i < 10; i++)
    {
        counter[i] = rdtsc();
        printf("Hello, world\n");
        counter[i] = rdtsc() - counter[i];
    }

    for (i = 0; i < 10; i++)
    {
        printf("counter[%d] = %lld\n", i, counter[i]);
        sum += counter[i];
    }
    printf("avg = %lld\n", sum/10);
    return 0;
}

And the output:

Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
Hello, world
counter[0] = 108165
counter[1] = 6375
counter[2] = 4388
counter[3] = 4388
counter[4] = 4380
counter[5] = 4545
counter[6] = 4215
counter[7] = 4290
counter[8] = 4237
counter[9] = 4320
avg = 14930

(For reference, this was compiled with gcc on OSX)


Solution

My guess is that on the first call to printf, the stdout resources are not in the cache, and the call has to bring them in, hence it's slower. For all subsequent calls, the cache is already warm.
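
One way to test that guess, as a minimal sketch of my own rather than anything from the original post: make one untimed printf call before the loop, so the first measured iteration already runs against a warm cache and an initialized stream. If the cold-start hypothesis is right, counter[0] should fall roughly in line with the other samples.

#include <stdio.h>
#include <stdint.h>

/* Same rdtsc() helper as in the question. */
static __inline__ uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ (
            "xorl %%eax,%%eax \n        cpuid"
            ::: "%rax", "%rbx", "%rcx", "%rdx");
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return (uint64_t)hi << 32 | lo;
}

int main(void)
{
    unsigned int i;
    uint64_t counter[10];

    /* Untimed warm-up call: pays any one-time cost (cold caches,
       lazy stdio setup) before measurement starts. */
    printf("warm-up\n");

    for (i = 0; i < 10; i++)
    {
        counter[i] = rdtsc();
        printf("Hello, world\n");
        counter[i] = rdtsc() - counter[i];
    }

    for (i = 0; i < 10; i++)
        printf("counter[%u] = %llu\n", i, (unsigned long long)counter[i]);
    return 0;
}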

A second possible explanation is that, if this is on Linux (it may also apply to OSX, I'm not sure), the program needs to set the stream orientation (byte-oriented vs. wide-oriented, i.e. ASCII vs. Unicode). This is done on the first call to a function using that stream and stays fixed until the stream is closed. I don't know what the overhead of setting this orientation is, but it's a one-time cost.
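
For what it's worth, the orientation can be queried with fwide() from <wchar.h>. The sketch below is my own illustration (not part of the original answer) showing that the orientation is unset before the first printf and fixed afterwards:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* fwide(stream, 0) only queries the orientation:
       0 = not set yet, <0 = byte-oriented, >0 = wide-oriented. */
    int before = fwide(stdout, 0);

    printf("Hello, world\n");   /* first byte I/O fixes the orientation */

    int after = fwide(stdout, 0);
    fprintf(stderr, "orientation before=%d, after=%d\n", before, after);
    return 0;
}

On a typical libc this prints 0 for "before" and a negative value for "after".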

Please feel free to correct me if you think I'm completely wrong.

Other tips

Perhaps the first time, the code for printf isn't in the instruction cache, so it has to be loaded in from main memory. On subsequent runs, it's already in the cache.

That's about 50 microseconds. Perhaps a caching issue? It's too short to have anything to do with loading from the hard drive, but believable for loading a large chunk of the C I/O library from RAM.
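
As a rough sanity check (the ~2 GHz clock rate here is my assumption about the test machine, not something stated in the question): 108165 cycles / 2,000,000,000 cycles per second ≈ 54 microseconds, in line with the ~50 microsecond figure above, while the steady-state runs of roughly 4300 cycles work out to about 2 microseconds each.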

It could be some sort of lazy initialization.
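
One concrete form of lazy initialization on many stdio implementations is that the first write allocates the stream's buffer. A speculative way to check whether that's where the time goes is to hand stdout a buffer yourself with setvbuf, which must be called before any other operation on the stream, and see whether counter[0] shrinks:

#include <stdio.h>

int main(void)
{
    static char buf[BUFSIZ];

    /* Must run before any other operation on stdout.  _IOLBF keeps the
       usual line-buffered behaviour of a terminal; the point is only to
       supply the buffer up front so the first printf has no buffer to
       allocate lazily. */
    if (setvbuf(stdout, buf, _IOLBF, sizeof buf) != 0)
        fprintf(stderr, "setvbuf failed\n");

    /* ... the timing loop from the question would go here ... */
    printf("Hello, world\n");
    return 0;
}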

In both hardware and software design, there's an overriding principle which suggests that the execution speed of something that's done a million times is far more important than the execution speed of something that's done once. A corollary of this is that if something is done a million times, the time required to do it the first time is far less important than the time required for the other 999,999. One of the biggest reasons computers are so much faster today than 25 years ago is that designers have focused on making repeated operations faster, even when doing so might slow down one-off operations.

As a simple example from a hardware perspective, consider two approaches to memory design: (1) there is a single memory store, and every operation takes sixty nanoseconds to complete; (2) there are several levels of cache; fetching a word held in the first level of cache takes one nanosecond, a word which isn't there but is held in the second level takes five, a word which isn't there but is in the third level takes ten, and one which isn't in any cache takes sixty. If all memory accesses were totally random, the first design would not only be simpler than the second, it would also perform better: under the second design, most accesses would cause the CPU to waste about ten nanoseconds searching the caches before going out and fetching the data from main memory. On the other hand, if 80% of memory accesses are satisfied by the first cache level, 16% by the second, and 3% by the third, so that only one in a hundred has to go out to main memory, then the average access time will be 2.5ns. That's twenty-four times as fast, on average, as the simpler memory system.
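
Spelling out that average: 0.80 × 1 ns + 0.16 × 5 ns + 0.03 × 10 ns + 0.01 × 60 ns = 0.8 + 0.8 + 0.3 + 0.6 = 2.5 ns, and 60 ns / 2.5 ns = 24, which is where the twenty-four-times figure comes from.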

Even if an entire program is pre-loaded from disk, the first time a routine like "printf" is run, neither it nor any data it requires is likely to be in any level of cache. Consequently, slow memory accesses will be required the first time it's run. On the other hand, once the code and much of its required data have been cached, future executions will be much faster. If a repeated execution of a piece of code occurs while it is still in the fastest cache, the speed difference can easily be an order of magnitude. Optimizing for the fast case will in many cases cause one-time execution of code to be much slower than it otherwise would be (to an even greater extent than suggested by the example above), but since many processors spend much of their time running little pieces of code millions or billions of times, the speedups obtained in those situations far outweigh any slow-down in the execution of routines that only run once.
