Question

Update:

This discussion went further than I expected, so I'm updating this with the code that I was actually working on when this question popped into my head. It was a decision between 8 and 16 lines of code to determine the winner of a tic-tac-toe game for my intro to C++ course.

note: this is designed to be on-level with the course.

note 2: the token is a char, either 'x', 'o', or ' '.

This is a question of optimization. If this is a repeat I apologize but I couldn't locate an answer elsewhere.

Basically, it came down to whether the following code would be better written as a loop:

    char CheckForWinner() {
        // Returns the token of the player that satisfies one of the winning requirements.
        if (Square[0][0] == Square[0][1] && Square[0][0] == Square[0][2]) { // If all three tokens in the first row are the same
            return Square[0][0]; // Return the token
        } else if (Square[1][0] == Square[1][1] && Square[1][0] == Square[1][2]) { // Check the next row
            return Square[1][0]; // Return the token
        } else if (Square[2][0] == Square[2][1] && Square[2][0] == Square[2][2]) {
            return Square[2][0];
        } else if (Square[0][0] == Square[1][0] && Square[0][0] == Square[2][0]) { // If no rows satisfy the conditions, check columns
            return Square[0][0]; // Return the token
        } else if (Square[0][1] == Square[1][1] && Square[0][1] == Square[2][1]) {
            return Square[0][1];
        } else if (Square[0][2] == Square[1][2] && Square[0][2] == Square[2][2]) {
            return Square[0][2];
        } else if (Square[0][0] == Square[1][1] && Square[0][0] == Square[2][2]) { // Finally, check diagonals
            return Square[0][0];
        } else if (Square[0][2] == Square[1][1] && Square[0][2] == Square[2][0]) {
            return Square[0][2];
        }

        return ' ';
    }
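For comparison, here is a sketch of what the looped version might look like. It assumes the same 3x3 `Square` array and folds the row, column, and diagonal checks into a table of coordinate triples; unlike the chain above, it also explicitly skips lines of blanks, which comes to the same result:

```cpp
// Same board shape as in the question; blanks are ' '.
char Square[3][3] = {{' ', ' ', ' '}, {' ', ' ', ' '}, {' ', ' ', ' '}};

// Same check as the if/else chain, driven by a table of the eight
// winning lines (each line is three {row, col} coordinates).
char CheckForWinnerLooped() {
    static const int lines[8][3][2] = {
        {{0,0},{0,1},{0,2}}, {{1,0},{1,1},{1,2}}, {{2,0},{2,1},{2,2}}, // rows
        {{0,0},{1,0},{2,0}}, {{0,1},{1,1},{2,1}}, {{0,2},{1,2},{2,2}}, // columns
        {{0,0},{1,1},{2,2}}, {{0,2},{1,1},{2,0}}                       // diagonals
    };
    for (const auto& line : lines) {
        char a = Square[line[0][0]][line[0][1]];
        char b = Square[line[1][0]][line[1][1]];
        char c = Square[line[2][0]][line[2][1]];
        if (a != ' ' && a == b && a == c)
            return a; // this player owns a full line
    }
    return ' '; // no winner yet
}
```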

Is this more or less taxing on a system than simply typing 100 cout lines?

I'm curious because it seems like not only do we perform 100 cout calls, but we also allocate a new variable in memory and force the computer to process 100 math operations as well as output the data.

I can understand that a compiler may provide some level of optimization, but I'd be interested to know on a more general level. Primarily, I compile using Visual Studio 2012 or MinGW (g++).


Solution 2

What you're talking about is called loop unwinding. The performance trade-offs are complex and depend on many aspects of both the compiler and the execution environment. See the Wikipedia article on loop unwinding for a discussion of the issues.

OTHER TIPS

There is no single answer about whether unrolling all 100 iterations of the loop would be effective.

For a "smaller" system with no code cache, chances are pretty good that unrolling all 100 iterations would be optimal, at least in terms of execution speed. On the other hand, a system small enough that its CPU doesn't have a cache will typically be constrained enough in other resources that doing so would be highly inadvisable.

If the system does have a cache, chances are quite good that unrolling all 100 iterations of the loop would tend to cause slower execution. The overhead of the loop itself almost certainly takes less time than re-fetching essentially identical code 100 times over.

In a typical case, loop unrolling is most effective when a few iterations of the loop are unrolled (but typically fewer than 100 iterations). In a typical case you'd see a broad plateau around 4 to 16 iterations being unrolled.

As is typical of many taking a first stab at optimization, however, I'd guess you're really looking in entirely the wrong direction. If you want to optimize that loop, chances are that (by far) the biggest gain will come from making a slight change to what you do in the loop. I'd be willing to bet that any improvement you get from unrolling the loop will be too small to measure dependably, not to mention actually notice (even if you increase the number of iterations from 100 to, say, a few million).

On the other hand, if you rewrite the loop to eliminate the unnecessary buffer flush every iteration:

for ( int i = 1; i <= 100; i++ ) 
    cout << i << "\n";

[In case you didn't realize it: std::endl inserts a new-line into a stream and flushes the stream. In most cases (probably including this one) the buffer flush is unnecessary and probably inadvisable. Removing it can improve speed a lot--improvement by a factor of 8:1 or 10:1 is fairly common.]

Chances are that it won't take much to measure the difference in speed at all. There's a pretty fair chance you'll be able to measure it at 100 iterations, and if you try more iterations, the difference is likely to become almost painfully obvious.
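To be clear about why dropping the flush is safe: writing with "\n" and letting the stream flush once when it is destroyed produces exactly the same output as flushing on every line with std::endl; only the number of flushes changes. A minimal sketch (the file name is illustrative):

```cpp
#include <fstream>
#include <string>

// Write 1..100 separated by "\n"; the ofstream flushes once when it is
// destroyed, instead of 100 times as `<< std::endl` would force.
int write_numbers(const char* path) {
    {
        std::ofstream out(path);
        for (int i = 1; i <= 100; i++)
            out << i << "\n";
    } // destructor flushes the buffer exactly once
    // Read the file back and count lines, to show no output was lost.
    std::ifstream in(path);
    std::string line;
    int count = 0;
    while (std::getline(in, line))
        ++count;
    return count;
}
```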

When you're dealing with a loop that's not I/O bound, and not open to obvious, massive improvement like this one, loop unrolling is likely to become a more attractive option. In that case, first be aware that most compilers can do loop unrolling automatically, so doing it in the source code is unlikely to help much unless it opens up opportunities for other optimizations. For example, if you have a loop that really does one thing on even iterations and another on odd iterations, unrolling those two iterations can eliminate the condition and the jumps completely; doing that by hand may provide a meaningful improvement, since the compiler may not "notice" the odd/even pattern on its own.

Also note that a modern CPU can (and typically will) execute code in parallel, and execute code speculatively, which can eliminate most of the overhead of a loop. Since the branch of the loop will nearly always be taken (i.e., in all but the last iteration) the CPU's branch predictor will predict it as taken, so the CPU may have several iterations worth of instructions "in flight" simultaneously, even when you don't unroll the loop. Most of the code for the loop itself (e.g., incrementing i) can be executed in parallel with at least some other code in the loop, so the overhead of the loop is likely to be quite minimal anyway.

Edit 2: Looking at the specific question at hand, I think I'd do the job rather differently. Instead of storing the TTT board as a 2D array, I'd store it as a pair of bitmaps, one for X's and the other for O's. This lets you test for an entire winning combination in a single action instead of three separate comparisons. Since each row is 3 bits, it's probably easiest to use octal for the constants:

static const std::array<short, 8> winners = {
    /* rows */      0007, 0070, 0700, 
    /* columns */   0111, 0222, 0444, 
    /* diagonals */ 0124, 0421
};

In this case, I almost certainly would use loops:

char CheckForWinner(short X, short O) {
    // `winners` definition from above goes here.

    for (std::size_t i = 0; i < winners.size(); i++) {
        // The parentheses matter: == binds tighter than &.
        if ((X & winners[i]) == winners[i])
            return 'X';
        if ((O & winners[i]) == winners[i])
            return 'O';
    }
    return ' ';
}
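A self-contained sketch of how a conventional 2D board might be packed into the two bitmaps (bit 3*row+col set when the square holds the given token), using the winners table and check from above:

```cpp
#include <array>

static const std::array<short, 8> winners = {
    /* rows */      0007, 0070, 0700,
    /* columns */   0111, 0222, 0444,
    /* diagonals */ 0124, 0421
};

// Pack one player's pieces into a 9-bit mask: bit (3*row + col) is set
// when that square holds `token`.
short PackBoard(const char board[3][3], char token) {
    short bits = 0;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            if (board[r][c] == token)
                bits |= short(1 << (3 * r + c));
    return bits;
}

// Note the parentheses around the & expressions: == binds tighter than &.
char CheckForWinner(short X, short O) {
    for (short w : winners) {
        if ((X & w) == w) return 'X';
        if ((O & w) == w) return 'O';
    }
    return ' ';
}
```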

The big question here would be whether you really want to pass the X and O boards separately, or whether it makes more sense to pass an array of two shorts. The obvious advantage of using an array would be easier access to the opposite board. For example, to test whether a move is allowed in one board, you'd check whether that bit is set in the other board. With the boards stored in an array, you can be passed an index n indicating the board where you want to make a move, and use 1-n to get the other board, where you'll check whether that bit is already set.
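A sketch of the array-of-two-boards variant (the names are illustrative): board n belongs to the player about to move, and 1-n indexes the opponent without any branching. Here the legality check looks at both masks, since neither side may occupy the square:

```cpp
// boards[0] holds one player's bits, boards[1] the other's.
// A move on square (row, col) by the player owning board `n` is legal
// only if neither side already occupies that bit; 1 - n picks out the
// opponent's board.
bool TryMove(short boards[2], int n, int row, int col) {
    short bit = short(1 << (3 * row + col));
    if ((boards[n] | boards[1 - n]) & bit)
        return false;           // square already taken
    boards[n] |= bit;           // claim the square
    return true;
}
```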

By encoding which positions are part of which lines, you can perform the win check very efficiently:

char square[3][3] = {{' ', ' ', ' '}, {' ', ' ', ' '}, {' ', ' ', ' '}};
char player = 'x';
unsigned progress[2] = {0, 0};

const unsigned lines[3][3] = {
    { 0x10010010, 0x10001000, 0x10000101 },  // top row
    { 0x01010000, 0x01001011, 0x01000100 },  // middle row
    { 0x00110001, 0x00101000, 0x00100110 }   // bottom row
};

The encoding is "top row, middle row, bottom row, left column, middle column, right column, downward diagonal, upward diagonal".

For example, the top-left position is part of the top row, the left column and the downward diagonal.

As soon as you have 3 pieces in the same line, the line is full and you win, so just keep adding the lines until you hit 3. You can recognize a 3 by two consecutive 1 bits, so p & (p >> 1) will be non-zero:

void make_move(int y, int x)
{
    square[y][x] = player;
    unsigned p = (progress[player & 1] += lines[y][x]);
    if (p & (p >> 1))
    {
        printf("player %c has won!\n", player);
        exit(0);
    }
    else
    {
        player = 'x' + 'o' - player;
    }
}
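The detection trick can be checked in isolation (a minimal sketch): each line gets its own hex nibble, a nibble never exceeds 3 since a line has only three squares, and only a nibble reaching 3 (binary 011) produces two adjacent 1 bits, so nibbles can't interfere with their neighbours:

```cpp
// A nibble reaching 3 (binary 011) means three pieces on that line;
// nibble values stay in 0..3, so bits never spill into the next nibble.
bool has_win(unsigned p) {
    return (p & (p >> 1)) != 0;
}
```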

When thinking about loop unwinding, it is necessary to estimate the ratio between the cost of the loop body and the loop-bookkeeping overhead.

It is true that even the simplest for loop adds a few instructions of overhead. But in your case, the cost of the I/O call will outweigh those instructions by a factor of 10-100.

Unwinding makes sense when the body of the loop does some manipulation in memory that requires several, maybe a dozen, asm instructions. For example:

// Process digits starting from the last one.
wchar_t carry_bit = 0;
while (curr_digit_offs >= 0)
{
    wchar_t ch = fpb[curr_digit_offs];
    fpb[curr_digit_offs--] = g_RawScan_MultiplyBy2[ch & 15] + carry_bit;
    carry_bit = (ch >= L'5') ? TRUE : FALSE;
}

In the example above, the body of the loop does not call any external function; it only works with data structures in memory, which means its cost can be estimated.

In every particular case, a separate estimation is needed.
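For a concrete picture of what manual unwinding looks like when the body is cheap memory work (a sketch; a modern compiler may well do this transformation on its own):

```cpp
#include <cstddef>

// Sum an array with the loop body unrolled four times: one quarter of
// the increment/compare/branch overhead, plus a cleanup loop for the
// remaining 0-3 elements.
long sum_unrolled(const int* a, std::size_t n) {
    long total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; ++i)   // leftover elements
        total += a[i];
    return total;
}
```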

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow