Question

Below is a question from a practice test. The table actually has all the solutions filled in; however, I need clarification on why the solutions are what they are. (The full question is quoted further down.)

For example, I would really like to understand the solution rows for A2 and A3.

As I see it, you have the following situation going on in A2:

  1. x * y
  2. xy * r
  3. xyr * z

Now, let's look at how that'd be in the pipeline:

Multiplication   Cycles
x * y             1 -  4
xy * r            5 -  8
xyr * z           9 - 12
// next iteration, which means different x, y and z's
x2 * y2          10 - 13
x2y2 * r         14 - 17   // this is dependent on both the previous r and x2y2
x2y2r * z        18 - 21

So we are able to overlap xyr * z and x2 * y2, because there is no dependency conflict between them. However, that only gets rid of 3 cycles, right?

So it would still be (12 - 3) / 3 = 9 / 3 = 3 cycles per element (three elements per iteration). So how are they getting 8/3 CPE for A2?
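To make the timing model I'm assuming explicit, here is a small C sketch (my own illustration, not part of the test). It issues the six multiplications in program order, at most one per cycle, each taking the 4-cycle latency from the problem statement, and it reproduces the cycle ranges listed above:

#include <stdio.h>

#define LATENCY 4   /* integer multiply latency from the problem statement */
#define NOPS    6

/* The six multiplications of two unrolled A2 iterations, in program order.
 * dep1/dep2 hold the index of the producing multiplication, or -1 if the
 * operand (the initial r, or a loaded array element) is already available. */
static const char *name[NOPS] = {
    "x*y", "xy*r", "xyr*z", "x2*y2", "x2y2*r", "x2y2r*z"
};
static const int dep1[NOPS] = { -1, 0, 1, -1, 2, 4 };
static const int dep2[NOPS] = { -1, -1, -1, -1, 3, -1 };

int main(void) {
    int finish[NOPS];
    int last_issue = 0;               /* cycle of the most recently issued multiply */

    for (int i = 0; i < NOPS; i++) {
        int ready = 1;                /* no producer: operands ready at cycle 1 */
        if (dep1[i] >= 0 && finish[dep1[i]] + 1 > ready) ready = finish[dep1[i]] + 1;
        if (dep2[i] >= 0 && finish[dep2[i]] + 1 > ready) ready = finish[dep2[i]] + 1;

        /* in-order issue, at most one new multiply per cycle */
        int start = ready > last_issue + 1 ? ready : last_issue + 1;
        finish[i] = start + LATENCY - 1;
        last_issue = start;

        printf("%-8s  start %2d  finish %2d\n", name[i], start, finish[i]);
    }
    return 0;
}

Under that model the second iteration's last multiply finishes at cycle 21, 9 cycles after the first iteration's finishes at cycle 12, which is where my 9 / 3 = 3 CPE figure comes from.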

Any help understanding this concept will be greatly appreciated! There's not a big rush, as the test isn't til next week. If there is any other information you need, please let me know!


(Below is the full test question text, along with the table completely filled in with the solutions)

Consider the following function for computing the product of an array of n integers.

We have unrolled the loop by a factor of 3.

int prod(int a[], int n) {
    int i, x, y, z;
    int r = 1;

    for (i = 0; i < n-2; i += 3) {
        x = a[i]; y = a[i+1]; z = a[i+2];
        r = r * x * y * z; // Product computation
    }
    for (; i < n; i++)
        r *= a[i];

    return r;
}

For the line labeled Product computation, we can use parentheses to create five different associations of the computation, as follows:

r = ((r * x) * y) * z; // A1
r = (r * (x * y)) * z; // A2
r = r * ((x * y) * z); // A3
r = r * (x * (y * z)); // A4
r = (r * x) * (y * z); // A5

We express the performance of the function in terms of the number of cycles per element (CPE). As described in the book, this measure assumes the run time, measured in clock cycles, for an array of length n is a function of the form Cn + K, where C is the CPE.

We measured the five versions of the function on an Intel Pentium III. Recall that the integer multiplication operation on this machine has a latency of 4 cycles and an issue time of 1 cycle.

The following table shows some values of the CPE, with other values missing. The measured CPE values are those that were actually observed. “Theoretical CPE” means the performance that would be achieved if the only limiting factor were the latency and issue time of the integer multiplier.

[Table image: measured and theoretical CPE values for versions A1-A5, not reproduced here]

Fill in the missing entries. For the missing values of the measured CPE, you can use the values from other versions that would have the same computational behavior. For the values of the theoretical CPE, you can determine the number of cycles that would be required for an iteration considering only the latency and issue time of the multiplier, and then divide by 3.


Solution

Without knowing the CPU architecture, we can only guess.

My interpretation would be that the timing diagram only shows part of the pipeline, from gathering the operands to writing the result, because this is what is relevant to dependency resolution.

Now, the big if: If there is a buffer stage between the dependency resolver and the execution units, it would be possible to start the third multiplication of the first group (3) and the first multiplication of the second group (4) both at offset 8.

As 3 depends on 2, it does not make sense to use a different unit here, so 3 is queued to unit 1 right after 2. The following instruction, 4, does not depend on a previous result, so it can be queued to unit 2 and started in parallel.

In theory, this could happen as early as cycle 6, giving a CPE of 6/3. In practice, that is dependent on the CPU design.
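As a rough illustration of the general point (my own sketch, not a model of any particular Intel pipeline): each cycle, the scheduler below hands the oldest multiplications whose operands are ready to at most UNITS execution units. The 4-cycle latency comes from the problem statement, while UNITS = 2 is the assumption from the reasoning above; both can be varied.

#include <stdio.h>

#define LATENCY 4   /* multiply latency from the problem statement */
#define UNITS   2   /* execution units assumed in the reasoning above */
#define NOPS    6

/* Multiplications 1..6: two unrolled iterations of association A2.
 * deps[i][*] hold indices of producing multiplications; -1 means the
 * operand (the initial r, or a loaded array element) is already available. */
static const char *name[NOPS] = {
    "1: x*y", "2: xy*r", "3: xyr*z", "4: x2*y2", "5: x2y2*r", "6: x2y2r*z"
};
static const int deps[NOPS][2] = {
    { -1, -1 }, { 0, -1 }, { 1, -1 }, { -1, -1 }, { 2, 3 }, { 4, -1 }
};

int main(void) {
    int start[NOPS], finish[NOPS], done = 0;
    for (int i = 0; i < NOPS; i++) start[i] = 0;     /* 0 = not yet issued */

    /* Cycle by cycle: issue up to UNITS ready multiplications, oldest first
     * (a crude stand-in for a buffer between dependency resolution and the units). */
    for (int cycle = 1; done < NOPS; cycle++) {
        int issued = 0;
        for (int i = 0; i < NOPS && issued < UNITS; i++) {
            if (start[i]) continue;                  /* already issued */
            int ready = 1;
            for (int d = 0; d < 2; d++) {
                int p = deps[i][d];
                if (p >= 0 && (!start[p] || finish[p] >= cycle))
                    ready = 0;                       /* a producer has not finished yet */
            }
            if (!ready) continue;
            start[i]  = cycle;
            finish[i] = cycle + LATENCY - 1;
            issued++;
            done++;
        }
    }

    for (int i = 0; i < NOPS; i++)
        printf("%-12s  start %2d  finish %2d\n", name[i], start[i], finish[i]);
    return 0;
}

With these parameters, multiplication 4 (x2 * y2) issues in cycle 1 alongside multiplication 1 instead of waiting behind the first group, and the final results of the two iterations come out 8 cycles apart, which matches the 8/3 CPE mentioned in the question. If the timing diagram only covers part of the multiplier latency, as suggested above, that spacing shrinks accordingly.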
