How is Loop skewing making loop parallelizable?

Question

I have no credit for this, I just formatted it for you and copied it from another source, and I hope it help you

[source: ECE 1754, Survey of Loop Transformation Techniques, Eric LaForest, March 19, 2010]

It is all about the distance between a two executive iterations, in the first the distance is 1 between one outer loop and inner loop, so there is a dependency between them.

Loop skewing does exactly what it says: it skews the execution of an inner loop relative to an outer one. This is useful if the inner loop has a dependence on the outer loop which prevents it from running in parallel. For example, the following code has a dependency vector of {(1, 0),(0, 1)} .Neither loop can be parallelized since they each carry a dependency. Simply interchanging the loops would merely interchange the indices holding the dependencies, accomplishing nothing.

do i = 2, n-1
do j = 2, m-1
a[i,j] =
      (a[i-1,j] + a[i,j-1] + a[i+1,j] + a[i,j+1]) / 4
end do
end do

Loop skewing is implemented by adding the index of the outer loop, times some skewing factor f, to the bounds of the inner loop and subtracting the same value from all the uses of the inner loop index. The subtraction keeps the indices within the new loop bounds, preserving the correctness of the program. The effect on the inner loop iterations is to shift their position in the array forwards by f relative to the current outer loop, increasing the dependency distance to the outer loop in the same manner. In other words, given a dependency vector (a, b), skewing transforms it to (a, f a + b). Since this transformation preserves the lexicographic order of the dependencies, it is always legal. Applying a skew factor of one to the above inner loop yields the following code:

do i = 2, n-1
do j = 2+i, m-1+i
a[i,j-i] =
(a[i-1,j-i] + a[i,j-1-i] + a[i+1,j-i] + a[i,j+1-i]) / 4
end do
end do

This new code executes in the same manner, but with dependencies of {(1, 1),(0, 1)}. Both loops still carry a dependency. However, interchanging the loops at this point yields a dependence vector {(1, 0),(1, 1)}, as shown in the following code:

do j = 4, m+n-2
do i = max(2, j-m+1), min(n-1, j-2)
a[i,j-i] =
(a[i-1,j-i] + a[i,j-1-i] + a[i+1,j-i] + a[i,j+1-i]) / 4
end do
end do

The inner loop can now be parallelized since it has now no loop-carried dependency on j, and the dependency to i is carried by the outer loop.Note that interchanging skewed loop bounds is no longer straightforward: each loop must take into account the upper and lower bounds of the other loop.