Why is univariate Horner in Fortran faster than its NumPy counterpart while bivariate Horner is not?

Question 1

Following the other suggestions, using p=np.asfortranarray(p) before the timer indeed puts the performance on par with numpy when I tested it. I extended the range for the bivariate bench to n_bi = np.array([2**i for i in xrange(1, 15)]), so that the p matrix would be larger than my L3 cache size.

To further optimize this, I don't think automatic compiler options will be much help, since the inner loop has a loop-carried dependency. Only if you unroll it manually does ifort vectorize the innermost loop. With gfortran, -O3 and -ffast-math were needed. For matrix sizes limited by main memory bandwidth, this increases the performance benefit over numpy from a factor of 1 to a factor of 3.

Update: after applying this also to the univariate code and compiling with f2py --opt='-O3 -ffast-math' -c -m polynomial polynomial.f90, I get the following for the source and results for benchmark.py:

subroutine polyval(p, x, pval, nx)

implicit none

real*8, dimension(nx), intent(in) :: p
real*8, intent(in) :: x
real*8, intent(out) :: pval
integer, intent(in) :: nx

integer, parameter :: simd = 8
real*8 :: tmp(simd), vecx(simd), xfactor
integer :: i, j, k

! precompute factors
do i = 1, simd
    vecx(i)=x**(i-1)
end do
xfactor = x**simd

tmp = 0.0d0
do i = 1, nx, simd
    do k = 1, simd
        tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
    end do
end do
pval = sum(tmp)


end subroutine polyval

subroutine polyval2(p, x, y, pval, nx, ny)

implicit none

real*8, dimension(nx, ny), intent(in) :: p
real*8, intent(in) :: x, y
real*8, intent(out) :: pval
integer, intent(in) :: nx, ny

integer, parameter :: simd = 8
real*8 :: tmp(simd), vecx(simd), xfactor
integer :: i, j, k

! precompute factors
do i = 1, simd
    vecx(i)=x**(i-1)
end do
xfactor = x**simd

! horner
pval=0.0d0
do i = 1, ny
    tmp = 0.0d0
    do j = 1, nx, simd
        ! inner vectorizable loop
        do k = 1, simd
            tmp(k) = tmp(k)*xfactor + p(nx-(j+k-1)+1,ny-i+1)*vecx(simd-k+1)
        end do
    end do
    pval = pval*y + sum(tmp)
end do

end subroutine polyval2

Update 2: As pointed out, this code is not correct, at least when the sizes are not divisible by simd. It only shows the concept of manually helping the compiler, so don't use it as is. If the sizes are not multiples of simd, a small remainder loop has to take care of the dangling indices. This is not difficult; here is the correct procedure for the univariate case, and it should be straightforward to extend it to the bivariate case:

subroutine polyval(p, x, pval, nx)
implicit none

real*8, dimension(nx), intent(in) :: p
real*8, intent(in) :: x
real*8, intent(out) :: pval
integer, intent(in) :: nx

integer, parameter :: simd = 4
real*8 :: tmp(simd), vecx(simd), xfactor
integer :: i, j, k, nr

! precompute factors
do i = 1, simd
    vecx(i)=x**(i-1)
end do
xfactor = x**simd

! check remainder
nr = mod(nx, simd)

! horner
tmp = 0.0d0
do i = 1, nx-nr, simd
    do k = 1, simd
        tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
    end do
end do
pval = sum(tmp)

! do remainder
pval = pval * x**nr
do i = 1, nr
    pval = pval + p(i) * vecx(i)
end do
end subroutine polyval
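To make the lane-splitting idea easier to check, here is a minimal NumPy sketch of the same scheme: simd independent Horner recurrences in x**simd, followed by the remainder step. The function name polyval_lanes is hypothetical, and the sketch assumes the coefficient order of numpy.polynomial.polynomial.polyval (p[i] multiplies x**i). It is meant as a reference for correctness, not for speed.

```python
import numpy as np
from numpy.polynomial import polynomial as PP


def polyval_lanes(p, x, simd=4):
    """Evaluate sum(p[i] * x**i) using `simd` independent Horner lanes.

    Hypothetical helper mirroring the Fortran routine above.
    """
    p = np.asarray(p, dtype=float)
    nx = p.size
    nr = nx % simd                    # leftover low-order coefficients
    vecx = x ** np.arange(simd)       # x**0 ... x**(simd-1)
    xfactor = x ** simd

    tmp = np.zeros(simd)
    # Walk the full blocks from the highest-order coefficients downwards.
    for start in range(nx - simd, nr - 1, -simd):
        tmp = tmp * xfactor + p[start:start + simd] * vecx

    # The blocked part covers p[nr:], so shift it up by x**nr ...
    pval = tmp.sum() * x ** nr
    # ... and add the nr low-order terms that did not fill a block.
    pval += np.dot(p[:nr], vecx[:nr])
    return pval


# Compare against NumPy for sizes that are and are not multiples of simd.
rng = np.random.default_rng(0)
for n in (1, 3, 4, 7, 8, 33):
    coeffs = rng.random(n)
    assert np.isclose(polyval_lanes(coeffs, 0.9), PP.polyval(0.9, coeffs))
```

Each lane only depends on its own previous value, which is what removes the loop-carried dependency between neighbouring coefficients and lets a compiler vectorize the inner loop.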

[Figure: univariate benchmark]

[Figure: bivariate benchmark]

One should also be careful with very small sizes, as the run time will be too short to give an accurate performance profile. Relative times with respect to numpy can also be deceiving, as the absolute time with numpy could be very bad. So below are timings for the largest case:

For univariate with nx=2^20, the time is 1.21 s for numpy and 1.69e-3 s for the custom Fortran version. For bivariate with nx*ny=2^20, the time is 8e-3 s for numpy and 1.68e-3 s for the custom version. The fact that the univariate and bivariate times are the same when the total nx*ny size is the same is very important, as it supports the claim that the code is performing near the memory bandwidth limit.
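As a rough sanity check of the bandwidth argument, the effective read rate can be estimated from the quoted timing. This is only a back-of-the-envelope sketch: it assumes the largest case holds 2**20 double-precision coefficients and that each coefficient is read from memory exactly once.

```python
# Assumption: the largest case has 2**20 coefficients of 8 bytes each (real*8).
n_coeffs = 2**20
bytes_per_coeff = 8
t_fortran = 1.69e-3  # seconds, univariate Fortran timing quoted above

bytes_read = n_coeffs * bytes_per_coeff
bandwidth_gb_s = bytes_read / t_fortran / 1e9

print(f"{bytes_read / 2**20:.0f} MiB read, ~{bandwidth_gb_s:.1f} GB/s effective")
# prints "8 MiB read, ~5.0 GB/s effective"
```

Whether that figure is close to the limit depends on the machine, so it should be compared with a measured bandwidth for the system in question.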

Update 3: with the new python script for smaller sizes, and simd=4 I get the following performance:

[Figure: performance for smaller sizes with simd=4]

Update 4: As for correctness, the results are the same within double precision accuracy, which you can see if you run this python code for the univariate example:

import polynomial as P
import numpy.polynomial.polynomial as PP

import numpy as np

for n in xrange(2,100):
    poly1n = np.random.rand(n)
    poly1f = np.asfortranarray(poly1n)

    x = 2

    print "%18.14e" % P.polyval(poly1f, x)
    print "%18.14e" % PP.polyval(x, poly1n)
    print (P.polyval(poly1f, x) - PP.polyval(x, poly1n))/PP.polyval(x,poly1n), '\n'

Question 2

In the bivariate case, p is a two-dimensional array, so the memory layout matters: C (row-major) and Fortran (column-major) ordering differ. By default numpy functions return arrays in C order, whereas Fortran routines expect Fortran order.

f2py is smart enough to deal with this, and automatically converts between C and fortran format arrays. However, this results in some overhead, which is one of the possible reasons for reduced performance. You can check if this is the cause by manually converting p to fortran type using numpy.asfortranarray outside your timing routine. Of course, for this to be meaningful, in your real use case you want to make sure that your input arrays are in fortran order.
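A small sketch of the check described above, using only NumPy: it shows the default layout, converts once outside any timed region, and confirms that the conversion really is a copy. The array size is arbitrary.

```python
import numpy as np

# Coefficient matrix as NumPy creates it by default: C (row-major) order.
p_c = np.random.rand(512, 512)
print(p_c.flags['C_CONTIGUOUS'], p_c.flags['F_CONTIGUOUS'])

# Convert once, before starting the timer; this makes a column-major copy.
p_f = np.asfortranarray(p_c)
print(p_f.flags['C_CONTIGUOUS'], p_f.flags['F_CONTIGUOUS'])

# Same values, different memory layout, separate buffers.
assert np.array_equal(p_c, p_f)
assert not np.shares_memory(p_c, p_f)

# An array that is already in Fortran order is not copied again.
assert np.shares_memory(np.asfortranarray(p_f), p_f)
```

Passing p_f to the wrapped routine avoids the hidden conversion, so the timing then measures only the Fortran code.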

f2py has an option, -DF2PY_REPORT_ON_ARRAY_COPY (given with a size threshold, for example -DF2PY_REPORT_ON_ARRAY_COPY=1), which reports whenever an array is copied.

If this is not the cause, then you need to consider more in-depth details, such as which Fortran compiler you are using and what sort of optimisations it is applying. Examples of things that could slow you down include allocation of arrays on the heap instead of the stack (with expensive calls to malloc), although I would expect such effects to become less significant for larger arrays.

Finally, you should consider the possibility that, for bivariate evaluation with large N, the numpy routines are already essentially at optimum efficiency. In such cases, the numpy routine may be spending most of its time running optimised C routines, and the overhead of the Python code becomes negligible in comparison. If so, you would not expect your Fortran code to show any significant speedup.

Question 3

I would guess that your tmp array is getting too large to stay in the fastest cache level, so that accesses have to go to L2, L3 or even main memory. It might be better to break these loops up and process only chunks of them at once (strip-mining).
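For readers unfamiliar with the term, strip-mining only changes the loop structure: one long loop becomes an outer loop over blocks and an inner loop within a block. The sketch below shows this for a plain Horner loop. The function names and the block size of 256 are hypothetical, and in Python this illustrates the transformation only, not any cache benefit.

```python
def horner(p, x):
    """Plain Horner loop over coefficients p[0] + p[1]*x + ..."""
    acc = 0.0
    for i in range(len(p) - 1, -1, -1):
        acc = acc * x + p[i]
    return acc


def horner_strip_mined(p, x, block=256):
    """Same arithmetic, with the loop split into fixed-size strips."""
    acc = 0.0
    # Outer loop: one iteration per strip, highest coefficients first.
    for stop in range(len(p), 0, -block):
        start = max(stop - block, 0)
        # Inner loop: walk a single strip that is small enough to stay in cache.
        for i in range(stop - 1, start - 1, -1):
            acc = acc * x + p[i]
    return acc


coeffs = [0.5 * k for k in range(1000)]
assert horner(coeffs, 0.99) == horner_strip_mined(coeffs, 0.99)
```

Because the operations happen in the same order, both versions return bit-identical results; only the loop nest differs.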

Question 4

Your function is very short, so you would get better results by inlining polyval. You can also avoid the index calculations by simply reversing the direction of the loops:

subroutine polyval2(p, x, y, pval, nx, ny)

    implicit none

    real(8), dimension(nx, ny), intent(in), target :: p
    real(8), intent(in) :: x, y
    real(8), intent(out) :: pval
    integer, intent(in) :: nx, ny
    real(8) :: tmp
    integer :: i, ii

    ! Horner accumulators start at zero; loops count down with an explicit step of -1
    pval = 0.d0
    do i = ny, 1, -1
        tmp = 0.d0
        do ii = nx, 1, -1
            tmp = tmp*x + p(ii,i)
        end do
        pval = pval*y + tmp
    end do

end subroutine polyval2
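A pure-Python reference for the nested Horner scheme can be handy for checking a Fortran version. The sketch below assumes the convention of numpy.polynomial.polynomial.polyval2d, where p[i, j] multiplies x**i * y**j; both accumulators start at zero and both loops count down from the highest-order coefficient. The function name polyval2_horner is hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as PP


def polyval2_horner(p, x, y):
    """Nested Horner evaluation of sum(p[i, j] * x**i * y**j)."""
    nx, ny = p.shape
    pval = 0.0
    for j in range(ny - 1, -1, -1):        # outer Horner step in y
        tmp = 0.0
        for i in range(nx - 1, -1, -1):    # inner Horner step in x
            tmp = tmp * x + p[i, j]
        pval = pval * y + tmp
    return pval


rng = np.random.default_rng(0)
p = rng.random((5, 7))
assert np.isclose(polyval2_horner(p, 0.3, 1.2), PP.polyval2d(0.3, 1.2, p))
```

The inner loop walks the first index, which is the contiguous one for a Fortran-ordered array, so the same loop order in Fortran reads memory sequentially.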

With this code I got ~10% shorter execution time for large arrays compared to the original code you posted. (I tested a pure Fortran program with your code Nx=Ny=1000, gfortran -O3 -funroll-loops)

I agree with haraldkl: the sharp drop in performance when the dimensions get too large is very typical of cache/memory access effects. Strip-mining helps, but I would not encourage doing it by hand. Use compiler flags instead: -floop-strip-mine for gfortran, while for ifort it is included in -O3. Also, try loop unrolling: -funroll-loops for gfortran and ifort.

You can specify those flags with f2py -c --f90flags="...".