Following the other suggestions, using p=np.asfortranarray(p)
before the timer indeed puts the performance on par with numpy when I tested it. I extended the range for the bivariate bench to n_bi = np.array([2**i for i in xrange(1, 15)])
, so that the p matrix would be larger than my L3 cache size.
To further optimize this, I don't think automatic compiler options will be much help, since the inner loop has a dependency. Only if you manually unroll it, does ifort
vectorize the innermost loop. With gfortran
, -O3
and -ffast-math
were needed. For matrix sizes limited by main memory bandwidth, this increase the performance benefit over numpy from a factor of 1 to 3.
Update: after applying this also to the univariate code and compiling with f2py --opt='-O3 -ffast-math' -c -m polynomial polynomial.f90
, I get the following for the source and results for benchmark.py:
subroutine polyval(p, x, pval, nx)
implicit none
real*8, dimension(nx), intent(in) :: p
real*8, intent(in) :: x
real*8, intent(out) :: pval
integer, intent(in) :: nx
integer, parameter :: simd = 8
real*8 :: tmp(simd), vecx(simd), xfactor
integer :: i, j, k
! precompute factors
do i = 1, simd
vecx(i)=x**(i-1)
end do
xfactor = x**simd
tmp = 0.0d0
do i = 1, nx, simd
do k = 1, simd
tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
end do
end do
pval = sum(tmp)
end subroutine polyval
subroutine polyval2(p, x, y, pval, nx, ny)
implicit none
real*8, dimension(nx, ny), intent(in) :: p
real*8, intent(in) :: x, y
real*8, intent(out) :: pval
integer, intent(in) :: nx, ny
integer, parameter :: simd = 8
real*8 :: tmp(simd), vecx(simd), xfactor
integer :: i, j, k
! precompute factors
do i = 1, simd
vecx(i)=x**(i-1)
end do
xfactor = x**simd
! horner
pval=0.0d0
do i = 1, ny
tmp = 0.0d0
do j = 1, nx, simd
! inner vectorizable loop
do k = 1, simd
tmp(k) = tmp(k)*xfactor + p(nx-(j+k-1)+1,ny-i+1)*vecx(simd-k+1)
end do
end do
pval = pval*y + sum(tmp)
end do
end subroutine polyval2
Update 2: As pointed out, this code is not correct, at least when sizes are not divisible by simd
. It's just showing the concept of manually helping the compiler, so don't just use it like this. If the sizes are not powers of two, a small remainder loop has to take care of the dangling indices. It's not so difficult to do this, here is the correct procedure for the univariate case, should be straightforward to extend it to bivariate:
subroutine polyval(p, x, pval, nx)
implicit none
real*8, dimension(nx), intent(in) :: p
real*8, intent(in) :: x
real*8, intent(out) :: pval
integer, intent(in) :: nx
integer, parameter :: simd = 4
real*8 :: tmp(simd), vecx(simd), xfactor
integer :: i, j, k, nr
! precompute factors
do i = 1, simd
vecx(i)=x**(i-1)
end do
xfactor = x**simd
! check remainder
nr = mod(nx, simd)
! horner
tmp = 0.0d0
do i = 1, nx-nr, simd
do k = 1, simd
tmp(k) = tmp(k)*xfactor + p(nx-(i+k-1)+1)*vecx(simd-k+1)
end do
end do
pval = sum(tmp)
! do remainder
pval = pval * x**nr
do i = 1, nr
pval = pval + p(i) * vecx(i)
end do
end subroutine polyval
Also, one should be careful with very small sizes, as the time will be too small to have an accurate performance profile. Also, relative times with respect to numpy
could be deceiving, as the absolute time with numpy could be very bad. So below are timings for the largest case:
For univariate with nx=220, time is 1.21 s for numpy, and 1.69e-3 s for the custom fortran version. For bivariate with nxny=220, time is 8e-3 s for numpy, and 1.68e-3 s for the custom version. The fact that the time for both univariate and bivariate is the same when the total nxny size is the same is very important, as it supports the fact that the code is performing near the memory bandwidth limit.
Update 3: with the new python script for smaller sizes, and simd=4
I get the following performance:
Update 4: As for correctness, the results are the same within double precision accuracy, which you can see if you run this python code for the univariate example:
import polynomial as P
import numpy.polynomial.polynomial as PP
import numpy as np
for n in xrange(2,100):
poly1n = np.random.rand(n)
poly1f = np.asfortranarray(poly1n)
x = 2
print "%18.14e" % P.polyval(poly1f, x)
print "%18.14e" % PP.polyval(x, poly1n)
print (P.polyval(poly1f, x) - PP.polyval(x, poly1n))/PP.polyval(x,poly1n), '\n'