Efficient implementation fo Faulhaber's Formula

Question 1

How about using Schultz' (1980) idea, outlined below the double series implementation (mathworld.wolfram.com/PowerSum.html) that you mentioned?

From Wolfram MathWorld:

Schultz (1980) showed that the sum S_p(n) can be found by writing

and solving the system of p+1 equations

enter image description here

obtained for j=0, 1, ..., p (Guo and Qi 1999), where delta (j,p) is the Kronecker delta.

Below is an attempt in Haskell that seems to work. It returns a result for n=10^16, p=1000 in about 36 seconds on my old laptop PC.

{-# OPTIONS_GHC -O2 #-}

import Math.Combinatorics.Exact.Binomial
import Data.Ratio
import Data.List (foldl')

kroneckerDelta a b | a == b    = 1 % 1
                   | otherwise = 0 % 1

g a b = ((-1)^(a - b +1) * choose a b) % 1

coefficients :: Integral a => a -> a -> [Ratio a] -> [Ratio a]
coefficients p j cs
  | j < 0     = cs
  | otherwise = coefficients p (j - 1) (c:cs)
 where
   c = f / g (j + 1) j
   f = foldl h (kroneckerDelta j p) (zip [j + 2..p + 1] cs)
   h accum (i,cf) = accum - g i j * cf

trim r = let n = numerator r
             d = denominator r
             l = div n d
         in (mod l (10^9 + 7),(n - d * l) % d)

s n p = numerator (a % 1 + b) where
 (a,b) = foldl' (\(i',r') (i,r) -> (mod (i' + i) (10^9 + 7),r' + r)) (0,0) 
      (zipWith (\c i ->  trim (c * n^i)) (coefficients p p []) [1..p + 1])

main = print (s (10^16) 1000)

Question 2

I've discovered my own algorithm to calculate the coefficients of the polynomial obtained from Faulhaber's formula; it, its proof and several implementations can be found at github.com/fcard/PolySum. This question inspired me to include a c++ implementation (using the GMP library for arbitrary precision numbers), which, as of the time of writing and minus several usability features, is:

#include <gmpxx.h>
#include <vector>

namespace polysum {
  typedef std::vector<mpq_class> mpq_row;
  typedef std::vector<mpq_class> mpq_column;
  typedef std::vector<mpq_row>   mpq_matrix;

  mpq_matrix make_matrix(size_t n) {
    mpq_matrix A(n+1, mpq_row(n+2, 0));
    A[0] = mpq_row(n+2, 1);

    for (size_t i = 1; i < n+1; i++) {
      for (size_t j = i; j < n+1; j++) {
        A[i][j] += A[i-1][j];
        A[i][j] *= (j - i + 2);
      }
      A[i][n+1] = A[i][n-1];
    }
    A[n][n+1] = A[n-1][n+1];
    return A;
  }

  void reduced_row_echelon(mpq_matrix& A) {
    size_t n = A.size() - 1;
    for (size_t i = n; i+1 > 0; i--) {
      A[i][n+1] /= A[i][i];
      A[i][i] = 1;
      for (size_t j = i-1; j+1 > 0; j--) {
        auto p = A[j][i];
        A[j][i] = 0;
        A[j][n+1] -= A[i][n+1] * p;
      }
    }
  }

  mpq_column sum_coefficients(size_t n) {
    auto A = make_matrix(n);
    reduced_row_echelon(A);

    mpq_column result;
    for (auto row: A) {
      result.push_back(row[n+1]);
    }
    return result;
  }
}

We can use the above like so:

#include <cmath>
#include <gmpxx.h>
#include <polysum.h>

mpq_class power_sum(size_t K, unsigned int N) {
  auto coeffs = polysum::sum_coefficients(K)

  mpq_class result(0);
  for (size_t i = 0; i <= K; i++) {
    result += A[i][n+1] * pow(N, i+1);
  }
  return result;
}

The full implementation provides a Polynomial class that is printable and callable, as well as a polysum function to construct one as a sum of another polynomial.

#include "polysum.h"

void power_sum_print(size_t K, unsigned int N) {
  auto F = polysum::polysum(K);
  std::cout << "polynomial: " << F;
  std::cout << "result: " << F(N);
}

As for efficiency, the above calculates the result for K=1000 and N=1e16 in about 1.75 seconds on my computer, compared to the much more mature and optimized SymPy implementation which takes about 90 seconds on the same machine, and mathematica which takes 30 seconds. For K=3000 the above takes about 4 minutes, mathematica took almost 20 minutes, (but uses much less memory) and I left sympy running all night but it didn't finish, maybe due to it running out of memory.

Among the optimizations that can be done here are making the matrix sparse and taking advantage of the fact that only half of the rows and columns need to be calculated. The Rust version in the linked repository implements the sparse and rows optimizations, and takes about 0.7 seconds to calculate K=1000, and about 45 to calculate K=3000 (using 105mb and 2.9gb of memory respectively). The Haskell version implements all three optimizations and takes about 1 second for K=1000 and about 34 seconds for K=3000. (using 60mb and 880mb of memory respectively) and The completely unoptimized python implementation takes about 12 seconds for K=1000 but runs out of memory for K=3000.

It's looking like this method is the fastest regardless of the language used, but the research is ongoing. Since Schultz's method also boils down to solving a system of n+1 equations and should be able to be optimized the same way, it will depend on whether his matrix is faster to calculate or not. Also, memory usage is not scaling well at all, and Mathematica is still the clear winner here, using only 80mb for K=3000. We'll see.