Cubic root function cbrt() in Visual Studio 2012

Question 1

This site explores several computational methods to compute cube root efficiently in C, and has some source code you can download.

(EDIT: A google search for "fast cube root" comes up with several more promising-looking hits.)

Cube roots are a topic of interest, because they're used in many common formulae and a fast cube root function isn't included with Microsoft Visual Studio.

In the absence of a special cube root function, a typical strategy is calculation via a power function (e.g., pow(x, 1.0/3.0)). This can be problematic in terms of speed and in terms of accuracy when negative numbers aren't handled properly.

His site has some benchmarks on the methods used. All of them are much faster than pow().

32-bit float tests
----------------------------------------
cbrt_5f      8.8 ms    5 mbp   6.223 abp
pow        144.5 ms   23 mbp  23.000 abp
halley x 1  31.8 ms   15 mbp  18.961 abp
halley x 2  59.0 ms   23 mbp  23.000 abp
newton x 1  23.4 ms   10 mbp  12.525 abp
newton x 2  48.9 ms   20 mbp  22.764 abp
newton x 3  72.0 ms   23 mbp  23.000 abp
newton x 4  89.6 ms   23 mbp  23.000 abp

See the site for downloadable source.

Question 2

Implementation below is 4x faster than std::pow with relatively higher tolerance (0.000001) on an AVX-512 CPU. It is made of vertically auto-vectorizable loops for every basic operation like multiplication and division so that it computes 8,16,32 elements at once instead of horizontally vectorizing the Newton-Raphson loop.

#include <cmath> 
/* 
   Newton-Raphson iterative solution
   f_err(x) = x*x*x - N
   f'_err(x) = 3*x*x
   x = x - (x*x*x - N)/(3*x*x)
   x = x - (x - N/(x*x))/3   <--- repeat until error < tolerance
   but with vertical-parallelization 
*/
template < typename Type, int Simd, int inverseTolerance> 
    inline 
void cubeRootFast(Type *const __restrict__ data, 
        Type *const __restrict__ result) noexcept 
{ 
    // alignment 64 required for AVX512 vectorization
    alignas(64) 
    Type xd[Simd]; 

    alignas(64) 
    Type resultData[Simd]; 

    alignas(64) 
    Type xSqr[Simd]; 

    alignas(64) 
    Type nDivXsqr[Simd]; 

    alignas(64) 
    Type diff[Simd]; 

    // cube root checking mask
    for (int i = 0; i < Simd; i++) 
    { 
        xd[i] = data[i] <= Type(0.000001);
    } 

    // skips division by zero if input is zero or close to zero
    for (int i = 0; i < Simd; i++) 
    { 
        resultData[i] = xd[i] ? Type(1.0) : data[i]; 
    } 

    // Newton-Raphson Iterations in parallel 
    bool work = true; 
    while (work) 
    { 
        // compute x*x
        for (int i = 0; i < Simd; i++) 
        { 
            xSqr[i] = resultData[i] *resultData[i]; 
        } 

        // compute N/(x*x)
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = data[i] / xSqr[i]; 
        } 

        // compute x - N/(x*x)
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = resultData[i] - nDivXsqr[i]; 
        } 

        // compute (x-N/(x*x))/3
        for (int i = 0; i < Simd; i++) 
        { 
            nDivXsqr[i] = nDivXsqr[i] / Type(3.0); 
        } 

        // compute x - (x-N/(x*x))/3
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = resultData[i] - nDivXsqr[i]; 
        } 

        // compute error
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = resultData[i] - diff[i]; 
        } 

        // compute absolute error
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = std::abs(diff[i]); 
        } 

        // compute condition to stop looping (error < tolerance)?
        for (int i = 0; i < Simd; i++) 
        { 
            diff[i] = diff[i] > Type(1.0/inverseTolerance); 
        } 

        // all SIMD lanes have to have zero work left to end
        Type check = 0; 
        for (int i = 0; i < Simd; i++) 
        { 
            check += diff[i]; 
        } 
        work = (check > Type(0.0)); 

        // compute the next x guess
        for (int i = 0; i < Simd; i++) 
        { 
            resultData[i] = resultData[i] - nDivXsqr[i]; 
        } 
    } 

    // if input was close to zero, output zero
    // output result otherwise
    for (int i = 0; i < Simd; i++) 
    { 
        result[i] = xd[i] ? Type(0.0) : resultData[i]; 
    } 
} 
#include <iostream> 
 
int main() 
{ 
    constexpr int n = 8192; 
    constexpr int simd = 16; 
    constexpr int inverseTolerance = 1000;
    float data[n]; 
    for (int i = 0; i < n; i++) 
    { 
        data[i] = i; 
    } 
    for (int i = 0; i < n; i += simd) 
    { 
        cubeRootFast<float, simd, inverseTolerance> (data + i, data + i); 
    } 
    for (int i = 0; i < 10; i++) 
        std::cout << data[i *i *i] << std::endl; 
    return 0; 
}

It is tested only with GCC so it may require extra MSVC-pragmas on each loop to force auto-vectorization. If you have OpenMP, then you can also use #pragma omp simd safelen(Simd) to achieve same thing.

The performance only holds within [0,1] range. To use bigger values, you should use range reduction like this:

// example: max value is 1000
for(auto & input:inputs)
    input = input/1000.0f // normalize

for(..)
    cubeRootFast<float, simd, inverseTolerance> (input + i, input + i)

 for(auto & input:inputs)
    input = 10.0f*input // de-normalize  (1000 = 10 x 10 x 10)

If you need only 0.005 error on low-range like [0,1000] with 16x speedup, you can try below implementation that uses polynomial approximation (Horner-Scheme is applied to compute with FMA instructions and no explicit auto-vectorization required since it doesn't include any branching/loop inside):

// optimized for [0,1] range: ~1 cycles on AVX512, 0.003 average error 
// polynomial approximation with Horner Scheme for FMA optimization 
template<typename T> 
T cubeRootFast(T x) 
{ 
 T xd = x-T(1.0); 
 T result = T(-55913.0/4782969.0); 
 result *= xd; 
 result += T(21505.0/1594323.0);  
 result *= xd; 
 result += T(-935.0/59049.0);  
 result *= xd; 
 result += T(374.0/19683.0);  
 result *= xd; 
 result += T(-154.0/6561.0);  
 result *= xd; 
 result += T(22.0/729.0);  
 result *= xd; 
 result += T(-10.0/243.0); 
 result *= xd; 
 result += T(5.0/81.0); 
 result *= xd; 
 result += T(-1.0/9.0);  
 result *= xd; 
 result += T(1.0/3.0); 
 result *= xd; 
 result += T(1.0); 
 return result; 
} 
 
// range reduction + dereduction: ~ 1 cycles on AVX512 
 for(int i=0;i<8192;i++) 
 {  
     float inp = input[i]; 
     // scaling + descaling for range [1,999] 
     float scaling = (inp>333.0f)?(1000.0f):(333.0f); 
     scaling = (inp>103.0f)?scaling:(103.0f); 
     scaling = (inp>29.0f)?scaling:(29.0f); 
     scaling = (inp>7.0f)?scaling:(7.0f); 
     scaling = (inp>3.0f)?scaling:(3.0f); 
      
     output[i] = powf(scaling,0.33333333333f)*cubeRootFast<float>(inp/scaling);  
 }