Problem

I'm working on a C# Fractal Generator project right now that requires a lot of arithmetic with complex numbers, and I'm trying to think of ways to speed up the math. Below is a simplified set of code that tests the speed of a Mandelbrot calculation using one of three data storage methods, shown in TestNumericsComplex, TestCustomComplex, and TestPairedDoubles. Please understand that the Mandelbrot is just an example; I intend for future developers to be able to create plug-in fractal formulas.

Basically I see that System.Numerics.Complex is the best of the three, while a pair of doubles or a custom Complex struct is only passable. I could perform the arithmetic on the GPU, but wouldn't that limit or break portability? I've tried varying the order of the inner loops (i, x, y) to no avail. What else can I do to speed up the inner loops? Am I running into page-fault issues? Would a fixed-point number system gain me any speed over the floating-point values?

I'm already aware of Parallel.For in C# 4.0; it is omitted from my code samples for clarity. I'm also aware that C# is not usually a good language for high-performance code; I'm using C# to take advantage of Reflection for plugins and WPF for windowing.
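Just to make that omission concrete, this is roughly how Parallel.For wraps the outer loop of TestNumericsComplex below; a sketch only, not what I benchmarked, and the method name is made up:

static void IterateParallel(System.Numerics.Complex[,] vals, System.Numerics.Complex[,] loc) {
    for (int i = 0; i < ITER; i++) {
        // each parallel iteration owns one value of x, so no two tasks touch the same element
        System.Threading.Tasks.Parallel.For(0, XL, x => {
            for (int y = 0; y < YL; y++) {
                if (vals[x, y].Real > 4) continue;
                vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
            }
        });
    }
}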

using System;
using System.Diagnostics;

namespace SpeedTest {
class Program {
    private const int ITER = 512;
    private const int XL = 1280, YL = 1024;

    static void Main(string[] args) {
        var timer = new Stopwatch();
        timer.Start();
        //TODO uncomment one of these three lines
        //TestCustomComplex();
        //TestNumericsComplex();
        //TestPairedDoubles();
        timer.Stop();
        Console.WriteLine(timer.ElapsedMilliseconds);
        Console.ReadKey();
    }

    /// <summary>
    /// ~14000 ms on my machine
    /// </summary>
    static void TestNumericsComplex() {
        var vals = new System.Numerics.Complex[XL,YL];
        var loc = new System.Numerics.Complex[XL,YL];

        for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
            loc[x, y] = new System.Numerics.Complex((x - XL/2)/256.0, (y - YL/2)/256.0);
            vals[x, y] = new System.Numerics.Complex(0, 0);
        }

        for (int i = 0; i < ITER; i++) {
            for (int x = 0; x < XL; x++)
            for (int y = 0; y < YL; y++) {
                if (vals[x, y].Real > 4) continue;
                vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
            }
        }
    }


    /// <summary>
    /// ~17000 ms on my machine
    /// </summary>
    static void TestPairedDoubles() {
        var vals = new double[XL, YL, 2];
        var loc = new double[XL, YL, 2];

        for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
                loc[x, y, 0] = (x - XL / 2) / 256.0;
                loc[x, y, 1] = (y - YL / 2) / 256.0;
                vals[x, y, 0] = 0;
                vals[x, y, 1] = 0;
            }

        for (int i = 0; i < ITER; i++) {
            for (int x = 0; x < XL; x++)
                for (int y = 0; y < YL; y++) {
                    if (vals[x, y, 0] > 4) continue;
                    var a = vals[x, y, 0] * vals[x, y, 0] - vals[x, y, 1] * vals[x, y, 1];
                    var b = vals[x, y, 0] * vals[x, y, 1] * 2;
                    vals[x, y, 0] = a + loc[x, y, 0];
                    vals[x, y, 1] = b + loc[x, y, 1];
                }
        }
    }


    /// <summary>
    /// ~16900 ms on my machine
    /// </summary>
    static void TestCustomComplex() {
        var vals = new Complex[XL, YL];
        var loc = new Complex[XL, YL];

        for (int x = 0; x < XL; x++) for (int y = 0; y < YL; y++) {
            loc[x, y] = new Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
            vals[x, y] = new Complex(0, 0);
        }

        for (int i = 0; i < ITER; i++) {
            for (int x = 0; x < XL; x++)
            for (int y = 0; y < YL; y++) {
                if (vals[x, y].Real > 4) continue;
                vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];
            }
        }
    }

}

public struct Complex {
    public double Real, Imaginary;
    public Complex(double a, double b) {
        Real = a;
        Imaginary = b;
    }
    public static Complex operator + (Complex a, Complex b) {
        return new Complex(a.Real + b.Real, a.Imaginary + b.Imaginary);
    }
    public static Complex operator * (Complex a, Complex b) {
        return new Complex(a.Real*b.Real - a.Imaginary*b.Imaginary, a.Real*b.Imaginary + a.Imaginary*b.Real);
    }
}

}

EDIT

GPU seems to be the only feasible solution; I'm disregarding C/C++ interoperability because I don't feel the speed-up would be significant enough to justify forcing interop on future plugins.

After looking into the available GPU options (which I've actually been examining for some time now), I've finally found what I believe is an excellent compromise. I've chosen OpenCL with the hope that most devices will support the standard by the time my program is released. OpenCLTemplate uses Cloo to provide an easy-to-understand interface between .NET (for application logic) and "OpenCL C99" (for parallel code). Plugins can include OpenCL kernels for hardware acceleration alongside a standard System.Numerics.Complex implementation for ease of integration.
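To make the plug-in idea concrete, here is a rough sketch of what such a kernel could look like, embedded as a C# string so a plug-in can carry it. The buffer layout (separate real/imaginary float arrays), the kernel name, and the baked-in iteration count are my own choices, and the commented host-side calls follow OpenCLTemplate's vector-sum walkthrough from memory, so verify them against the current docs:

static class MandelKernelSketch {
    // OpenCL C99 source. Floats keep the sketch simple; doubles would need the
    // cl_khr_fp64 extension. The escape test is the usual |z|^2 > 4 rather than
    // the simplified Real > 4 check used in the CPU benchmarks above.
    public const string Source = @"
__kernel void MandelIter(__global float* valRe, __global float* valIm,
                         __global const float* locRe, __global const float* locIm)
{
    int i = get_global_id(0);              // one work-item per pixel
    float zr = valRe[i], zi = valIm[i];
    float cr = locRe[i], ci = locIm[i];
    for (int n = 0; n < 512; n++) {        // ITER baked in for the sketch
        if (zr * zr + zi * zi > 4.0f) break;
        float t = zr * zr - zi * zi + cr;
        zi = 2.0f * zr * zi + ci;
        zr = t;
    }
    valRe[i] = zr;
    valIm[i] = zi;
}";

    // Host side, as in the OpenCLTemplate floatVectorSum example (from memory):
    // CLCalc.InitCL();
    // CLCalc.Program.Compile(new[] { Source });
    // var kernel = new CLCalc.Program.Kernel("MandelIter");
    // var args = new[] { new CLCalc.Program.Variable(valRe), new CLCalc.Program.Variable(valIm),
    //                    new CLCalc.Program.Variable(locRe), new CLCalc.Program.Variable(locIm) };
    // kernel.Execute(args, new[] { XL * YL });
    // args[0].ReadFromDeviceTo(valRe);
}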

I expect the number of available tutorials on writing OpenCL C99 code to grow rapidly as processor vendors adopt the standard. This keeps me from needing to force GPU coding on plugin developers while providing them with a well-formulated language should they choose to take advantage of the option. It also means that IronPython scripts will have equal access to GPU acceleration despite being unknown until compile time, since the code translates directly through OpenCL.

For anyone in the future interested in integrating GPU acceleration with a .Net project, I highly recommend OpenCLTemplate. There is an admitted overhead of learning OpenCL C99. However, it is only slightly harder than learning an alternative API and will likely have better support from examples and general communities.


Solution

I think your best bet is to look at offloading these calculations to a graphics card. OpenCL can use the graphics card for this sort of thing, and so can OpenGL shaders.

To really take advantage of this, you want to be calculating in parallel. Let's say you want to take the square root of a million numbers (simple, I know, but the principle is the same). On a CPU you can only do one at a time, or you can work out how many cores you have, say a reasonable 8, and have each core perform the calculation on a subset of the data.
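A minimal sketch of that one-core-per-subset idea in .NET, using Parallel.ForEach with a range partitioner (the class and method names are just for illustration):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class SqrtDemo {
    static void SqrtAll(double[] data) {
        // Partitioner.Create splits the index range into chunks,
        // and the runtime hands out roughly one chunk per core.
        Parallel.ForEach(Partitioner.Create(0, data.Length), range => {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(data[i]);   // each core works on its own subset
        });
    }
}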

If you offload your calculation to a graphics card, you would 'feed' in your data as, say, a quarter of a million 3D points in space (that's four floats per vertex) and then have a vertex shader calculate the square root of each xyzw component of each vertex. A graphics card has a hell of a lot more cores; even if it were only 100, it can still work on far more numbers at once than a CPU.

I can flesh this out with some more info if you want. I am no expert on shaders, but I need to get up to scratch with them anyway.

EDIT

Looking at a relatively cheap card such as the NVIDIA GT 220, you can see it has 48 'CUDA' cores. These are what you are using when you use things like OpenCL and shaders.

EDIT 2

OK, so it seems you're fairly interested in using GPU acceleration. I can't help you with OpenCL, I've never looked into it, but I assume it works much the same as OpenGL/DirectX applications that make use of shaders, just without the actual graphics output. I'm going to talk about the DirectX way of doing things, as that is what I (just about) know, but from my understanding it is more or less the same for OpenGL.

First, you need to create a window. As you want cross-platform support, GLUT is probably the best way to go; it's not the best library in the world, but it gives you a window nice and fast. Since you are not going to actually show any rendering, you could make it a tiny window, just big enough to set the title to something like "HARDWARE ACCELERATING".

Once you have your graphics card set up and ready to render with (the usual beginner tutorials will walk you through this), you will be at the stage where you can create 3D models and 'animate' them on screen.

Next you want to create a vertex buffer and populate it with your input data. A vertex would normally be three (or four) floats. If your values are all independent, that's fine, but if you need to group them together, say because you are in fact working with 2D vectors, then you need to make sure you 'pack' the data correctly. Say you want to do maths with 2D vectors and OpenGL is working with 3D vectors: then vector.x and vector.y are your actual input vector and vector.z would just be spare data.
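A sketch of that packing in C#: each 2D input value goes into the x/y slots of a four-float vertex, and z/w are left as the spare data mentioned above (names are illustrative):

static float[] PackAsVertices(float[] xs, float[] ys) {
    var verts = new float[xs.Length * 4];   // xyzw per vertex
    for (int i = 0; i < xs.Length; i++) {
        verts[4 * i + 0] = xs[i];           // x = first component of the 2D value
        verts[4 * i + 1] = ys[i];           // y = second component
        // indices +2 and +3 (z, w) stay 0: spare data the shader can ignore
    }
    return verts;
}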

You see, the vertex shader can only work with one vertex at a time; it can't see more than one vertex as input. You could look into using a geometry shader, which can look at bigger sets of data.

So, you set up a vertex buffer and push it over to the graphics card. You also need to write a 'vertex shader': a text file in a sort of C-like language that lets you perform some maths. It is not a full C implementation, mind, but it looks enough like C for you to know what you're doing. The exact ins and outs of OpenGL shaders are beyond me, but I am sure a simple tutorial is easy enough to find.
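To give a feel for what that text file looks like, here is a tiny old-style GLSL vertex shader for the square-root example, stored as a C# string only so a plug-in could carry it around (the varying name is made up):

const string SqrtShaderSource = @"
// GLSL vertex shader: per-component square root of each incoming vertex
varying vec4 result;            // to be picked up later as the 'output'
void main() {
    result = sqrt(gl_Vertex);   // sqrt() works component-wise on a vec4
    gl_Position = gl_Vertex;    // the position itself doesn't matter here
}";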

One thing you are on your own with is finding out exactly how to get the output of the vertex shader to go to a second buffer, which is effectively your output. A vertex shader does not change the vertex data in the buffer you set up; that is constant as far as the shader is concerned, but you can get the shader to write to a second buffer.

Your calculation would look something like this (pseudocode):

createVertexBuffer()
loadShader("path to shader code", vertexShader)   // something like this, I think
// begin 'rendering'
setShader(myVertexShader)
setVertexBuffer(myVertexBuffer)
drawPoints()        // will now 'draw' your points
readOutputBuffer()

I hope this helps. Like I said, I am still learning this, and even then I am learning the DirectX way of things.

Other Tips

By making your custom struct mutable I gained 30%. This reduces method calls and memory usage:

// instead of writing (in TestCustomComplex())
vals[x, y] = vals[x, y] * vals[x, y] + loc[x, y];

// use
vals[x, y].MutableMultiAdd(loc[x, y]);

// defined in the struct as
public void MutableMultiAdd(Complex other) {
    var tempReal = (Real * Real - Imaginary * Imaginary) + other.Real;
    Imaginary = (Real * Imaginary + Imaginary * Real) + other.Imaginary;
    Real = tempReal;
}

You can also use unsafe code with fixed blocks (unsafe { fixed (...) { ... } }) and access your arrays through raw pointers, the same trick used for fast matrix multiplies. Using this I gained 15% for TestCustomComplex():

private static void TestCustomComplex()
    {
        var vals = new Complex[XL, YL];
        var loc = new Complex[XL, YL];

        for (int x = 0; x < XL; x++)
            for (int y = 0; y < YL; y++)
            {
                loc[x, y] = new Complex((x - XL / 2) / 256.0, (y - YL / 2) / 256.0);
                vals[x, y] = new Complex(0, 0);
            }

        unsafe
        {
            fixed (Complex* p = vals, l = loc)
            {
                for (int i = 0; i < ITER; i++)
                {
                    for (int z = 0; z < XL*YL; z++)
                    {
                        if (p[z].Real > 4) continue;
                        p[z] = p[z] * p[z] + l[z];
                    }
                }
            }
        }
    }

Personally, if this is a major issue, I would create a C++ DLL and use that to do the arithmetic. You can call such a plugin from C#, so you can still take advantage of WPF, Reflection, etc.

One thing to note is that calling into the plugin isn't exactly fast, so you want to make sure you pass ALL your data in one go rather than calling it very often.
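A hypothetical P/Invoke declaration for that approach; "FractalMath.dll" and IterateAll are made-up names, but the point is the pattern of handing the whole grid over in one call:

using System.Runtime.InteropServices;

static class NativeMath {
    // One call hands over every pixel, so the interop overhead is paid once
    // per pass instead of once per complex multiplication.
    [DllImport("FractalMath.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void IterateAll(
        double[] valRe, double[] valIm,   // current z values, updated in place
        double[] locRe, double[] locIm,   // per-pixel constants c
        int count, int iterations);
}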
