Question

I have a general question about threads (or fibers) on Windows and Linux, in any programming language:

Is it possible to have a "2nd thread standing by" and have it quickly jump into action and help with a small task for a few milliseconds, without getting pre-empted? I should add that I am hoping for legible code without mutexes and spin locks.

To show that conventional thread pools do not work for small tasks, consider a matrix inversion problem in C#. I am using Ivan Kuckir's matrix class. I copy his Invert function and call it InvertParallel as follows:

public Matrix InvertParallel()   // modified from Ivan's Invert(), see link above
{
    if (L == null) MakeLU();
    Matrix inv = new Matrix(rows, cols);
    Parallel.ForEach<int>(Enumerable.Range(0, rows), rowID =>
    {
        Matrix Ei = Matrix.ZeroMatrix(rows, 1);
        Ei[rowID, 0] = 1;
        Matrix col = SolveWith(Ei);
        inv.SetCol(col, rowID);
    });
    return inv;
}

Then I invoke both the Invert and InvertParallel functions and measure the time each takes:

static void Main()
{
    Stopwatch sw = new Stopwatch(); sw.Start();
    Matrix A = Matrix.IdentityMatrix(50, 50);
    Matrix I1 = A.Invert();
    long elapsed1 = sw.ElapsedMilliseconds;
    Matrix I2 = A.InvertParallel();
    long elapsed2 = sw.ElapsedMilliseconds - elapsed1;
    Console.WriteLine("Matrix size=" + A.rows + ", Invert=" + elapsed1 + "ms, InvertParallel=" + elapsed2 + "ms");
}

The results make the point fairly clearly: for small tasks (matrix size 50), launching tasks from a thread pool is slower than running single-threaded:

Matrix size=50,  Invert=  5ms,  InvertParallel=21ms
Matrix size=100, Invert= 19ms,  InvertParallel=24ms
Matrix size=200, Invert=137ms,  InvertParallel=44ms

(Apologies - I had to remove all links from the notes below, as I am a new poster)

P.S. possibly related S/O articles: 'how to start two threads at the same time', 'Linux - threads and process scheduling priorities'

P.S. Yes I know there is a more clever algorithm for calculating the inverse, which scales as O(N^2.376).

P.P.S. I am not familiar with User-Mode Scheduling ('UMS' on Windows); can it help?

Solution

Passing tasks between threads, especially tasks with a significant amount of data associated with them, has non-trivial overhead, which can make small tasks run slower in parallel than single-threaded, as you have seen. You need to find the sweet spot where using more than one core is actually faster.
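One common way to handle that sweet spot is to branch on problem size: below a measured threshold, run sequentially; above it, fan out. A minimal sketch (the threshold value and the `SumSquares` workload here are purely illustrative; the right cutoff must be measured on the target machine):

```csharp
using System;
using System.Threading.Tasks;

class SweetSpotDemo
{
    // Hypothetical cutoff; in practice, benchmark to find where parallel wins.
    const int ParallelThreshold = 100;

    static long SumSquares(int n)
    {
        long total = 0;
        if (n < ParallelThreshold)
        {
            // Small task: thread-pool overhead would dominate, so stay sequential.
            for (int i = 0; i < n; i++) total += (long)i * i;
        }
        else
        {
            // Large task: fan out across cores, accumulating per-thread partials.
            object gate = new object();
            Parallel.For(0, n,
                () => 0L,                                    // thread-local init
                (i, state, local) => local + (long)i * i,    // per-iteration work
                local => { lock (gate) total += local; });   // merge partials
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumSquares(50));    // sequential path: prints 40425
        Console.WriteLine(SumSquares(1000));  // parallel path: prints 332833500
    }
}
```

The same structure applies to the matrix example: keep the plain `Invert` for small sizes and switch to `InvertParallel` only past the crossover point your own timings reveal.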

BTW you can get a small improvement by making the current thread one of the workers, saving one context switch and some copying of data between caches. E.g. if you have 8 cores, pass 1/8 of the work to each of 7 other threads and do the remaining 1/8 on the current thread.
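That suggestion can be sketched as follows: with P workers, hand P-1 slices to other threads and process the final slice on the calling thread instead of having it idle in `WaitAll`. The `ProcessRows` helper and its slicing scheme are illustrative, not from the original code:

```csharp
using System;
using System.Threading.Tasks;

class CurrentThreadWorker
{
    // Split [0, count) into roughly equal slices: one per core, with the
    // calling thread taking the last slice itself.
    static void ProcessRows(int count, Action<int> work)
    {
        int workers = Environment.ProcessorCount;
        int chunk = (count + workers - 1) / workers;
        var tasks = new Task[Math.Max(workers - 1, 0)];

        for (int w = 0; w < tasks.Length; w++)
        {
            int start = w * chunk;
            int end = Math.Min(start + chunk, count);
            tasks[w] = Task.Run(() =>
            {
                for (int i = start; i < end; i++) work(i);
            });
        }

        // The calling thread does the final slice instead of just waiting,
        // saving one handoff to the pool.
        for (int i = tasks.Length * chunk; i < count; i++) work(i);

        Task.WaitAll(tasks);
    }

    static void Main()
    {
        var data = new long[1000];
        ProcessRows(data.Length, i => data[i] = i);

        long sum = 0;
        foreach (var v in data) sum += v;
        Console.WriteLine(sum); // prints 499500
    }
}
```

Note that `Parallel.ForEach` already participates with the calling thread by default; the explicit version above just makes the mechanism visible for cases where you manage tasks yourself.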

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow