Question

So I am currently trying to compute something like A**b for a 2D ndarray A and a double b, in parallel, from Python. I would like to do it with a C extension using OpenMP (yes, I know about Cython etc., but at some point I always ran into trouble with those 'high-level' approaches...).

So here is the gaussian.c code for my gaussian.so:

#include <math.h>

void scale(const double *A, double *out, int n) {
    int i, j, ind1, ind2;
    double power, denom;
    power = 10.0 / M_PI;
    denom = sqrt(M_PI);

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            ind1 = i*n + j;  /* element (i, j) in row-major order */
            ind2 = j*n + i;  /* mirrored element (j, i) */
            out[ind1] = pow(A[ind1], power) / denom;
            out[ind2] = out[ind1];
        }
    }
}

(A is a square double matrix, out has the same shape, and n is the number of rows/columns.) The point is to update a symmetric distance matrix, so ind2 is the transposed index of ind1: for n = 4, i = 1 and j = 3 give ind1 = 7, i.e. element (1, 3), and ind2 = 13, i.e. element (3, 1).
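
As a sanity check it helps to diff the parallel result against a serial reference. Here is a sketch of such a reference version (not from the original post; it assumes the same #include <math.h> as above):

/* Serial reference for scale(): fills the upper triangle and mirrors
   it into the lower one - exactly what the OpenMP kernel intends. */
void scale_serial(const double *A, double *out, int n) {
    const double power = 10.0 / M_PI;
    const double denom = sqrt(M_PI);
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++) {
            out[i*n + j] = pow(A[i*n + j], power) / denom;
            out[j*n + i] = out[i*n + j];
        }
    }
}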

I compile it using gcc -shared -fopenmp -o gaussian.so -lm gaussian.c. I access the function directly via ctypes in Python (the library itself is loaded with ctypes.cdll.LoadLibrary, just as for gil.so in the full example further down):

test = c_gaussian.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
                           ndim=2,
                           flags='C_CONTIGUOUS'), # input array A
                 ndpointer(ctypes.c_double,
                           ndim=2,
                           flags='C_CONTIGUOUS'), # output array out
                 ctypes.c_int # number of samples
                 ]

The function 'test' works smoothly as long as I comment out the #pragma line - otherwise it ends with exit code 139 (i.e. a segmentation fault).

A = np.random.rand(1000, 1000) + 2.0
out = np.empty((1000, 1000))
test(A, out, 1000)

When I change the inner loop to just print ind1 and ind2, it runs smoothly in parallel. It also works when I only access the ind1 location and leave ind2 alone (even in parallel)! Where do I screw up the memory access? How can I fix this?

Thank you!

Update: Well, I guess this is running into the GIL, but I am not sure yet...

Update: Okay, I am pretty sure now that it is the evil GIL killing me here, so I altered the example:

I now have gil.c:

#include <Python.h>
#define _USE_MATH_DEFINES
#include <math.h>

void scale(const double *A, double *out, int n) {
    int i, j, ind1, ind2;
    double power, denom;
    power = 10.0 / M_PI;
    denom = sqrt(M_PI);
    Py_BEGIN_ALLOW_THREADS   /* expands to PyEval_SaveThread(); this is where it aborts */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            ind1 = i*n + j;
            ind2 = j*n + i;
            out[ind1] = pow(A[ind1], power) / denom;
            out[ind2] = out[ind1];
        }
    }
    Py_END_ALLOW_THREADS
}

which is compiled using gcc -shared -fopenmp -o gil.so -lm gil.c -I /usr/include/python2.7 -L /usr/lib/python2.7/ -lpython2.7. The corresponding Python file:

import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer
import pylab as pl

path = '../src/gil.so'
c_gil = ctypes.cdll.LoadLibrary(path)

test = c_gil.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
                           ndim=2,
                           flags='C_CONTIGUOUS'),
                 ndpointer(ctypes.c_double,
                           ndim=2,
                           flags='C_CONTIGUOUS'),
                 ctypes.c_int
                 ]

n = 100
A = np.random.rand(n, n) + 2.0
out = np.empty((n,n))

test(A, out, n)

This gives me

Fatal Python error: PyEval_SaveThread: NULL tstate

Process finished with exit code 134

Now somehow it seems unable to save the current thread state - but the C API docs do not go into detail here. I was hoping that I could ignore Python when writing my C function, but this seems to be quite messy :( Any ideas? I found this very helpful: GIL


Solution

Your problem is much simpler than you think and does not involve the GIL in any way. You are running into an out-of-bounds access to out[] when you index it via ind2, since j easily grows larger than n. The reason is simply that you have not applied any data-sharing clause to your parallel region, so all variables except i remain shared (the OpenMP default; the loop variable of a parallel for is automatically made private) and are therefore subject to data races - in this case multiple threads incrementing j simultaneously. An overly large j is less of a problem with ind1 = i*n + j, which overshoots by only a few elements, but it is fatal with ind2 = j*n + i, where the too-large value gets multiplied by n and the index lands far beyond the end of the n*n buffer. (The read-only variables power and denom are safe to leave shared. The Fatal Python error from your second experiment has a similarly mundane explanation: ctypes already releases the GIL around calls made through cdll, so Py_BEGIN_ALLOW_THREADS runs without a thread state to save - it is simply unnecessary here.)
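
To see the race in action, here is a minimal standalone sketch (not part of the original answer; everything in it is illustrative) that runs the same buggy sharing pattern and records the largest index it would have written to:

/* race_demo.c - the shared-j race from the question, made observable.
   Build: gcc -fopenmp race_demo.c -o race_demo
   Because this is a data race the result varies from run to run, but
   with several threads the maximum index typically exceeds n*n - 1. */
#include <stdio.h>

int main(void) {
    const int n = 1000;
    int i, j, ind2;     /* j and ind2 deliberately left shared, as in the bug */
    long max_ind = 0;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            ind2 = j*n + i;             /* another thread may bump j first */
            #pragma omp critical
            if (ind2 > max_ind) max_ind = ind2;
        }
    }

    printf("largest index used: %ld (valid range: 0..%d)\n", max_ind, n*n - 1);
    return 0;
}

(For hunting such bugs in general, gcc's -fsanitize=address turns the stray write into an immediate, precise error report in standalone tests; using it from inside Python additionally requires preloading the ASan runtime.)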

Simply make j, ind1 and ind2 private as they should be:

#pragma omp parallel for private(j,ind1,ind2)
for (i = 0; i < n; i++) {
    for (j = i; j < n; j++) {
        ind1 = i*n + j;
        ind2 = j*n + i;
        out[ind1] = pow(A[ind1], power) / denom;
        out[ind2] = out[ind1];
    }
}

Even better, declare them inside the scope where they are being used. That automatically makes them private:

#pragma omp parallel for
for (i = 0; i < n; i++) {
    int j;
    for (j = i; j < n; j++) {
        int ind1 = i*n + j;
        int ind2 = j*n + i;
        out[ind1] = pow(A[ind1], power) / denom;
        out[ind2] = out[ind1];
    }
}
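
Putting everything together, a complete fixed gaussian.c might look like the following sketch (the -fPIC flag is an addition; without it, linking the shared object can fail on 64-bit systems):

#include <math.h>

/* Loop counters live in their innermost scope, so they are private to
   each thread automatically; power and denom are only read, so leaving
   them shared is safe.
   Build e.g.: gcc -fPIC -shared -fopenmp -o gaussian.so gaussian.c -lm */
void scale(const double *A, double *out, int n) {
    const double power = 10.0 / M_PI;
    const double denom = sqrt(M_PI);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++) {
            const int ind1 = i*n + j;   /* element (i, j) */
            const int ind2 = j*n + i;   /* mirrored element (j, i) */
            out[ind1] = pow(A[ind1], power) / denom;
            out[ind2] = out[ind1];
        }
    }
}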
Licensed under: CC-BY-SA with attribution