so I am currently trying to do something like A**b for some 2d ndarray and a double b in parallel for Python. I would like to do it with a C extension using OpenMP (yes I know, there is Cython etc. but at some point I always ran into trouble with those 'high-level' approaches...).
So here is the gaussian.c Code for my gaussian.so:
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
(A is a square double Matrix, out has the same shape and n is the number of rows/columns) So the point is to update some symmetric distance matrix - ind2 is the transposed index of ind1.
I compile it using gcc -shared -fopenmp -o gaussian.so -lm gaussian.c
. I access the function directly via ctypes in Python:
test = c_gaussian.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'), # array of sample
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'), # array of sampl
ctypes.c_int # number of samples
]
The function 'test' is working smoothly as long as I comment the #pragma line - otherwise it ends with error number 139.
A = np.random.rand(1000, 1000) + 2.0
out = np.empty((1000, 1000))
test(A, out, 1000)
When I change the inner loop to just print ind1 and ind2 it runs smoothly in parallel. It also works, when I just access the ind1 location and leave ind2 alone (even in parallel)! Where do I screw up the memory access? How can I fix this?
thank you!
Update: Well I guess this is running into the GIL, but I am not yet sure...
Update: Okay, I am pretty sure now, that it is evil GIL killing me here, so I altered the example:
I now have gil.c:
#include <Python.h>
#define _USE_MATH_DEFINES
#include <math.h>
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
Py_BEGIN_ALLOW_THREADS
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Py_END_ALLOW_THREADS
}
which is compiled using gcc -shared -fopenmp -o gil.so -lm gil.c -I /usr/include/python2.7 -L /usr/lib/python2.7/ -lpython2.7
and the corresponding Python file:
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer
import pylab as pl
path = '../src/gil.so'
c_gil = ctypes.cdll.LoadLibrary(path)
test = c_gil.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ctypes.c_int
]
n = 100
A = np.random.rand(n, n) + 2.0
out = np.empty((n,n))
test(A, out, n)
This gives me
Fatal Python error: PyEval_SaveThread: NULL tstate
Process finished with exit code 134
Now somehow it seems to not be able to save the current thread - but the API doc does not go into detail here, I was hoping that I could ignore Python when writing my C function, but this seems to be quite messy :( any ideas? I found this very helpful: GIL