Rewriting the first part of your code in your kernel's notation gives this:
for (row=0; row<N; row++) {
    for (col=0; col<N; col++) {
        for (n=0; n<N; n++) {
            temp=mat[row*N+n] && mat[n*N+col];
            B[row*N+col] = B[row*N+col] || temp;
        }
    }
}
So your kernel should be something like this:
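__global__ void gpu_booleanMM(char *mat, char *B, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    // Skip threads that fall outside the matrix when N is not a multiple of the block size.
    if (row >= N || col >= N) return;
    for (int n=0; n<N; n++) {
        char temp = mat[row*N+n] && mat[n*N+col];
        B[row*N+col] = B[row*N+col] || temp;
    }
}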
I doubt this is very efficient, but something like this should nevertheless give the correct result, provided B is set to zero before the launch, since each entry is OR-ed onto its previous value.
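To check the kernel's output, a CPU reference built from the same triple loop may help. The following is only a minimal sketch: the names cpu_booleanMM and matrices_equal are hypothetical (they are not from your code), and it assumes row-major N x N matrices with one char per entry, as in the kernel above.

```c
#include <assert.h>
#include <string.h>

/* CPU reference (hypothetical helper): B = mat * mat with AND as the
   multiply and OR as the add. Matrices are N x N, row-major, one char
   per entry. */
void cpu_booleanMM(const char *mat, char *B, int N)
{
    assert(mat != NULL && B != NULL && N > 0);
    memset(B, 0, (size_t)N * N);  /* the OR accumulation needs B to start at 0 */
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++) {
            char acc = 0;
            /* Stop early: once one term is true, the OR cannot change. */
            for (int n = 0; n < N && !acc; n++)
                acc = mat[row*N + n] && mat[n*N + col];
            B[row*N + col] = acc;
        }
    }
}

/* Hypothetical helper: returns 1 if two N x N results agree entry by
   entry, 0 otherwise. Assumes both hold only 0/1 values. */
int matrices_equal(const char *a, const char *b, int N)
{
    return memcmp(a, b, (size_t)N * N) == 0;
}
```

After copying the kernel's B back to the host with cudaMemcpy, passing that buffer and the CPU result to matrices_equal should return 1 if the kernel is correct. Accumulating in a local variable, as done here, would probably also help the kernel, since it avoids re-reading and re-writing B in global memory on every iteration.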