Question

Everyone: recently I have been trying to use the newest feature of CUDA 5.5, dynamic parallelism, but I have run into a very confusing problem. My code is here:

    /* Includes, system */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <iostream>
    using namespace std;
    /* Includes, cuda */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    
    /* Includes, cuda helper functions */
    #include <helper_cuda.h>
    
    #include "kernels.cu"
    /* Matrix size */
    #define N  (275)
    
    #define LengthSignal (64)
    
    #define AmountSignal (255025)
    
    #define NBLOCKX (32768)
    
    #define NTHREADS_PER_BLOCK (128)
    /* Declaration of the function that computes sgemm using CUBLAS device API */
    
    __global__ void invokeDeviceCublasSgemm(float *d_A, float *Test);
    
    /* Main */
    int main(int argc, char **argv)
    {
      float *h_A;
      float *d_A = 0;
      int n2 = N * N;
    
      h_A = (float *)malloc(n2 * sizeof(h_A[0]));
      /* Fill the matrices with test data */
      for (int i = 0; i < n2; i++)
        {
          h_A[i] = rand() / (float)RAND_MAX;
        }

      cudaMalloc((void **)&d_A, n2 * sizeof(h_A[0]));
    
      /* Initialize the device matrices with the host matrices */
      //  cudaMemcpy(d_A, h_A, sizeof(float) * LengthSignal * AmountSignal, cudaMemcpyHostToDevice);
      cudaMemcpy(d_A, h_A, n2 * sizeof(h_A[0]), cudaMemcpyHostToDevice);
    
      int Length = 100;
      float *h_Test = (float *) malloc(sizeof(float) * Length);
      float *d_Test;
      cudaMalloc((void **) &d_Test, sizeof(float) * Length);
      cudaMemset(d_Test, 0, sizeof(float) * Length);

      invokeDeviceCublasSgemm<<<NBLOCKX, NTHREADS_PER_BLOCK>>>(d_A, d_Test);
      cudaMemcpy(h_Test, d_Test, sizeof(float) * Length, cudaMemcpyDeviceToHost);

      printf("\n The first 10 elements of d_A in location 1 are: \n");
      for (int j = 0; j < 10; j ++)
        {
          printf("%f ", h_Test[j]);
        }

      printf("\n The first 10 elements of d_A in location 2 are: \n");
      for (int j = 10; j < 20; j ++)
        {
          printf("%f ", h_Test[j]);
        }
      printf("\n");

      free(h_Test);
      cudaFree(d_Test);

      /* Memory clean up */
      free(h_A);
      cudaFree(d_A);
    }

And here is the kernel definition in kernels.cu, which is included above:

    #ifndef __GLOBAL__CU__
    #define __GLOBAL__CU__

    __global__ void invokeDeviceCublasSgemm(float *d_A, float *Test)
    {
      // save the first 10 elements of d_A in location 1
      for (int j = 0; j < 10; j ++)
        {
          Test[j] = d_A[j];
        }

      cublasHandle_t cnpHandle;
      cublasCreate(&cnpHandle);

      // save the first 10 elements of d_A in location 2
      for (int j = 10; j < 20; j ++)
        {
          Test[j] = d_A[j - 10];
        }

      cublasDestroy(cnpHandle);
    }

    #endif

If I set the launch configuration to <<<1, 1>>>, everything works well, and the output is:

    The first 10 elements of d_A in location 1 are:
    0.840188 0.394383 0.783099 0.798440 0.911647 0.197551 0.335223 0.768230 0.277775 0.553970
    The first 10 elements of d_A in location 2 are:
    0.840188 0.394383 0.783099 0.798440 0.911647 0.197551 0.335223 0.768230 0.277775 0.553970

However, if I set the launch configuration to <<<32768, 128>>>, the output is quite strange:

    The first 10 elements of d_A in location 1 are:
    -0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
    The first 10 elements of d_A in location 2 are:
    0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

I really don't know why! My code comes almost directly from the CUDA samples, with only small changes.


If I simply delete the last line, cublasDestroy(cnpHandle);, everything goes back to normal, and the output is:

    The first 10 elements of d_A in location 1 are:
    0.840188 0.394383 0.783099 0.798440 0.911647 0.197551 0.335223 0.768230 0.277775 0.553970
    The first 10 elements of d_A in location 2 are:
    0.840188 0.394383 0.783099 0.798440 0.911647 0.197551 0.335223 0.768230 0.277775 0.553970


Has anyone run into the same problem?

Thanks!


Solution

Do proper CUDA error checking. You can do it on your host API calls as well as your device API calls, your CUBLAS API calls, and your kernel launches. If you're unsure how, read the dynamic parallelism documentation.
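
As a minimal sketch of the host-side part (the CUDA_CHECK macro name and the decision to exit on error are just illustrative assumptions, not anything from your code), it could look like this:

    /* Hypothetical helper macro for host-side CUDA error checking */
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                            \
      do {                                                              \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
          fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                  cudaGetErrorString(err_), __FILE__, __LINE__);        \
          exit(EXIT_FAILURE);                                           \
        }                                                               \
      } while (0)

    /* ... */
    CUDA_CHECK(cudaMalloc((void **)&d_A, n2 * sizeof(h_A[0])));
    invokeDeviceCublasSgemm<<<NBLOCKX, NTHREADS_PER_BLOCK>>>(d_A, d_Test);
    CUDA_CHECK(cudaGetLastError());       /* catches launch-configuration errors   */
    CUDA_CHECK(cudaDeviceSynchronize());  /* surfaces errors from the kernel itself */

The point of the cudaGetLastError/cudaDeviceSynchronize pair after the launch is that kernel failures are asynchronous; without it they are easy to miss entirely.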

It's likely that you are exceeding the number of kernel launches that can be outstanding at any time. There is a (configurable) limit of 2048 kernel launches that can be pending at once. Since your code fails with a host launch configuration of <<<32768, 128>>>, you are launching 32768 x 128 = 4,194,304 threads, each of which may try to launch child kernels. Once the number of outstanding launches exceeds the limit, the remaining launches fail.
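
You can query and raise that limit from the host with cudaDeviceGetLimit/cudaDeviceSetLimit before launching the parent kernel. The value 32768 below is only an example, and raising the limit costs device memory, so this is a sketch rather than a real fix; the better fix is usually to limit how many threads actually do device-side launches:

    size_t pending = 0;
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("default pending launch count limit: %zu\n", pending);  /* typically 2048 */

    /* Example only: allow more outstanding child launches before the parent launch. */
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);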

"But I'm not launching any child kernels??" Actually, using the device CUBLAS API implies that kernels may get launched. That is how the device CUBLAS system works.

To really get clarity, I would again strongly advise that you do solid error checking.
