Question

I would like to know what happens on the device (memory wise) when I allocate a structure and then allocate(?) and copy a pointer element of the same structure.

Do I need cudaMalloc of the element *a again?

Example code:

typedef struct {
  int *a;
  ...
} StructA;

int main() 
{
  int row, col, numS = 10; // defined at runtime

  StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));
  int *h_A = d_a->a;

  cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );

  cudaMalloc( &(d_A->a), row*col*sizeof(int) ); // no (void**) needed?
  cudaMemcpy( d_A->a, h_A, row*col*sizeof(int), cudaMemcpyHostToDevice );

  kernel<<<grid, block>>>(d_A); // Passing pointer to StructA in device
  ...
}

The kernel definition:

__global__ kernel(StructA *d_A)
{
  d_A->a = ...;
  ...
}

This question is another extension of this question and related to this question.

Solution

I would suggest that you put some effort into compiling and running your codes with proper CUDA error checking. Learning to interpret the compiler output and runtime output will make you a better, smarter, more efficient coder. I also suggest reviewing the writeup I previously pointed you at here. It deals with this exact topic, and includes linked worked examples. This question is a duplicate of that one.
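For reference, one common error-checking pattern is a wrapper macro along these lines (a sketch, not part of the CUDA API; the macro name is just an example):

#include <stdio.h>
#include <stdlib.h>

// wrap every runtime API call and abort with a located message on failure
#define cudaCheck(call)                                              \
  do {                                                               \
    cudaError_t err = (call);                                        \
    if (err != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                \
              cudaGetErrorString(err), __FILE__, __LINE__);          \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

// usage:
// cudaCheck( cudaMalloc((void**)&d_A, numS * sizeof(StructA)) );

Wrapping every runtime call this way turns silently-ignored failures into immediate error messages that tell you exactly which call failed and where.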

There are various errors:

StructA *d_A = (StructA*)malloc(numS * sizeof(StructA));

The above line of code creates an allocation in host memory large enough for numS structures of type StructA, and points d_A at the start of that allocation. Nothing wrong at the moment.

cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );

The above line of code creates an allocation in device memory of size numS * sizeof(StructA), and points d_A at the start of that allocation. This has effectively wiped out the previous pointer: the host allocation from the previous step still exists, but you no longer have any pointer to it, so it is leaked. Surely that was not your intent.
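What you presumably want is one pointer per memory space, so that neither allocation gets lost. A minimal sketch (the h_A/d_A naming matches the full example further down):

StructA *h_A = (StructA*)malloc(numS * sizeof(StructA)); // host copy
StructA *d_A = NULL;                                     // device copy
cudaMalloc((void**)&d_A, numS * sizeof(StructA));        // writes a device address into d_A
// h_A still points to valid host memory and can be filled in on the host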

int *h_A = d_a->a;

Now that d_A (I assume you meant d_A, not d_a) has been assigned a device memory address, the -> operation dereferences that pointer to locate the element a. Dereferencing a device pointer in host code is illegal and will typically crash with a segmentation fault.
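To make the distinction concrete, using the h_A/d_A naming from the sketch above (whether the bad line crashes or silently misbehaves is undefined behaviour):

int *p = h_A->a;    // fine: h_A holds a host address, so -> is an ordinary host dereference
// int *q = d_A->a; // broken: d_A holds a device address; dereferencing it in host code is illegal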

cudaMalloc( &(d_A->a), row*col*sizeof(int) );

This line of code has a similar issue. We cannot cudaMalloc into a pointer that lives in device memory: cudaMalloc must be handed the address of a pointer variable accessible on the host, into which it writes a device address. The expression &(d_A->a) forms that address by dereferencing a device pointer, which is again illegal in host code.

A proper code would be something like this:

$ cat t363.cu
#include <stdio.h>

typedef struct {
  int *a;
  int foo;
} StructA;

__global__ void kernel(StructA *data){

  printf("The value is %d\n", *(data->a + 2));
}

int main()
{
  int  numS = 1; // defined at runtime

  //allocate host memory for the structure storage
  StructA *h_A = (StructA*)malloc(numS * sizeof(StructA));
  //allocate host memory for the storage pointed to by the embedded pointer
  h_A->a = (int *)malloc(10*sizeof(int));
  // initialize data pointed to by the embedded pointer
  for (int i = 0; i <10; i++) *(h_A->a+i) = i;
  StructA *d_A;  // pointer for device structure storage
  //allocate device memory for the structure storage
  cudaMalloc( (void**)&(d_A), numS * sizeof(StructA) );
  // create a pointer for cudaMalloc to use for embedded pointer device storage
  int *temp;
  //allocate device storage for the embedded pointer storage
  cudaMalloc((void **)&temp, 10*sizeof(int));
  //copy this newly created *pointer* to its proper location in the device copy of the structure
  cudaMemcpy(&(d_A->a), &temp, sizeof(int *), cudaMemcpyHostToDevice);
  //copy the data pointed to by the embedded pointer from the host to the device
  cudaMemcpy(temp, h_A->a, 10*sizeof(int), cudaMemcpyHostToDevice);

  kernel<<<1, 1>>>(d_A); // Passing pointer to StructA in device
  cudaDeviceSynchronize();
}
$ nvcc -arch=sm_20 -o t363 t363.cu
$ cuda-memcheck ./t363
========= CUDA-MEMCHECK
The value is 2
========= ERROR SUMMARY: 0 errors
$
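The listing above omits cleanup for brevity; if this pattern lives inside a longer-running program, the matching frees would be:

cudaFree(temp);   // device storage for the embedded array
cudaFree(d_A);    // device storage for the structure itself
free(h_A->a);     // host storage for the embedded array
free(h_A);        // host storage for the structure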

You'll note that I haven't worked out the case where you are dealing with an array of StructA (i.e. numS > 1); that will require a loop. I'll leave it to you to work through the logic I've presented here and in my previous linked answer to see if you can work out the details of that loop. Furthermore, for the sake of clarity/brevity I've dispensed with the usual CUDA error checking, but please use it in your codes. Finally, this process (sometimes called a "deep copy operation") is somewhat tedious in ordinary CUDA, as you have probably concluded by now. Previous recommendations along these lines are to "flatten" such structures (so that they don't contain pointers), but you can also explore cudaMallocManaged, i.e. Unified Memory, in CUDA 6.
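As a rough illustration of that last suggestion, here is a sketch of the same example using cudaMallocManaged (it assumes CUDA 6 or later and a device/driver combination that supports Unified Memory):

#include <stdio.h>

typedef struct {
  int *a;
  int foo;
} StructA;

__global__ void kernel(StructA *data){
  printf("The value is %d\n", *(data->a + 2));
}

int main()
{
  StructA *A;
  // managed allocations are accessible from both host and device,
  // so the embedded pointer can be set up directly in host code
  cudaMallocManaged(&A, sizeof(StructA));
  cudaMallocManaged(&(A->a), 10*sizeof(int));
  for (int i = 0; i < 10; i++) A->a[i] = i;

  kernel<<<1, 1>>>(A);        // no explicit deep copy needed
  cudaDeviceSynchronize();    // required before the host touches managed data again

  cudaFree(A->a);
  cudaFree(A);
}

The deep copy disappears because the same pointer values are valid on both sides; the trade-off is that the runtime now manages migration of that data for you.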

Licensed under: CC-BY-SA with attribution