Memory layout mismatching between CPU and GPU code with CUDA

https://stackoverflow.com/questions/16641004

30-05-2022
|

Question

I'm experiencing a very weird situation. I have this template structures:

#ifdef __CUDACC__
#define __HOSTDEVICE __host__ __device__
#else
#define __HOSTDEVICE
#endif

template <typename T>
struct matrix
{
    T* ptr;
    int col_size, row_size;
    int stride;
    // some host & device methods
};

struct dummy1 {};
struct dummy2 : dummy1 {};

template <typename T>
struct a_functor : dummy2
{
    matriz<T> help_m;
    matrix<T> x, y;
    T *x_ptr, *y_ptr;
    int bsx, ind_thr;
    __HOSTDEVICE void operator()(T* __x, T* __y)
    {
        // functor code
    }
};

I've structured my code to separate cpp and cu files, so a_functor object is created in cpp file and used in a kernel function. The problem is that, executing operator() inside a kernel, I found some random behaviour I couldn't explain only looking at code. It was like my structs were sort of corrupted. So, calling a sizeof() on an a_functor object, I found:

CPU code (.cpp and .cu outside kernel): 64 bytes
GPU code (inside kernel): 68 bytes

There was obviously some kind of mismatching that ruined the whole stuff. Going further, I tracked the distance between struct parameter pointers and struct itself - to try to inspect the produced memory layout - and here's what I found:

a_functor foo;
// CPU
(char*)(&foo.help_m)    - (char*)(&foo)   = 0
(char*)(&foo.x)         - (char*)(&foo)   = 16
(char*)(&foo.y)         - (char*)(&foo)   = 32
(char*)(&foo.x_ptr)     - (char*)(&foo)   = 48
(char*)(&foo.y_ptr)     - (char*)(&foo)   = 52
(char*)(&foo.bsx)       - (char*)(&foo)   = 56
(char*)(&foo.ind_thr)   - (char*)(&foo)   = 60

// GPU - inside a_functor::operator(), in-kernel
(char*)(&this->help_m)  - (char*)(this)   = 4
(char*)(&this->x)       - (char*)(this)   = 20
(char*)(&this->y)       - (char*)(this)   = 36
(char*)(&this->x_ptr)   - (char*)(this)   = 52
(char*)(&this->y_ptr)   - (char*)(this)   = 56
(char*)(&this->bsx)     - (char*)(this)   = 60
(char*)(&this->ind_thr) - (char*)(this)   = 64

I really can't understand why nvcc generated this memory layout for my struct (what are that 4 bytes supposed to be/do!?!). I thought it could be an alignment problem and I tryed to explicitly align a_functor, but I can't because it is passed by value in kernel

template <typename T, typename Str>
__global__ void mykernel(Str foo, T* src, T*dst);

and when I try compile I get

error: cannot pass a parameter with a too large explicit alignment to a global routine on win32 platforms

So, to solve this strange situation (...and I do think that's an nvcc bug), what should I do? The only thing I can think of is playing with alignment and passing my struct to kernel by pointer to avoid the aforementioned error. However, I'm really wondering: why that memory layout mismatching?! It really makes no sense...

Further information: I'm using Visual Studio 2008, compiling with MSVC on Windows XP 32bit platform. I installed the latest CUDA Toolkit 5.0.35. My card is a GeForce GTX 570 (compute capability 2.0).

Solution

From the comments it appears there may be differences between the code you're actually running and the code you've posted, so it's difficult to give more than vague answers without someone being able to reproduce the problem. That said, on Windows there are cases where the layout and size of a struct can differ between the CPU and the GPU, these are documented in the programming guide:

On Windows, the CUDA compiler may produce a different memory layout, compared to the host Microsoft compiler, for a C++ object of class type T that satisfies any of the following conditions:

T has virtual functions or derives from a direct or indirect base class that has virtual functions;

T has a direct or indirect virtual base class;

T has multiple inheritance with more than one direct or indirect empty base class.

The size for such an object may also be different in host and device code. As long as type T is used exclusively in host or device code, the program should work correctly. Do not pass objects of type T between host and device code (e.g., as arguments to global functions or through cudaMemcpy*() calls).

The third case may apply in your case where you have an empty base class, do you have multiple inheritance in the real code?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow