CUDA cube map textures

Question 1

OK - I managed to solve the problem myself, though the solution isn't as simple as using another CUDA function.

To bind a CUDA texture reference with any texture, be it one obtained from OpenGL or D3D, one has to provide a CUDA array that is mapped to a resource, using cuGraphicsSubResourceGetMappedArray to retrieve it. As I mentioned in the question, it is simple in case of a one or two dimensional texture. But with other available types it is more complicated.

At any time we need a CUDA array that the reference is bound to. Same goes with the cube map texture. But in such a case the array has to be a 3D one. The problem is that CUDA driver API provides only the aforementioned function to retrieve a single layer from such a texture resource, and map it to a single, two dimensional array. To get what we want we have to make ourselves the 3D array containing all the layers (or faces in case of a cube map).

First of all we have to get arrays for each layer/face using the above function. Next step is to create the 3D array by call to cuArray3DCreate, fed with proper set of parameters (size/number of layers, level of detail, data format, number of channels per texel and some flags). Then we have to copy the layers' arrays to the 3D one with a series of calls to cuMemcpy3D, one for each layer/face array.

Finally, we set our target CUDA texture reference with cuTexRefSetArray, fed with the 3D array we created and copied to. Inside of the device code we create a reference with proper texture type and mode (float4 and cube map) and sample it with texCubemap.

Below I put a fragment of the function which does all that, available in full length in CIRT Repository (cirt_server.c file, function cirtTexImage3D).

//...
if (result)
{
    // Create a 3D array...
    CUDA_ARRAY3D_DESCRIPTOR layeredTextureDescr;
    layeredTextureDescr.Width = w;
    layeredTextureDescr.Height = h;
    layeredTextureDescr.Depth = d;
    layeredTextureDescr.Format = map_type_to_format(type);
    layeredTextureDescr.NumChannels = format == CIRT_RGB ? CIRT_RGBA : format;
    layeredTextureDescr.Flags = map_target_to_flags(target);

    if (result) result = LogCUDADriverCall(cuArray3DCreate(&hTexRefArray, &layeredTextureDescr),
        FUN_NAME(": cuArray3DCreate_tex3D"), __FILE_LINE__);

    // Copy the acquired layer/face arrays into the collective 3D one...
    CUDA_MEMCPY3D layerCopyDescr;
    layerCopyDescr.srcMemoryType = CU_MEMORYTYPE_ARRAY;
    layerCopyDescr.srcXInBytes = 0;
    layerCopyDescr.srcZ = 0;
    layerCopyDescr.srcY = 0;
    layerCopyDescr.srcLOD = 0;

    layerCopyDescr.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    layerCopyDescr.dstLOD = 0;

    layerCopyDescr.WidthInBytes = layeredTextureDescr.NumChannels * w;
    layerCopyDescr.Height = h;
    layerCopyDescr.Depth = target == CIRT_TEXTURE_CUBE_MAP ? 1 : d;
    layerCopyDescr.dstArray = hTexRefArray;

    for (i = 0; i < num_layers; ++i)
    {
        layer = ((num_layers == 6) ? CU_CUBEMAP_FACE_POSITIVE_X + i : i);
        layerCopyDescr.dstXInBytes = 0;
        layerCopyDescr.dstY = 0;
        layerCopyDescr.dstZ = i;
        layerCopyDescr.srcArray = hLayres[i];

        if (result) result = LogCUDADriverCall(cuMemcpy3D(&layerCopyDescr), 
            FUN_NAME(": cuMemcpy3D_tex3D"), __FILE_LINE__);
    }

    // Finally bind the 3D array with texture reference...
    if (result) LogCUDADriverCall(cuTexRefSetArray(hTexRef, hTexRefArray, CU_TRSA_OVERRIDE_FORMAT),
        FUN_NAME(": cuTexRefSetArray_tex3D"), __FILE_LINE__);

    if (hLayres)
        free(hLayres);

    if (result)
        current->m_oTextureManager.m_cuTextureRes[current->m_oTextureManager.m_nTexCount++] = hTexResource;
}
//...

I've checked it with cube maps only for now but it should work just fine with 3D texture as well.

Question 2

I'm not real familiar with CUDA directly but I do have some experience in OpenGL and DirectX and I am also familiar with 3D Graphics Rendering APIs, Libraries and Pipelines and having the ability to setup and use those APIs.

When I look at your question(s):

How to deal with OpenGL cube map textures in CUDA?

And you proceed to explain it by this:

When one want to use OpenGL textures in CUDA kernel one of the things to do is to retrieve a CUDA array from registered image and mapped resource, in this case a texture. In driver API it is done by cuGraphicsSubResourceGetMappedArray call, which in a case of 2D texture is not a problem. But when talking about aforementioned cube map, third parameter of this function requires a face enum (like CU_CUBEMAP_FACE_POSITIVE_X). Thus some questions arise - when one passes such an enum, then the returned texture array will contain only data of that particular face, right? Then how to use cube texture as a whole, to perform cube mapping, likewise:
color = texCube(cubeMap, x, y, z);
Or is it impossible to do so in CUDA kernal and one need to use 2D textures with proper calculations and sampling in user code?

I went to CUDA's website for their API SDK & Programming Documentations. And found the function in question cuGraphicsSubResourceGetMappedArray()

CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
                                               CUgraphicsResource resource, 
                                               unsigned int arrayIndex,
                                               unsigned int mipLevel ) 
Get an array through which to access a subresource of a mapped graphics resource.

Parameters

pArray - Returned array through which a subresource of resource may be accessed

resource - Mapped resource to access

arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access

mipLevel - Mipmap level for the subresource to access

Returns

CUDA_SUCCESS, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_NOT_INITIALIZED,

CUDA_ERROR_INVALID_CONTEXT, CUDA_ERROR_INVALID_VALUE,

CUDA_ERROR_INVALID_HANDLE, CUDA_ERROR_NOT_MAPPED,

CUDA_ERROR_NOT_MAPPED_AS_ARRAY

Description

Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.

If resource is not a texture then it cannot be accessed via an array and CUDA_ERROR_NOT_MAPPED_AS_ARRAY is returned. If arrayIndex is not a valid array index for resource then CUDA_ERROR_INVALID_VALUE is returned. If mipLevel is not a valid mipmap level for resource then CUDA_ERROR_INVALID_VALUE is returned. If resource is not mapped then CUDA_ERROR_NOT_MAPPED is returned.

Note: Note that this function may also return error codes from previous, asynchronous launches.

See also:
cuGraphicsResourceGetMappedPointer
Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4ic22V4Dz Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

This function method was found in NVidia CUDA's DriverAPI and not in their RuntimeAPI. When understanding hardware with CUDA capability is that there is a difference between the Host and Device programmable pipelines which can be found here: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#axzz4ic6tFjXR

2. Heterogeneous Computing

CUDA programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more CUDA-enabled NVIDIA GPU devices.

While NVIDIA GPUs are frequently associated with graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution.

However, the device is based on a distinctly different design from the host system, and it's important to understand those differences and how they determine the performance of CUDA applications in order to use CUDA effectively.

2.1. Differences between Host and Device The primary differences are in threading model and in separate physical memories:

Threading resources - Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four hex-core processors today can run only 24 threads concurrently (or 48 if the CPUs support Hyper-Threading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.

Threads - Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or other state need occur when switching among GPU threads. Resources stay allocated to each thread until it completes its execution. In short, CPU cores are designed to minimize latency for one or two threads at a time each, whereas GPUs are designed to handle a large number of concurrent, lightweight threads in order to maximize throughput.

RAM - The host system and the device each have their own distinct attached physical memories. As the host and device memories are separated by the PCI Express (PCIe) bus, items in the host memory must occasionally be communicated across the bus to the device memory or vice versa as described in What Runs on a CUDA-Enabled Device?

These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document. Applications composed with these differences in mind can treat the host and device together as a cohesive heterogeneous system wherein each processing unit is leveraged to do the kind of work it does best: sequential work on the host and parallel work on the device.

Read more at: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#ixzz4ic8ch2fq Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

Now knowing that there are two different APIs for CUDAs API Libraries we have to understand the difference between the two found here: Difference Between the driver and runtime APIs

1. Difference between the driver and runtime APIs

The driver and runtime APIs are very similar and can for the most part be used interchangeably. However, there are some key differences worth noting between the two.

Complexity vs. control

The runtime API eases device code management by providing implicit initialization, context management, and module management. This leads to simpler code, but it also lacks the level of control that the driver API has.

In comparison, the driver API offers more fine-grained control, especially over contexts and module loading. Kernel launches are much more complex to implement, as the execution configuration and kernel parameters must be specified with explicit function calls. However, unlike the runtime, where all the kernels are automatically loaded during initialization and stay loaded for as long as the program runs, with the driver API it is possible to only keep the modules that are currently needed loaded, or even dynamically reload modules. The driver API is also language-independent as it only deals with cubin objects.

Context management

Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context." Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().

Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.

Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icCoAXb7 Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

Since this happens to be found in the DriverAPI it has more flexibility of control towards the programmer but also requires more responsibility to manage where the RuntimeAPI library does things more automatic but gives you less control.

This is apparent since you mentioned that you are working with their Kernels but from the description of their implementation of the function

 CUresult cuGraphicsSubResourceGetMappedArray ( CUarray* pArray,
                                                CUgraphicsResource resource, 
                                                unsigned int arrayIndex,
                                                unsigned int mipLevel )

The documentation is telling me that the first parameter that this function takes is a returned array through which a subresource of resource may be accessed. The second parameter of this function is the mapped graphics resource itself. The third parameter in which I believe is the parameter that you had in question where it is an enumerated type to a face and you then asked: When one passes such an enum, then the returned texture array will contain only data of that particular face, right? From what I gather and understand from the documentations is that this is an index value to an array of your cube map resource.

Which can be seen from their documentation:

arrayIndex - Array index for array textures or cubemap face index as defined by CUarray_cubemap_face for cubemap textures for the subresource to access

Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icHnwe9v Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

which happens to be an unsigned int or an index location into the textures that make up that cube map a typical cube map will have 6 faces or at most 12 if both inside and outside of the cube are mapped. So if we look at a cube map as well as textures and their relationship with pseudo code we can see that:

// Texture
struct Texture {
    unsigned pixelsWidth;
    unsigned pixelsHeight;        
    // Other Texture member variables or fields here.
};

// Only interested in the actual size of the texture `width by height`
// where these would be used to map this texture to one of the 6 faces
// of a cube:

struct CubeMap {
    Texture face[6];
    // face[0] = frontFace
    // face[1] = backFace
    // face[2] = leftFace
    // face[3] = rightFace
    // face[4] = topFace
    // face[5] = bottomFace
};

The cubemap object has an array of textures that makes up its face and according to the documents the function that you have in question with its third parameter is asking you for an index into this texture array and the overall function will return this:

Returns in *pArray an array through which the subresource of the mapped graphics resource resource which corresponds to array index arrayIndex and mipmap level mipLevel may be accessed. The value set in *pArray may change every time that resource is mapped.

Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4icKF1c00 Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

I hope this helps to answer your question in regards to the use of the third parameter into the function you are trying to use from their API.

Edit

The OP had asked when passing this enum CU_CUBEMAP_FACE_POSITIVE_X to the third parameter of the above function call will it return only that face of the cube map which happens to be a texture. When looking at their documentation about this enumerated value or type found here: enum CUarray_cubemap_face

enum CUarray_cubemap_face - Array indices for cube faces

Values

CU_CUBEMAP_FACE_POSITIVE_X = 0x00

Positive X face of cubemap

CU_CUBEMAP_FACE_NEGATIVE_X = 0x01

Negative X face of cubemap

CU_CUBEMAP_FACE_POSITIVE_Y = 0x02

Positive Y face of cubemap

CU_CUBEMAP_FACE_NEGATIVE_Y = 0x03

Negative Y face of cubemap

CU_CUBEMAP_FACE_POSITIVE_Z = 0x04

Positive Z face of cubemap

CU_CUBEMAP_FACE_NEGATIVE_Z = 0x05

Negative Z face of cubemap

Read more at: http://docs.nvidia.com/cuda/cuda-driver-api/index.html#ixzz4idOT67US Follow us: @GPUComputing on Twitter | NVIDIA on Facebook

It appears to me that when using this method to query or get texture information that is stored into an array of a cube map, that the requirement of the third parameter being this enumerated value; is nothing more than the 0-index into that array. So by passing in CU_CUBEMAP_FACE_POSITIVE_X as the third parameter to me doesn't necessarily mean that you will only get back that particular face's texture. It appears to me that since this is the 0th index that it will return the entire array of textures. The old C style of passing around arrays as if they were pointers.