Question

I'm wondering how host-side cl::Buffer objects are instantiated in a multi-device context.

Let's say I have an OCL environment class, which generates, from cl::Platform, ONE cl::Context:

this->ocl_context = cl::Context(CL_DEVICE_TYPE_GPU, con_prop);

And then, the corresponding set of devices:

this->ocl_devices = this->ocl_context.getInfo<CL_CONTEXT_DEVICES>();

I generate one cl::CommandQueue object and one set of cl::Kernel(s) for EACH device.

Let's say I have 4 GPUs of the same type. Now I have 4x cl::Device objects in ocl_devices.
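Roughly, that per-device queue setup looks like this (simplified sketch; the ocl_queues member name is just illustrative):

// One cl::CommandQueue per device, all sharing the single context.
// (Per-device kernel creation omitted; ocl_queues is an illustrative member name.)
for (const cl::Device &dev : this->ocl_devices) {
    this->ocl_queues.push_back(cl::CommandQueue(this->ocl_context, dev));
}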

Now, what happens when I have a second handler class to manage computation on each device:

oclHandler h1(cl::Context* c, cl::CommandQueue  cq1, std::vector<cl::Kernel*> k1);
...
oclHandler h2(cl::Context* c, cl::CommandQueue  cq4, std::vector<cl::Kernel*> k4);

And then, INSIDE EACH HANDLER, I instantiate:

oclHandler::classMemberFunction(
...
  this->buffer =
    cl::Buffer(
      *(this->ocl_context),
      CL_MEM_READ_WRITE,
      mem_size,
      NULL,
      NULL
    );

...
)

and then afterwards write to

oclHandler::classMemberFunction(
...
this->ocl_cq->enqueueWriteBuffer(
      this->buffer,
      CL_FALSE,
      static_cast<unsigned int>(0),
      mem_size,
      ptr,
      NULL,
      NULL
    );
...
this->ocl_cq->finish();
...
)

each buffer. My concern is that, because the buffer is instantiated for a cl::Context and not tied to a particular device, the memory might end up being allocated four times over, once on each device. I can't determine when the operation that says "on the device, this buffer runs from 0xXXXXXXXXXXXXXXXX for N bytes" occurs.

Should I instantiate one context per device? That seems unnatural, because I'd have to instantiate a context, see how many devices there are, and then instantiate d-1 more contexts... which seems inefficient. My concern is with limiting the available memory on the device side. I'm doing computations on massive data sets, and I'd probably be using all of the 6GB available on each card.
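For reference, the per-device-context variant I'm trying to avoid would look roughly like this (illustrative sketch only; error handling omitted):

// One cl::Context per device instead of one shared context.
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

std::vector<cl::Device> devices;
platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

std::vector<cl::Context> per_device_contexts;
for (const cl::Device &dev : devices) {
    std::vector<cl::Device> one_device{dev};
    per_device_contexts.push_back(cl::Context(one_device));
}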

Thanks.

EDIT: Is there a way to instantiate a buffer and fill it asynchronously without using a command queue? Let's say I have 4 devices, and one host-side buffer full of static, read-only data. Let's say that buffer is around 500MB in size. If I want to just use clCreateBuffer (via cl::Buffer), with

shared_buffer = new cl::Buffer(
  this->ocl_context,
  CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
  total_mem_size,
  ptr,
  NULL
);

that will start a blocking copy, and I can't do anything host side until all of ptr's contents are copied to the newly allocated memory. I have a multithreaded device management system, and I've created one cl::CommandQueue for each device, always passing &shared_buffer to every kernel::setArg call that needs it. I'm having a hard time wrapping my head around what to do.
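What I have in mind is something along these lines (just a sketch of what I mean by an asynchronous fill; ocl_queues is an illustrative name for my per-device queues):

// Create the buffer without host data, then issue a non-blocking write and
// keep the event around, so host-side work can continue in the meantime.
cl::Buffer shared_buffer(this->ocl_context, CL_MEM_READ_ONLY, total_mem_size);

cl::Event write_done;
this->ocl_queues[0].enqueueWriteBuffer(
    shared_buffer,
    CL_FALSE,          // non-blocking
    0,
    total_mem_size,
    ptr,
    NULL,
    &write_done);

// ... other host-side work ...
write_done.wait();     // wait only when the data is actually needed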


Solution

When you have a context that contains multiple devices, any buffer you create within that context is visible to all of its devices. This means that any device in the context can read from any buffer in the context, and the OpenCL implementation is responsible for making sure the data is actually moved to the correct device as and when that device needs it. There are some grey areas around what should happen if multiple devices try to access the same buffer at the same time, but this kind of behaviour is generally avoided anyway.
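For example, with one shared context and one command queue per device, the same buffer can be passed to kernels enqueued on different queues, and the runtime takes care of moving the data (a rough sketch; the queue and kernel names are placeholders):

// One buffer in the shared context, used from two per-device queues.
// (The data would be written into it via one of the queues before use.)
cl::Buffer shared(context, CL_MEM_READ_ONLY, mem_size);

kernel_dev0.setArg(0, shared);
queue_dev0.enqueueNDRangeKernel(kernel_dev0, cl::NullRange,
                                cl::NDRange(global_size), cl::NullRange);

kernel_dev1.setArg(0, shared);
queue_dev1.enqueueNDRangeKernel(kernel_dev1, cl::NullRange,
                                cl::NDRange(global_size), cl::NullRange);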

Although all of the buffers are visible to all of the devices, this doesn't necessarily mean that they will be allocated on all of the devices. All of the OpenCL implementations that I've worked with use an 'allocate-on-first-use' policy, whereby the buffer is allocated on the device only when it is needed by that device. So in your particular case, you should end up with one buffer per device, as long as each buffer is only used by one device.
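In code, the pattern that ends up with one allocation per device looks something like this (again just a sketch; queues, host_chunks and mem_size are placeholder names):

// One buffer per device, each one only ever touched through that device's
// queue, so an allocate-on-first-use implementation allocates buffers[i]
// only on the device behind queues[i].
std::vector<cl::Buffer> buffers;
for (size_t i = 0; i < queues.size(); ++i) {
    buffers.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, mem_size));
    queues[i].enqueueWriteBuffer(buffers[i], CL_FALSE, 0, mem_size,
                                 host_chunks[i], NULL, NULL);
}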

In theory an OpenCL implementation might pre-allocate all of the buffers on all of the devices just in case they are needed, but I wouldn't expect this to happen in practice (and I've certainly never seen it happen). If you are running on a platform that has a GPU profiler available, you can often use the profiler to see when and where buffer allocations and data movement actually occur, to convince yourself that the system isn't doing anything undesirable.
