Your fixed bandwidth requirement is very strict.
If a & b are both device-side buffers, my advice is to use the pinned-memory technique (typically CL_MEM_ALLOC_HOST_PTR combined with CL_MEM_READ_WRITE allocation flags for the host staging buffer). That said, the best results I've ever achieved were around 5.3 GB/s on PCIe 2.0 x16. Since a single memory transfer is usually a matter of microseconds, you can do the trick with non-blocking transfers plus event waiting on the host side. Such conveyor-fashion tasking usually shows good bandwidth.
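A minimal sketch of that pattern, assuming you already have a context, an in-order queue, and a device buffer from your own setup (`ctx`, `queue`, `dev_buf` are placeholders, and error checking is omitted for brevity):

```c
#include <CL/cl.h>

#define NBYTES (16 * 1024 * 1024)

void pinned_nonblocking_write(cl_context ctx, cl_command_queue queue,
                              cl_mem dev_buf)
{
    /* Pinned (page-locked) host staging buffer: CL_MEM_ALLOC_HOST_PTR
     * asks the runtime for host memory it can DMA from directly. */
    cl_mem pinned = clCreateBuffer(ctx,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, NBYTES, NULL, NULL);

    /* Map it to get an ordinary host pointer to fill with data. */
    float *host_ptr = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE,
        CL_MAP_WRITE, 0, NBYTES, 0, NULL, NULL, NULL);

    /* ... fill host_ptr with your data ... */

    /* CL_FALSE = non-blocking: the call returns immediately and the
     * transfer proceeds via DMA while the host keeps working. */
    cl_event xfer;
    clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, NBYTES,
                         host_ptr, 0, NULL, &xfer);

    /* Do other host-side work here, then wait only when you actually
     * need the transfer to have completed. */
    clWaitForEvents(1, &xfer);

    clReleaseEvent(xfer);
    clEnqueueUnmapMemObject(queue, pinned, host_ptr, 0, NULL, NULL);
    clReleaseMemObject(pinned);
}
```

The point is that the `clWaitForEvents` call is deferred: while the DMA engine moves the data, the host thread stays free to prepare the next chunk, which is what gives the conveyor its throughput.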
Generally (for the a, b, c & d buffers), my advice is to use a separate command queue for each direction of memory traffic, so that DMA transfers in different directions can overlap. Usually three command queues are enough: Host-to-Device, Device-to-Device & Device-to-Host.
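As a sketch of what that looks like (again assuming `ctx`, `dev`, buffers `a`/`b`, and a source pointer from your own code, with error checking omitted):

```c
#include <CL/cl.h>

void queue_per_direction(cl_context ctx, cl_device_id dev,
                         cl_mem a, cl_mem b, const void *src, size_t size)
{
    /* One in-order queue per traffic direction. */
    cl_command_queue h2d = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue d2d = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Non-blocking Host-to-Device write on its own queue ... */
    cl_event wrote;
    clEnqueueWriteBuffer(h2d, a, CL_FALSE, 0, size, src, 0, NULL, &wrote);

    /* ... and a Device-to-Device copy on another queue, ordered after
     * the write via its event rather than by sharing a queue. */
    clEnqueueCopyBuffer(d2d, a, b, 0, 0, size, 1, &wrote, NULL);

    clReleaseEvent(wrote);
    clFinish(d2d);
    clReleaseCommandQueue(d2d);
    clReleaseCommandQueue(h2d);
}
```

Ordering across queues is expressed through event wait lists, so independent transfers in different directions are free to run concurrently on separate DMA engines where the hardware supports it.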
GPU memory controllers are designed for high peak throughput at the cost of high latency, so issuing many asynchronous memory transfers is generally faster. Try to avoid synchronous (blocking) operations; they stall the whole pipeline.