cuda runtime api and dynamic kernel definition

Question

To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:

The CUDA Toolkit 3.0 Beta is now available.

Highlights for this release include:

CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.

There is documentation here which succinctly describes how the driver and runtime API interact.

To concretely answer your main question:

If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options?

The basic approach goes something like this:

Use the driver API to establish a context on the device as you would normally do.
Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish context on a given device number in the driver API, the same number will select the same GPU in the driver API
You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on runtime API "lazy" context establishment