Usually a single CPU thread is used to launch CUDA kernels. However, since CUDA 4.0, multiple CPU threads can share a context: you can call cuCtxSetCurrent to bind the context to the calling thread, so that any CUDA work issued from that thread runs in that context. More information about this API function can be found in the CUDA Driver API documentation.
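A minimal sketch of that approach (driver API, POSIX threads): the main thread creates a context, and a second thread makes the same context current with cuCtxSetCurrent before doing any CUDA work. Error handling is reduced to a macro for brevity; a real program should check and handle every CUresult.

```cuda
#include <cuda.h>
#include <pthread.h>
#include <stdio.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) \
    fprintf(stderr, "CUDA error %d at %s:%d\n", r, __FILE__, __LINE__); } while (0)

static CUcontext ctx;  // shared by both threads

static void *worker(void *arg) {
    // Bind the shared context to this thread. After this call,
    // allocations and launches from this thread use that context.
    CHECK(cuCtxSetCurrent(ctx));
    CUdeviceptr d;
    CHECK(cuMemAlloc(&d, 1024));  // valid: the context is now current here
    CHECK(cuMemFree(d));
    return NULL;
}

int main(void) {
    CUdevice dev;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));  // context becomes current on this thread

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```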
Another workaround is to create a single GPU worker thread that owns the context and to forward every CUDA request from the other threads to that worker.
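A sketch of the worker-thread pattern under the same assumptions: one thread creates and keeps the context, and other threads hand it work items instead of touching CUDA directly. The single-slot "mailbox" and the `gpu_task`/`submit_to_gpu` names are illustrative only; a real implementation would use a proper producer/consumer queue.

```cuda
#include <cuda.h>
#include <pthread.h>
#include <stdbool.h>

typedef void (*gpu_task)(void *arg);  // hypothetical work-item type

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static gpu_task pending_task  = NULL;
static void    *pending_arg   = NULL;
static bool     shutting_down = false;

static void *gpu_worker(void *unused) {
    CUdevice dev; CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);  // this thread owns the context

    pthread_mutex_lock(&lock);
    while (!shutting_down) {
        while (pending_task == NULL && !shutting_down)
            pthread_cond_wait(&cond, &lock);
        if (pending_task) {
            gpu_task t = pending_task; void *a = pending_arg;
            pthread_mutex_unlock(&lock);
            t(a);                       // runs with the context current
            pthread_mutex_lock(&lock);
            pending_task = NULL;        // slot free: signal waiting submitters
            pthread_cond_broadcast(&cond);
        }
    }
    pthread_mutex_unlock(&lock);
    cuCtxDestroy(ctx);
    return NULL;
}

// Called from any CPU thread: hand a task to the worker and
// block until the worker has finished running it.
void submit_to_gpu(gpu_task t, void *arg) {
    pthread_mutex_lock(&lock);
    while (pending_task != NULL)        // wait for the slot to be free
        pthread_cond_wait(&cond, &lock);
    pending_task = t; pending_arg = arg;
    pthread_cond_broadcast(&cond);      // wake the worker
    while (pending_task != NULL)        // wait until the task has completed
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}
```

The advantage of this design is that no other thread ever needs a current context; the cost is that all GPU work is serialized through one thread.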
Regarding your other question: without the context set on the calling thread, I remember that cudaMalloc would not even execute (I work with JCuda, so the behavior may differ slightly). But if the proper context is current on the calling thread, allocations belong to that context and will not be overwritten by allocations made in another context.