That post seems clear, but I'll repeat it in other words: you can't pass a std::vector to a CUDA kernel. The reason is the following:
CUDA kernels need to use device code, i.e. code that runs on your GPU or code that can be translated into it. To generate that code, almost everything you write has to pass through a compilation chain that eventually produces intermediate code (PTX) or executable code for your GPU.
STL constructs and algorithms are host code: they haven't been written for the GPU, and there's NO device-equivalent code for them. Emulating them on the device would be slow, and it isn't even always possible.
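To make this concrete, a common workaround is to keep the std::vector on the host and hand the kernel a raw device pointer plus a length. The following is only a minimal sketch, assuming a vector of floats; the `scale` kernel and the `scale_on_device` helper are hypothetical names made up for illustration, not something from the original post.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// The kernel receives a raw device pointer and a length, never a std::vector.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical helper: copies the vector's contents to the GPU, runs the
// kernel, and copies the result back. Returns false on any CUDA error.
bool scale_on_device(std::vector<float>& host, float factor)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) {
        return true;  // nothing to do
    }
    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(float);

    float* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        return false;
    }

    // host.data() is the contiguous buffer behind the vector.
    bool ok = cudaMemcpy(d_data, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, n, factor);

        // The device-to-host copy waits for the kernel to finish first.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(host.data(), d_data, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_data);
    return ok;
}
```

Calling `scale_on_device(v, 2.0f)` on a `std::vector<float> v` should double every element in place, provided a CUDA-capable device is available. The vector itself never crosses into device code; only its underlying buffer is copied.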