Is "cudaMallocManaged" slower than "cudaMalloc"?

Question

cudaMallocManaged() is not about speeding up your application (with a few exceptions or corner cases, some are suggested below).

Today's implementation of Unified Memory and cudaMallocManaged will not be faster than intelligently written code written by a proficient CUDA programmer, to do the same thing. The machine (cuda runtime) is not smarter than you are as a programmer. cudaMallocManaged does not magically make the PCIE bus or general machine architectural limitations disappear.

Fast prototyping refers to the time it takes you to write the code, not the speed of the code.

cudaMallocManaged may be of interest to a proficient cuda programmer in the following situations:

You're interested in quickly getting a prototype together -i.e. you don't care about the last ounce of performance.
You are dealing with a complicated data structure which you use infrequently (e.g. a doubly linked list) which would otherwise be a chore to port to CUDA (since deep copies using ordinary CUDA code tend to be a chore). It's necessary for your application to work, but not part of the performance path.
You would ordinarily use zero-copy. There may be situations where using cudaMallocManaged could be faster than a naive or inefficient zero-copy approach.
You are working on a Jetson device.

cudaMallocManaged may be of interest to a non-proficient CUDA programmer in that it allows you to get your feet wet with CUDA along a possibly simpler learning curve. (However, note that naive usage of cudaMallocManaged may result in a CUDA kernels running slower than expected, see here and here.)

Although Maxwell is mentioned in the comments, CUDA UM will offer major new features with the Pascal generation of GPUs, in some settings, for some GPUs. In particular, Unified Memory in these settings will no longer be limited to the available GPU device memory, and the memory handling granularity will drop to the page level even when the kernel is running. You can read more about it here.