Does CUDA really not have a calloc()-like API call?

Question 1

Is there really no API functionality for allocating a buffer initialized to all-zeros?

There really is not.

Is there something better I can do than cudaMalloc() followed by cudaMemset()?

You could use a macro, if it's a matter of convenience (you haven't told us what you mean by "better", given that the answer to the first question is no):

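/* A: address of the device pointer, B: element count, C: element size in bytes.
   Arguments are parenthesized so that expressions such as n + 1 expand correctly. */
#define cudaCalloc(A, B, C) \
    do { \
        cudaError_t cudaCalloc_err_ = cudaMalloc((A), (B)*(C)); \
        if (cudaCalloc_err_ == cudaSuccess) cudaMemset(*(A), 0, (B)*(C)); \
    } while (0)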

The above macro will work with the kind of error checking I usually do, which is based on using cudaGetLastError(); or you can build your preferred error checking directly into the macro, if you like. See this question about error handling.
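As an illustration of building the error check into the macro itself, here is a minimal sketch. The name CUDA_CALLOC_CHECKED and the choice to report failures on stderr are assumptions made for this example, not part of the CUDA API; adapt the reporting to whatever error-handling convention you already use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical macro (not a CUDA API): allocate count * size bytes, zero them,
// and print the first failure with file and line. Like the macro above, it
// evaluates its arguments more than once, so avoid side effects in them.
#define CUDA_CALLOC_CHECKED(ptr, count, size)                              \
    do {                                                                   \
        cudaError_t err_ = cudaMalloc((void **)(ptr), (count) * (size));   \
        if (err_ == cudaSuccess)                                           \
            err_ = cudaMemset(*(ptr), 0, (count) * (size));                \
        if (err_ != cudaSuccess)                                           \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,             \
                    cudaGetErrorString(err_));                             \
    } while (0)
```

A call would look like CUDA_CALLOC_CHECKED(&d_buf, n, sizeof(float)); with d_buf declared as float *.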

Question 2

If all you want is a simple way to zero out new allocations, you can use thrust::device_vector, which value-initializes its elements when constructed with a size. For primitive types that means zero, which is the same behavior as calloc().
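A minimal sketch of the Thrust approach, assuming the file is compiled with nvcc. The function name device_vector_starts_zeroed is made up for this example; it only constructs a vector, copies it back to the host, and checks the contents.

```cuda
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical helper: returns true if a freshly constructed
// device_vector<float> of n elements contains only zeros.
bool device_vector_starts_zeroed(std::size_t n)
{
    // Allocates device memory for n floats and fills them with float(), i.e. 0.
    thrust::device_vector<float> d_buf(n);

    // A raw pointer for kernel launches is available via
    // thrust::raw_pointer_cast(d_buf.data()).

    // Copy back to the host to inspect the contents.
    thrust::host_vector<float> h_buf = d_buf;
    for (std::size_t i = 0; i < h_buf.size(); ++i) {
        if (h_buf[i] != 0.0f) return false;
    }
    return true;
}   // d_buf frees its device memory here
```

The vector also releases the allocation when it goes out of scope, so there is no matching cudaFree() to remember.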

Question 3

There is no calloc()-like functionality in the CUDA Runtime API, nor another, lower-level equivalent. Instead, you can do the following:

cudaMalloc(&ptr, size);
cudaMemset(ptr, 0, size);

Note that this is all synchronous. There's a cudaMemsetAsync() as well, although, frankly, cudaMalloc() calls are currently slow enough that it doesn't really matter.
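For completeness, a sketch of the cudaMemsetAsync() variant. The helper name callocOnStream and the cleanup-on-failure policy are assumptions for this example. The zero-fill is enqueued on the given stream, so work submitted later to that same stream sees the zeros; anything on another stream, or on the host, has to synchronize first.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: allocate `size` bytes and enqueue a zero-fill on `stream`.
inline cudaError_t callocOnStream(void **devPtr, std::size_t size, cudaStream_t stream)
{
    cudaError_t err = cudaMalloc(devPtr, size);
    if (err != cudaSuccess) return err;

    // Enqueue the fill; it is ordered before later work on the same stream.
    err = cudaMemsetAsync(*devPtr, 0, size, stream);
    if (err != cudaSuccess) {
        // Do not leak the allocation if the memset could not be enqueued.
        cudaFree(*devPtr);
        *devPtr = nullptr;
    }
    return err;
}
```

Call cudaStreamSynchronize(stream) before reading the buffer from the host or from a different stream.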

Question 4

Here is a solution with an inline function. devPtr is expected to be a pointer to a pointer of any type. Declaring the parameter as void* spares the caller a cast.

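// devPtr: address of the device pointer to fill in; size: byte count.
inline cudaError_t
_cuda_calloc( void *devPtr, size_t size )
{
  cudaError_t err = cudaMalloc( (void**)devPtr, size );
  // Zero the buffer only if the allocation succeeded.
  if( err == cudaSuccess ) err = cudaMemset( *(void**)devPtr, 0, size );
  return err;
}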
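
To show what the call site looks like, here is a small round-trip sketch. The helper is repeated so the snippet stands alone; the function name zeroedAllocationRoundTrip is invented for this example. It allocates n ints, copies them back to the host, and confirms they are all zero.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Copy of the helper above, repeated so this sketch is self-contained.
inline cudaError_t
_cuda_calloc( void *devPtr, size_t size )
{
  cudaError_t err = cudaMalloc( (void**)devPtr, size );
  if( err == cudaSuccess ) err = cudaMemset( *(void**)devPtr, 0, size );
  return err;
}

// Hypothetical check: returns true if a zeroed allocation of n ints
// reads back as all zeros.
bool zeroedAllocationRoundTrip( std::size_t n )
{
  int *d_data = nullptr;
  // No cast needed: int** converts implicitly to void*.
  if( _cuda_calloc( &d_data, n * sizeof(int) ) != cudaSuccess ) return false;

  std::vector<int> h_data( n, -1 );   // sentinel values, overwritten by the copy
  cudaError_t err = cudaMemcpy( h_data.data(), d_data, n * sizeof(int),
                                cudaMemcpyDeviceToHost );
  cudaFree( d_data );
  if( err != cudaSuccess ) return false;

  for( int v : h_data ) if( v != 0 ) return false;
  return true;
}
```

One trade-off of the void* parameter is that the compiler can no longer catch a caller who passes d_data instead of &d_data.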