The speed of copying the same data from host to device should be similar no matter which copy API you use.
However, the size of each data block you copy matters a lot. Here is a benchmark showing the relationship between data size and copy speed using CUDA's cudaMemcpy().
See: "CUDA - how much slower is transferring over PCI-E?"
You can estimate the average speed from the figure above if you know how many copy calls you will make and the data size of each copy.
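As a rough sketch of that estimate: if you read the throughput for your copy size off the benchmark, the total transfer time is approximately the total number of bytes divided by that throughput. The function name `estimate_total_seconds` and its parameters are hypothetical, chosen only for illustration, and the throughput must come from your own measurement or the benchmark; nothing here queries the GPU.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of CUDA): estimated total transfer time in
// seconds for `num_copies` copies of `bytes_per_copy` bytes each.
// `throughput_bytes_per_sec` is the copy speed observed at that copy size,
// e.g. read off a benchmark plot; it is an input, not something measured here.
inline double estimate_total_seconds(std::size_t num_copies,
                                     std::size_t bytes_per_copy,
                                     double throughput_bytes_per_sec) {
    assert(throughput_bytes_per_sec > 0.0);
    const double total_bytes =
        static_cast<double>(num_copies) * static_cast<double>(bytes_per_copy);
    return total_bytes / throughput_bytes_per_sec;
}
```

Because small copies achieve a much lower throughput than large ones, evaluating this once for many small copies and once for a single large copy of the same total size should show why the copy size dominates the result.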
When the element size is small and the number of elements is large, copying only the changed elements one at a time, by invoking the copy API thousands of times, is definitely not a good idea: every one of those tiny transfers runs at the slow, small-size end of the benchmark.
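One possible alternative, which is an assumption here rather than something the benchmark prescribes, is to gather the changed elements into one contiguous host buffer so they can be sent with a single copy call. The sketch below shows only the host-side packing step; `pack_changed` is a hypothetical helper name, and the device-side scatter is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: gather the elements at `changed_indices` from
// `host_data` into one contiguous staging buffer, preserving the order of
// `changed_indices`.
template <typename T>
std::vector<T> pack_changed(const std::vector<T>& host_data,
                            const std::vector<std::size_t>& changed_indices) {
    std::vector<T> staging;
    staging.reserve(changed_indices.size());
    for (std::size_t idx : changed_indices) {
        assert(idx < host_data.size());  // caller must pass valid indices
        staging.push_back(host_data[idx]);
    }
    return staging;
}

// The staging buffer (and the index list) could then each be transferred with
// one call, for example:
//   cudaMemcpy(d_staging, staging.data(), staging.size() * sizeof(T),
//              cudaMemcpyHostToDevice);
// after which a kernel would scatter the values to their final positions.
// That device-side part is assumed and not shown here.
```

Whether this beats simply copying the whole array in one call depends on how many elements changed; the benchmark above is the place to check both options.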