Hope this pic explain everything. The data is generated by bandwidthTest in CUDA samples. The hardware environment is PCI-E v2.0, Tesla M2090 and 2x Xeon E5-2609. Please note both axises are in log scale.
Given this figure, we can see that the overhead of launching a transfer request takes a constant time. Regression analysis on the data gives an estimated overhead time of 4.9us for H2D, 3.3us for D2H and 3.0us for D2D.