I can't say for sure why your kernel is processing more elements than expected. It processes one element per thread, so the number of elements processed should be exactly blockDim.x*gridDim.x.
I do want to point out, though, that it's good practice to write kernels that use "grid-stride loops" so they don't depend so heavily on the block and thread count. The performance cost is negligible, and if you are performance-sensitive, the best block and grid sizes differ from one GPU to another, so it helps to be able to change them without touching the kernel.
http://cudahandbook.to/15QbFWx
So you should add a count parameter (the number of elements to process), then write something like:
__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC, int N)
{
    for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        d_resultC[i] = d_dataA[i] + d_dataB[i];
    }
}
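To see that the result no longer depends on the launch configuration, here is a minimal host-side sketch. The helper name runVecAdd, the test values, and the two launch configurations are my own assumptions for illustration, not something from your code; it launches the kernel above with far fewer threads than elements and checks every output.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride kernel from the answer above: each thread handles elements
// i, i + stride, i + 2*stride, ... where stride is the total thread count.
__global__ void VecAdd(float *d_dataA, float *d_dataB, float *d_resultC, int N)
{
    for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
          i < N;
          i += blockDim.x*gridDim.x ) {
        d_resultC[i] = d_dataA[i] + d_dataB[i];
    }
}

// Hypothetical helper (not from the original post): runs VecAdd on n elements
// with the given launch configuration and returns the number of wrong
// outputs, or -1 if any CUDA call fails.
int runVecAdd(int n, int blocks, int threadsPerBlock)
{
    std::vector<float> hA(n), hB(n), hC(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        hA[i] = (float)i;        // small integers, exactly representable
        hB[i] = 2.0f * (float)i;
    }

    size_t bytes = (size_t)n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        VecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess        // launch errors
          && cudaDeviceSynchronize() == cudaSuccess   // execution errors
          && cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts null pointers, so this is safe on the failure path too.
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hC[i] != hA[i] + hB[i]) ++mismatches;
    }
    return mismatches;
}

int main()
{
    // 1000 elements with only 2*64 = 128 threads: each thread loops several times.
    printf("small grid: %d mismatches\n", runVecAdd(1000, 2, 64));
    // A larger problem with a different configuration, same kernel.
    printf("large grid: %d mismatches\n", runVecAdd(1 << 20, 32, 256));
    return 0;
}
```

With 2 blocks of 64 threads and N = 1000, each of the 128 threads handles seven or eight elements, and both runs should report zero mismatches on a working GPU. The i < N test also means that launching more threads than elements is harmless, which is the kind of overrun the original one-element-per-thread kernel cannot guard against.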