CUDA kernel seems to re-run automatically to finish the vector addition. Why?

Question 1

Did you first launch it with ARRAY_SIZE threads and then with half (or 1/8) as many?

You are not initializing d_resultC, so it probably still holds the results of a previous execution. That would explain the behavior, although I can't be sure it is the cause.

Add a cudaMemset on d_resultC before the kernel launch and tell us what happens.
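
To make that test concrete, here is a minimal sketch of it. It assumes ARRAY_SIZE is 1024, fills the inputs with made-up values (1 and 2), uses a one-element-per-thread kernel as a stand-in for yours, and skips error checking for brevity. The output buffer is zeroed first and the kernel is then launched with only half as many threads as elements, so any element that is still zero was never written by this run.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE 1024   // assumed size, not taken from the original post

// One element per thread, no loop: a stand-in for the kernel in the question.
__global__ void VecAdd(const float *d_dataA, const float *d_dataB, float *d_resultC)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_resultC[i] = d_dataA[i] + d_dataB[i];
}

int main()
{
    const size_t bytes = ARRAY_SIZE * sizeof(float);
    static float h_A[ARRAY_SIZE], h_B[ARRAY_SIZE], h_C[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_dataA = nullptr, *d_dataB = nullptr, *d_resultC = nullptr;
    cudaMalloc((void **)&d_dataA, bytes);
    cudaMalloc((void **)&d_dataB, bytes);
    cudaMalloc((void **)&d_resultC, bytes);

    cudaMemcpy(d_dataA, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dataB, h_B, bytes, cudaMemcpyHostToDevice);

    // Zero the output so stale data from an earlier run cannot pass for results.
    cudaMemset(d_resultC, 0, bytes);

    // Deliberately launch only half as many threads as there are elements.
    VecAdd<<<1, ARRAY_SIZE / 2>>>(d_dataA, d_dataB, d_resultC);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_C, d_resultC, bytes, cudaMemcpyDeviceToHost);

    int written = 0;
    for (int i = 0; i < ARRAY_SIZE; ++i)
        if (h_C[i] != 0.0f) ++written;
    printf("%d of %d elements were written by this launch\n", written, ARRAY_SIZE);

    cudaFree(d_dataA);
    cudaFree(d_dataB);
    cudaFree(d_resultC);
    return 0;
}
```

If the count matches the number of threads launched, the extra "results" you saw before were most likely leftovers in uninitialized device memory.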

Question 2

I can't say for sure why your kernel is processing more elements than expected. It processes one element per thread, so the number of elements processed should definitely be blockDim.x*gridDim.x.

I do want to point out, though, that it's good practice to write kernels that use "grid-stride loops" so they don't depend so heavily on the block and thread count. The performance cost is negligible, and if you are performance-sensitive, the best block and grid sizes differ from one GPU to another anyway.

http://cudahandbook.to/15QbFWx

So you should add a count parameter (the number of elements to process), then write something like:

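// N is the number of elements to process, independent of the launch configuration.
__global__ void VecAdd(const float *d_dataA, const float *d_dataB, float *d_resultC, int N)
{
    // Grid-stride loop: each thread starts at its global index and
    // advances by the total number of threads in the grid.
    for ( int i = blockIdx.x*blockDim.x + threadIdx.x;
              i < N;
              i += blockDim.x*gridDim.x ) {
        d_resultC[i] = d_dataA[i] + d_dataB[i];
    }
}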
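
As a rough illustration of how such a kernel might be launched, the sketch below runs it on N = 1000 elements with only 4 blocks of 64 threads. The element count, launch configuration, and input values are all assumptions chosen for the example, and error handling is kept to a single check after the launch. Every element should still be computed, because each thread loops over several indices.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride version of the vector addition kernel.
__global__ void VecAdd(const float *d_dataA, const float *d_dataB, float *d_resultC, int N)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += blockDim.x * gridDim.x) {
        d_resultC[i] = d_dataA[i] + d_dataB[i];
    }
}

int main()
{
    const int N = 1000;                      // assumed element count
    const size_t bytes = N * sizeof(float);
    std::vector<float> h_A(N, 1.0f), h_B(N, 2.0f), h_C(N, 0.0f);

    float *d_dataA = nullptr, *d_dataB = nullptr, *d_resultC = nullptr;
    cudaMalloc((void **)&d_dataA, bytes);
    cudaMalloc((void **)&d_dataB, bytes);
    cudaMalloc((void **)&d_resultC, bytes);

    cudaMemcpy(d_dataA, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dataB, h_B.data(), bytes, cudaMemcpyHostToDevice);

    // 4 blocks of 64 threads = 256 threads, deliberately fewer than N.
    VecAdd<<<4, 64>>>(d_dataA, d_dataB, d_resultC, N);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(h_C.data(), d_resultC, bytes, cudaMemcpyDeviceToHost);

    int wrong = 0;
    for (int i = 0; i < N; ++i)
        if (h_C[i] != 3.0f) ++wrong;
    printf("%d mismatches out of %d elements\n", wrong, N);

    cudaFree(d_dataA);
    cudaFree(d_dataB);
    cudaFree(d_resultC);
    return wrong == 0 ? 0 : 1;
}
```

Because the loop bound is N rather than the thread count, the same kernel also stays in bounds when the grid has more threads than elements.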

Question 3

As others mentioned above, this may be caused by data left over from your previous run. Not freeing the memory you allocated may be the reason for this odd behavior. I think you should free the arrays allocated on the host using free, and also free the memory on the GPU using cudaFree.

I also strongly recommend allocating the host memory with cudaMallocHost instead of malloc and releasing it at the end of the program with cudaFreeHost. This gives you pinned (page-locked) host memory, which makes copies between host and device faster. See the documentation for cudaMallocHost.

In any case, don't forget to free heap memory in a C/C++ program, whether it uses CUDA or not.
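
Here is a minimal sketch of that allocation and cleanup pattern, assuming a single float buffer of about one million elements (the size and the variable names are made up for the example). It allocates pinned host memory with cudaMallocHost, copies it to the device and back, and releases each buffer with the call that matches its allocator.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;                   // assumed element count
    const size_t bytes = N * sizeof(float);

    float *h_data = nullptr;                 // pinned host buffer
    float *d_data = nullptr;                 // device buffer

    // Pinned (page-locked) host memory instead of malloc.
    if (cudaMallocHost((void **)&h_data, bytes) != cudaSuccess) {
        printf("cudaMallocHost failed\n");
        return 1;
    }
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        cudaFreeHost(h_data);
        return 1;
    }

    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    // Round trip: host -> device, clear the host copy, device -> host.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    for (int i = 0; i < N; ++i) h_data[i] = 0.0f;
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    int wrong = 0;
    for (int i = 0; i < N; ++i)
        if (h_data[i] != (float)i) ++wrong;
    printf("%d mismatches after the round trip\n", wrong);

    // Release each buffer with the function that matches its allocator:
    // cudaFree for cudaMalloc, cudaFreeHost for cudaMallocHost.
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return wrong == 0 ? 0 : 1;
}
```

Memory obtained from malloc would still be released with free. Keep in mind that pinned memory is a limited resource, so it is usually reserved for buffers that are actually transferred to or from the GPU.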