CUDA: Memory copy to GPU 1 is slower in multi-GPU

https://stackoverflow.com/questions/2560446

23-09-2019
|

문제

My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem.

//#include <stdio.h>
//#include <stdlib.h>
//#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cutil.h>

__global__ void test_kernel(float *d_data) {
    int tid = blockDim.x*blockIdx.x + threadIdx.x;
    for (int i=0;i<10000;++i) {
        d_data[tid] = float(i*2.2);
        d_data[tid] += 3.3;
    }
}

int main(int argc, char* argv[])
{

    int deviceCount;                                                         
    cudaGetDeviceCount(&deviceCount);
    int device = 0; //SELECT GPU HERE
    cudaSetDevice(device);


    cudaEvent_t start, stop;
    unsigned int num_vals = 200000000;
    float *h_data = new float[num_vals];
    for (int i=0;i<num_vals;++i) {
        h_data[i] = float(i);
    }

    float *d_data = NULL;
    float malloc_timer;
    cudaEventCreate(&start);
    cudaEventCreate(&stop); cudaEventRecord( start, 0 );
    cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_data, sizeof(float)*num_vals);
    cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );


    float mem_timer;
    cudaEventCreate(&start);
    cudaEventCreate(&stop); cudaEventRecord( start, 0 );
    cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
    cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );

    float kernel_timer;
    cudaEventCreate(&start);
    cudaEventCreate(&stop); cudaEventRecord( start, 0 );
    test_kernel<<<1000,256>>>(d_data);
    cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop );
    cudaEventDestroy( start );
    cudaEventDestroy( stop );

    printf("cudaMalloc took %f ms\n",malloc_timer);
    printf("Copy to the GPU took %f ms\n",mem_timer);
    printf("Test Kernel took %f ms\n",kernel_timer);

    cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost);

    delete[] h_data;
    return 0;
}

The results are

GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms

GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took 663.182251 ms Test Kernel took 326.710785 ms

GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms

GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms

As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.

해결책

This was a driver issue. Updating to the latest driver fixed it

다른 팁

This may be an issue with your pci bus, try swapping the cards into different slots to see if the problem persists. If this is an issue, copy all your data onto the gtx295 via the faster slot and use sli top copy it across to the other (slow pci bus) gpu.

If you can utilized the faster video card's gddr to load, then you can do a device device tansfer at much MUCH higher bandwidth, that might help eliminate the issue also. Also, check your bandwidth with NVidia's bandwidth testing to get some physical results and test.

Good luck!

이 방법은 스택 문제를 피합니다.

public void MyMethod(Notifier not)
{ 
    if(not.HasValue())
    {
        string methodName = System.Reflection.MethodBase.GetCurrentMethod().Name;
        throw new Exception(methodName + ": " + not.Value);
    }
}

[때때로 예기치 않게 예기치 않은 결과가있을 수 있습니다. 예를 들어, 소규모 메소드 또는 속성은 릴리스 빌드에서 종종 인라인되며,이 경우 결과는 대신 발신자의 메소드 이름이 될 것입니다.]

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow