Disabled ECC support for Tesla C2070 and Ubuntu 12.04

https://stackoverflow.com/questions/12295768

30-06-2021

Question

I have a headless workstation running Ubuntu 12.04 Server and recently installed a new Tesla C2070 card. When I run the examples from the CUDA SDK, I get the following error:

NVIDIA_GPU_Computing_SDK/C/bin/linux/release% ./reduction 
[reduction] starting...

Using Device 0: Tesla C2070

Reducing array of type int

16777216 elements
256 threads (max)
64 blocks

reduction.cpp(473) : cudaSafeCallNoSync() Runtime API error 39 : uncorrectable ECC error encountered.

In fact, this error occurs with every example except "deviceQuery".

I'm using kernel 3.2.0, NVIDIA driver 295.41 and CUDA 4.2.9.

After a lot of searching, I found a suggestion to disable ECC support with:

   nvidia-smi -g 0 --ecc-config=0

which worked. But my question is: how reliable will GPU computing be with ECC support disabled?

Any advice, suggestions or solutions would be highly appreciated.

-Konstantin

Solution

I'm wondering if this may be some sort of compatibility issue rather than a bad card. I'm suffering from the same problem with a Tesla C2075 on the same Ubuntu version. We contacted NVIDIA, and they told us that double-bit ECC errors (as reported by nvidia-smi -q on Linux) meant that the card was probably broken. We obtained a replacement, but it has exactly the same issues.

It seems unlikely that both boards I have had are broken in the same way, so we're going to try the card in another machine if we can find a suitable one.

I'll post anything interesting that we learn.

Other tips

I'll echo what aland said and add my own experience.

I worked with a number of Fermi-equipped compute clusters and tested them with ECC both on and off. We turned ECC off to increase the amount of available memory and the speed of the computations, and the difference was noticeable. nvidia-smi never reported any ECC errors for those cards with ECC on, nor did we ever encounter any runtime errors indicative of ECC-related problems.
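To put a rough number on the memory gain: NVIDIA's board specification for the Fermi-generation Tesla C2050/C2070 states that enabling ECC reserves 12.5% of the on-board memory for ECC bits. The sketch below only does that arithmetic for the 6 GB C2070 from the question. The function name is made up for illustration, and the 12.5% figure should not be assumed for other GPU generations.

```python
# Fraction of on-board memory reserved for ECC bits on Fermi Tesla boards
# (figure from NVIDIA's C2050/C2070 board specification; an assumption
# for any other card).
ECC_OVERHEAD = 0.125


def usable_memory_gb(total_gb, ecc_enabled):
    """Return the user-visible memory in GB for a given ECC setting."""
    if ecc_enabled:
        return total_gb * (1 - ECC_OVERHEAD)
    return total_gb


print(usable_memory_gb(6.0, ecc_enabled=True))   # 5.25
print(usable_memory_gb(6.0, ecc_enabled=False))  # 6.0
```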

If your card is detecting uncorrectable ECC errors, that indicates a flaw in the hardware, and turning ECC off only masks the problem. The runtime is rightly warning you that something has gone wrong and that you can't depend on the results.

You can try running your calculations anyway and see what happens, but be prepared for anything to go absolutely crazy for no apparent reason. A single bit flipped here or there can have enormous consequences for floating-point math, for example, and may flat out crash your kernel if an instruction gets corrupted.
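To make the floating-point point concrete, here is a small host-side sketch (plain Python, not GPU code) that inverts a single bit in the IEEE 754 double-precision encoding of 1.0. The helper name flip_bit is made up for this illustration. It only shows how much damage one bit can do depending on where it lands; it says nothing about which bits a faulty card would actually corrupt.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR toggles exactly one bit; bit 0 is the least significant mantissa bit.
    raw ^= 1 << bit
    # Reinterpret the modified integer as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw))
    return flipped


x = 1.0
print(flip_bit(x, 0))   # lowest mantissa bit:   1.0000000000000002
print(flip_bit(x, 51))  # highest mantissa bit:  1.5
print(flip_bit(x, 62))  # highest exponent bit:  inf
print(flip_bit(x, 63))  # sign bit:              -1.0
```

A flip in the low mantissa bits is barely visible, while a flip in the exponent or sign changes the value completely, which is why results computed after an uncorrectable error cannot be trusted.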

If you can, I would try to get the card replaced rather than masking the symptoms.

It turned out my case is the same as carthurs's. I also got my card replaced, but the error didn't go away. It disappeared only after I set the motherboard's onboard VGA as primary in the BIOS. There should be a warning about this in the Tesla installation manual!

Thanks everybody for the help.

Once an uncorrectable GPU ECC error occurs, the GPU may be in an unstable state (e.g. data corruption could have occurred not only in user-allocated memory but also in memory regions necessary for GPU operation). To recover the GPU, you need to either power-cycle or reboot your system, or try the GPU reset option of nvidia-smi:

nvidia-smi -h
...
-r    --gpu-reset           Trigger secondary bus reset of the GPU.
                            Can be used to reset GPU HW state in situations
                            that would otherwise require a machine reboot.
                            Typically useful if a double bit ECC error has
                            occurred.
                            --id= switch is mandatory for this switch

Run man nvidia-smi for more help on this topic.
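If you want to script the reset, a minimal sketch could look like the following. The --gpu-reset and --id= switches are taken from the help text quoted above; the function names and the dry-run default are my own choices for illustration. Whether a given driver version and card actually support the reset is not something this checks, and the reset typically needs root privileges with no other process using the GPU.

```python
import shutil
import subprocess


def build_reset_command(gpu_id):
    """Build the nvidia-smi argument list for resetting one GPU."""
    if gpu_id < 0:
        raise ValueError("gpu_id must be non-negative")
    # Both switches come from the quoted `nvidia-smi -h` output.
    return ["nvidia-smi", "--gpu-reset", "--id={}".format(gpu_id)]


def reset_gpu(gpu_id, dry_run=True):
    """Print the reset command; run it only when dry_run is False.

    Returns 0 for a dry run, otherwise the exit code of nvidia-smi.
    """
    cmd = build_reset_command(gpu_id)
    print(" ".join(cmd))
    if dry_run:
        return 0
    if shutil.which("nvidia-smi") is None:
        raise FileNotFoundError("nvidia-smi not found on PATH")
    return subprocess.run(cmd, check=False).returncode


reset_gpu(0)  # dry run: only prints "nvidia-smi --gpu-reset --id=0"
```

Call reset_gpu(0, dry_run=False) to actually issue the reset, and check the returned exit code before relaunching any CUDA jobs.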

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow