The nvidia-smi output shows an uncorrectable ECC error on the device. You can reset the error count with nvidia-smi --reset-ecc-errors=0 -g 0 and retry (recent nvidia-smi versions select the GPU with -i rather than -g). The 0 passed to --reset-ecc-errors resets the volatile counter only; the aggregate counter will still show that an error has occurred in the past.
If the device reports further errors after the reset, it would be worth investigating the cause.
Note that in the summary view the ECC field you are looking at is actually "Volatile Uncorr. ECC", i.e. it's the error count not the ECC enabled/disabled flag. If ECC is disabled it will say "N/A".
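If you want to check both counters from a script rather than reading the summary view, something along these lines may help. It is only a sketch: it assumes your nvidia-smi supports the query fields ecc.errors.uncorrected.volatile.total and ecc.errors.uncorrected.aggregate.total (check nvidia-smi --help-query-gpu) and the -i GPU selector, and the helper names are made up for illustration. A non-numeric field such as N/A is returned as None, matching the "ECC disabled" case described above.

```python
import subprocess
from typing import Optional, Tuple

# Assumed query field names; confirm with `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = (
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


def parse_ecc_count(field: str) -> Optional[int]:
    """Return the error count, or None if the field is not a number (e.g. N/A)."""
    text = field.strip().strip("[]").strip()
    return int(text) if text.isdigit() else None


def parse_ecc_line(line: str) -> Tuple[Optional[int], Optional[int]]:
    """Split one CSV line into (volatile, aggregate) uncorrectable counts."""
    volatile, aggregate = (parse_ecc_count(f) for f in line.split(","))
    return volatile, aggregate


def query_ecc(gpu_index: int = 0) -> Tuple[Optional[int], Optional[int]]:
    """Run nvidia-smi for one GPU and parse its uncorrectable ECC counters."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            f"--query-gpu={QUERY_FIELDS}",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_line(result.stdout.strip().splitlines()[0])
```

Calling query_ecc(0) on a machine with the driver installed should give a (volatile, aggregate) pair. After a volatile reset you would expect the first value to drop back to 0 while the second stays where it was.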