Debugging Ubuntu / Nvidia / CUDA / PyTorch relations

Personal story: I hit a weird issue right in the middle of training: Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error. This had never happened to me before; I had been running long training sessions on this machine successfully for a few months. Even worse, the process python training.py got stuck and did not respond to kill, while the GPU fans ramped up to full speed and stayed there.

Here are my notes on resolving these issues.

PyTorch

Since recently (?), PyTorch binaries come with their own CUDA libraries. Unless you compile custom layers, you don’t need a local CUDA or cuDNN install. The GPU driver is still required.

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

Source:

ptrblck: binaries ship with their own libraries and will not use your locally installed CUDA toolkit unless you build PyTorch from source or a custom CUDA extension discuss.pytorch.org

Handy info from PyTorch:

import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
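The device calls above raise an error when PyTorch cannot see a GPU, which is exactly the state you are in while debugging a broken driver. Here is a minimal sketch of a guarded version. The function name torch_cuda_report is my own, not part of PyTorch; everything it calls is the same API as in the snippet above, plus torch.cuda.is_available() and torch.cuda.device_count().

```python
def torch_cuda_report():
    """Collect version info without crashing when torch or the GPU is missing."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    # Only touch the device API when PyTorch can actually see a GPU.
    if report["cuda_available"]:
        report["cudnn"] = torch.backends.cudnn.version()
        report["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


for key, value in torch_cuda_report().items():
    print(f"{key}: {value}")
```

If cuda_build shows a version but cuda_available is False, the PyTorch install is probably fine and the problem is more likely on the driver side.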

Drivers

There are plenty of instructions on driver installation. Basically, you have 3 options:

Install drivers with Ubuntu software update (by far the easiest way, available in 20.04)
Install with sudo ubuntu-drivers autoinstall or sudo apt install nvidia-driver-440
Download and install with the CUDA runfile (it comes with a driver now)

It is recommended to purge previously installed drivers first. However, my runfile (I chose option 3) still complained that it found something previously installed.

Handy commands:

ubuntu-drivers devices shows device specs and available drivers

Note on nvidia-smi

nvidia-smi (short for NVIDIA System Management Interface) reports “Driver Version” and “CUDA Version” even if no CUDA toolkit is installed. The “CUDA Version” is simply the highest CUDA version that the driver supports. Source: nvidia dev support

Useful nvidia-smi queries
sudo nvidia-smi --gpu-reset to reset gpu memory (from the PyTorch forum)
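For watching temperature, power draw and clocks from a script, nvidia-smi has a machine-readable mode (--query-gpu with --format=csv). Below is a rough sketch of how I would wrap it. The helper names parse_gpu_csv and query_gpus and the choice of fields are my own; run nvidia-smi --help-query-gpu for the full list of fields your driver supports.

```python
import shutil
import subprocess

# Fields understood by `nvidia-smi --query-gpu` (see `nvidia-smi --help-query-gpu`).
FIELDS = ["name", "driver_version", "temperature.gpu", "power.draw", "clocks.sm"]


def parse_gpu_csv(text, fields=FIELDS):
    """Turn `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        # Skip lines that do not have one value per requested field.
        if len(values) == len(fields):
            rows.append(dict(zip(fields, values)))
    return rows


def query_gpus(fields=FIELDS):
    """Return [] when nvidia-smi is missing, hangs, or fails (e.g. GPU is lost)."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(fields)}",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout, fields)


for gpu in query_gpus():
    print(gpu)
```

All values come back as strings; convert them to float yourself if you want to set alert thresholds. The timeout is there because nvidia-smi itself can hang when a GPU is in a bad state.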

CUDA

After installing, add to .bashrc:

export PATH="/usr/local/cuda-11.6/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH"
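To check that the two exports actually took effect in the shell you are working in, here is a small sketch. The function name check_cuda_env is made up, and the default path is the cuda-11.6 directory from the lines above, so adjust it to your install.

```python
import os
import shutil


def check_cuda_env(cuda_home="/usr/local/cuda-11.6", env=None):
    """Report whether the CUDA bin/ and lib64/ directories are on the search paths."""
    env = os.environ if env is None else env
    path_dirs = env.get("PATH", "").split(os.pathsep)
    lib_dirs = env.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return {
        "bin_on_path": os.path.join(cuda_home, "bin") in path_dirs,
        "lib64_on_ld_library_path": os.path.join(cuda_home, "lib64") in lib_dirs,
        # Full path of the nvcc that would be picked up, or None if not found.
        "nvcc": shutil.which("nvcc", path=env.get("PATH", "")),
    }


print(check_cuda_env())
```

If nvcc resolves to a different directory than the one you just added, an older toolkit is probably earlier on PATH.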

My issue

I’ll jump right to the solution: it’s either overheating or lack of power. A similar issue was discussed on the NVIDIA developer forum.

dmesg to read kernel log messages
sudo nvidia-bug-report.sh to get the full report. Decompress the file and enjoy.

generix: Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus. (docs) One of the gpus is shutting down. Since it’s not always the same one, I guess they’re not damaged but either overheating or lack of power occurs. Please monitor temperatures, check PSU, try limiting clocks using nvidia-smi -lgc.
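The Xid number is the useful bit in that log line, so it helps to pull these events out of dmesg output automatically. This is a minimal sketch; the regex is written against the single line quoted above, so treat the exact format as an assumption and adjust it if your kernel log looks different. The function name find_xid_events is my own.

```python
import re

# Written against the line quoted above:
#   NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(log_text):
    """Return one dict per Xid line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "pci": match["pci"],
                "xid": int(match["code"]),
                "detail": match["detail"].strip(),
            })
    return events


sample = (
    "Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, "
    "pid=1370, GPU has fallen off the bus."
)
for event in find_xid_events(sample):
    print(event)
```

To run it on a real log, save the output of dmesg to a file and pass the file contents to find_xid_events. Comparing the pci field across events tells you whether it is always the same card that drops out.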

Solution

I limited both power consumption and clock speed, and (I hope) it works now. I still hope to find the exact cause and fix it properly (more cooling or a better power supply).

nvidia-smi -lgc 1500 to set clock speed (note¹)
sudo nvidia-smi -pl 200 to limit power consumption to 200 W
nvidia-smi -q -d CLOCK to check clock speed

Note that these limits are not permanent and are reset after a reboot.
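Since the limits are lost on reboot, I find it convenient to keep them in a small script. The sketch below only builds and prints the two commands from the list above (a dry run). The function name limit_commands is hypothetical, the 200 W and 1500 MHz defaults are just the values I used, and I am assuming both commands need sudo. The -i flag selects a single GPU by index.

```python
import shlex


def limit_commands(power_watts=200, max_clock_mhz=1500, gpu_index=None):
    """Build the nvidia-smi invocations for the power and clock limits."""
    # Without -i, nvidia-smi applies the setting to all GPUs.
    target = [] if gpu_index is None else ["-i", str(gpu_index)]
    return [
        ["sudo", "nvidia-smi", *target, "-pl", str(power_watts)],
        ["sudo", "nvidia-smi", *target, "-lgc", str(max_clock_mhz)],
    ]


# Dry run: print the commands instead of executing them.
for cmd in limit_commands():
    print(shlex.join(cmd))
```

To actually apply the limits, pass each list to subprocess.run(cmd, check=True), for example from a script that runs at boot.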

Also

One fan maxed out because the two others are stuck
Some people reported issues between the motherboard and GPUs (PCIe lanes in a multi-GPU setup) [https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-000000-0-gpu-is-lost-reboot-the-system-to-recover-this-gpu/176891/6]
More discussion of error Xid 79 on the dev forum: 1 2 3

¹ There are reports that NVIDIA removed support for clock limits on commercial cards.