Debugging Ubuntu / NVidia / CUDA / PyTorch relations
Personal story: I got a weird issue right in the middle of training: Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error. This had never happened to me before; I had been training successfully on this machine for a few months, including long sessions. Even worse, the python training.py process got stuck and was unresponsive to kill, while the GPU fans ramped up to full speed and stayed there.
Here are my notes on resolving these issues.
PyTorch
For a while now, PyTorch binaries have shipped with their own CUDA libraries. Unless you compile custom layers, you don’t need a local CUDA toolkit or cuDNN installed. GPU drivers are still required.
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Source: ptrblck: binaries ship with their own libraries and will not use your locally installed CUDA toolkit unless you build PyTorch from source or a custom CUDA extension (discuss.pytorch.org)
Handy info from PyTorch:
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
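A quick sanity check (a minimal sketch, assuming a single visible GPU): confirm that the bundled CUDA runtime can actually talk to the driver and launch a kernel.

import torch

# True means the driver is visible to PyTorch's bundled CUDA runtime
print(torch.cuda.is_available())
print(torch.cuda.device_count())

# Tiny computation on the GPU to make sure kernels actually launch
x = torch.randn(512, 512, device="cuda")
print((x @ x.T).sum().item())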
Drivers
There are plenty of instructions on driver installation. Basically, you have 3 options:
- Install drivers with the Ubuntu software updater (by far the easiest way, available in 20.04)
- Install with sudo ubuntu-drivers autoinstall or sudo apt install nvidia-driver-440
- Download and install the CUDA runfile (it now bundles the driver)
It is recommended to remove previously installed drivers with apt purge; however, my runfile (I chose option 3) still complained that it found something previously installed.
Handy commands:
- ubuntu-drivers devices shows device specs and available drivers
Note on nvidia-smi
nvidia-smi (System Management Interface) reports a “Driver Version” and a “CUDA Version” even if no CUDA toolkit is installed. The CUDA version shown is simply the highest CUDA version the installed driver supports. Source: NVIDIA dev support
- Useful nvidia-smi queries (a monitoring sketch using them follows below)
- sudo nvidia-smi --gpu-reset to reset the GPU and clear its memory (from the PyTorch forum)
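For example, here is a minimal monitoring sketch (the polling loop itself is just an illustration; the field names are standard nvidia-smi --query-gpu fields) that prints temperature, power draw, and SM clock once per second:

import subprocess
import time

# Standard --query-gpu fields: GPU temperature, power draw, current SM clock
QUERY = "temperature.gpu,power.draw,clocks.sm"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    for gpu_id, line in enumerate(out.splitlines()):
        temp, power, clock = (v.strip() for v in line.split(","))
        print(f"GPU{gpu_id}: {temp} C, {power} W, {clock} MHz")
    time.sleep(1)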
CUDA
After installing, add to .bashrc:
export PATH="/usr/local/cuda-11.6/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH"
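If you do build custom CUDA extensions, you can check which local toolkit PyTorch’s build helpers pick up. A small sketch (CUDA_HOME is resolved from the environment and common install paths; the path in the comment is just the example from above):

import torch
from torch.utils.cpp_extension import CUDA_HOME

print(torch.version.cuda)  # CUDA version bundled with the PyTorch binary
print(CUDA_HOME)           # local toolkit used for extensions, e.g. /usr/local/cuda-11.6, or None if not found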
My issue
I’ll jump right to the solution: it’s either overheating or lack of power. A similar issue was discussed on the developer forum.
- dmesg to read kernel messages (a sketch for scanning them follows after the quote below)
- sudo nvidia-bug-report.sh to generate a full report; decompress the file and enjoy.
generix:
Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
(docs) One of the gpus is shutting down. Since it’s not always the same one, I guess they’re not damaged but either overheating or lack of power occurs. Please monitor temperatures, check PSU, try limiting clocks using nvidia-smi -lgc.
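A small convenience sketch (not from the forum) to pull just the NVRM/Xid lines out of the kernel log:

import subprocess

# dmesg may require root on systems with kernel.dmesg_restrict=1
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout

for line in log.splitlines():
    if "NVRM: Xid" in line:
        print(line)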
Solution
I limited both power consumption and clock speed, and (I hope) it works now. I still hope to find the exact cause and fix it properly (better cooling or a better power supply).
- nvidia-smi -lgc 1500 to lock the GPU clock at 1500 MHz (note 1)
- sudo nvidia-smi -pl 200 to limit power consumption to 200 W
- nvidia-smi -q -d CLOCK to check clock speeds
Note that those limits are not permanent and will be reset after reboot.
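Since the limits reset on reboot, a script like the sketch below (run as root, e.g. from root’s crontab @reboot or a systemd unit; the values are the ones from above) can reapply them at startup. Enabling persistence mode helps the settings survive the driver unloading between jobs.

import subprocess

POWER_LIMIT_W = 200   # same values as above; adjust to your card
GPU_CLOCK_MHZ = 1500

# Keep the driver loaded so the settings are not dropped when no client is attached
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)
subprocess.run(["nvidia-smi", "-pl", str(POWER_LIMIT_W)], check=True)
subprocess.run(["nvidia-smi", "-lgc", str(GPU_CLOCK_MHZ)], check=True)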
Also
- One fan maxed out because the two others were stuck
- Some have reported issues between the motherboard and GPUs (PCIe lanes in multi-GPU setups) [https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-000000-0-gpu-is-lost-reboot-the-system-to-recover-this-gpu/176891/6]
- More discussion of error Xid 79: dev forum 1 2 3