Personal story: I hit a weird issue right in the middle of training:
`Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error`. This had never happened to me before; I had been training successfully on this machine for a few months, including long sessions. Even worse, the `python training.py` process got stuck, unresponsive to `kill`, while the GPU fans ramped up to full speed and stayed there.
Here are my notes on resolving these issues.
For some time now, PyTorch binaries have shipped with their own CUDA libraries. Unless you compile custom layers, you don't need a local CUDA toolkit or cuDNN installation. GPU drivers are still required.
`conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch`
> ptrblck: "binaries ship with their own libraries and will not use your locally installed CUDA toolkit unless you build PyTorch from source or a custom CUDA extension" (discuss.pytorch.org)
Handy info from PyTorch:

```python
import torch

print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
```
There are plenty of instructions on driver installation. Basically, you have three options:
- Install drivers with Ubuntu software update (by far the easiest way, available in 20.04)
- Install with `sudo ubuntu-drivers autoinstall` or `sudo apt install nvidia-driver-440`
- Download and install with the CUDA runfile (runfiles now bundle the driver)
It is recommended to remove previously installed drivers with `purge`; however, my runfile (I chose option 3) still complained that it found remnants of a previous installation.
`ubuntu-drivers devices` shows device specs and available drivers.
Note on nvidia-smi
`nvidia-smi` (stands for System Management Interface) reports a "Driver Version" and a "CUDA Version" even if no CUDA toolkit is installed. The reported CUDA version is simply the highest CUDA version that the driver supports. Source: NVIDIA dev support
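If you want the driver version, temperatures, etc. programmatically, `nvidia-smi` has a machine-readable mode (`--query-gpu=... --format=csv`). A minimal parsing sketch; the sample output string below is made up for illustration, real values will differ:

```python
import csv
import io

def parse_smi_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (v.strip() for v in row))) for row in rows[1:]]

# In practice you would capture the real output of:
#   nvidia-smi --query-gpu=index,driver_version,temperature.gpu --format=csv
# Sample output (hypothetical values):
sample = """index, driver_version, temperature.gpu
0, 470.57.02, 64
1, 470.57.02, 71"""

for gpu in parse_smi_csv(sample):
    print(gpu["index"], gpu["driver_version"], gpu["temperature.gpu"])
```

This is handy for a cron job that logs temperatures while you chase an overheating suspicion.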
After installing, add the toolkit to your shell environment (e.g. in `~/.bashrc`):

```
export PATH="/usr/local/cuda-11.6/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.6/lib64:$LD_LIBRARY_PATH"
```
I'll jump right to the solution: it's either overheating or a lack of power. A similar issue was discussed on the developer forum.
`dmesg` to read kernel messages
`sudo nvidia-bug-report.sh` to generate a full report; decompress the file and enjoy.
```
Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
```

(docs) One of the GPUs is shutting down. Since it's not always the same one, I guess they're not damaged; instead, overheating or a lack of power is the culprit. The forum advice: monitor temperatures, check the PSU, and try limiting clocks using `nvidia-smi -lgc`.
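To catch these events without eyeballing logs, you can scan `dmesg` output for NVRM Xid lines. A small sketch using the log line above; the regex and function name are my own:

```python
import re

# NVRM Xid lines look like:
#   NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+),.*?, (.*)")

def find_xid_events(dmesg_text):
    """Return a (pci_address, xid_code, message) tuple per Xid line."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3).strip()))
    return events

log = "Jul 27 16:39:09 emano kernel: NVRM: Xid (PCI:0000:1a:00): 79, pid=1370, GPU has fallen off the bus."
print(find_xid_events(log))
```

Feed it `dmesg` output (e.g. via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`) and alert on code 79.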
I limited both power consumption and clock speed, and (I hope) it works now. I still hope to find the exact cause and fix it properly (more cooling or a better power supply).
`nvidia-smi -lgc 1500` to set the clock speed (note1)
`sudo nvidia-smi -pl 200` to limit power consumption to 200 W
`nvidia-smi -q -d CLOCK` to check clock speeds
Note that those limits are not permanent and will be reset after reboot.
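Since the limits reset on reboot, one way to make them stick is a small systemd unit that reapplies them at boot. A sketch only; the unit name, paths, and values are my own, so adjust them to your card:

```ini
# /etc/systemd/system/gpu-limits.service  (hypothetical name)
[Unit]
Description=Apply GPU power and clock limits
After=multi-user.target

[Service]
Type=oneshot
# Enable persistence mode, then reapply the limits from above
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -pl 200
ExecStart=/usr/bin/nvidia-smi -lgc 1500

[Install]
WantedBy=multi-user.target
```

Enable it once with `sudo systemctl enable gpu-limits.service`.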
- One fan maxed out because the two others are stuck
- Some reported issues between the motherboard and GPUs (PCIe lanes in multi-GPU setups) ([forum thread](https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-000000-0-gpu-is-lost-reboot-the-system-to-recover-this-gpu/176891/6))
- More discussion of the `XID 79` error on the dev forum: 1 2 3