Hi, I just noticed that my 7900 XTX hangs only when using python-pytorch-opt-rocm or python-pytorch-rocm from the official repositories; it doesn't hang when I run the same code inside a venv with PyTorch installed via pip (as instructed on the PyTorch website).
This is the python code:
import torch
x = torch.rand(2098, 1, device='cuda')
x # GPU hangs here
What happens here: a random tensor of shape 2098x1 is generated on the GPU, and then printing x (which copies the tensor back to the host) causes the GPU to hang.
It can also be triggered by
import torch
x = torch.rand(2098, 1, device='cpu') # generate the tensor on cpu first
x # This will print out x correctly
x.cuda() # GPU hangs here, when trying to copy the tensor to the GPU
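The snippet above can be wrapped into a small self-contained repro helper. This is only a sketch: it assumes a ROCm (or CUDA) build of PyTorch, and it falls back to CPU-only behaviour when torch or a GPU is missing, so it is safe to run anywhere. The torch.cuda.synchronize() call forces the asynchronous copy to complete, so a hang surfaces at that exact line rather than at a later print.

```python
import importlib

def repro(n=2098):
    """Reproduce the CPU -> GPU copy from the post, with an explicit sync."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return "torch not installed"
    x = torch.rand(n, 1)               # generate on the CPU first, as in the post
    if not torch.cuda.is_available():  # torch.cuda also fronts ROCm HIP devices
        return tuple(x.shape)          # no GPU: just confirm the tensor exists
    y = x.cuda()                       # the step that hangs on the repo packages
    torch.cuda.synchronize()           # block until the copy really finished
    return tuple(y.shape)

print(repro())
```

If the copy itself is what wedges the GPU, the helper never returns from synchronize(); on a working stack it returns (2098, 1).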
I can see in journalctl -f that this message appears very shortly after x.cuda() is executed:
Jun 06 01:52:16 tomahawk kernel: amdgpu: sq_intr: error, detail 0x00000000, type 2, sh 1, priv 1, wave_id 0, simd_id 0, wgp_id 0
Now, running the same code above inside a virtual environment, with PyTorch installed via pip as documented on the official PyTorch website, gives the expected result:
>>> x.cuda()
tensor([[0.4804],
        [0.3825],
        [0.5009],
        ...,
        [0.4668],
        [0.7279],
        [0.5362]], device='cuda:0')
Here is how the venv is created:
mkdir ~/venv
cd ~/venv
python -m venv test
source ~/venv/test/bin/activate
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3
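To confirm which stack each environment actually uses, the PyTorch build string and the ROCm (HIP) runtime it was compiled against can be printed from either interpreter. This sketch guards the import so it also runs where torch is absent; torch.version.hip is None on non-ROCm builds.

```python
import importlib

def stack_info():
    """Return (torch version, HIP runtime version) or None if torch is missing."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None  # torch not installed in this interpreter
    return (torch.__version__, getattr(torch.version, "hip", None))

print(stack_info())
```

Running this inside the venv and again with the system packages makes the version mismatch between the two stacks directly visible.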
As of now, the official PyTorch wheel is 2.7.1 built against ROCm 6.3, while Arch's pytorch is 2.7.0 built against ROCm 6.4.0.