
#1 Yesterday 04:08:48

astronomical
Member
Registered: Yesterday
Posts: 3

amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

I'm running a Framework Desktop with an AMD Ryzen AI Max+ 395 and 128GB of memory. I recently discovered that some ROCm workloads no longer work with linux-firmware-20251125. Downgrading to linux-firmware-20251111 (and its associated dependencies) fixes this.

I suspect the amdgpu firmware specifically, but I haven't tried downgrading individual firmware files.

EDIT: I should note that this also occurs with the AUR package llama.cpp-hip, on both the 6.17.9-arch1-1 and 6.12.59-1-lts kernels. As the root user, a different series of error messages appears in the journal and loading the model hangs. Downgrading just the linux-firmware-amdgpu package resolves the bug.
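
To check which firmware build a machine is on before trying the steps below, something like this might help. It is only a sketch: firmware_status is a made-up helper name, and its classification reflects nothing more than the package versions reported in this thread (20251125 bad on some hardware, 20251111 good).

```shell
# Hypothetical helper: classify a linux-firmware-amdgpu version string
# using only the reports collected in this thread.
firmware_status() {
  case "$1" in
    20251125-*) echo "reported-bad" ;;
    20251111-*) echo "reported-good" ;;
    *)          echo "unknown" ;;
  esac
}

# On Arch, `pacman -Q <pkg>` prints "<name> <version>", so a real call could be:
#   firmware_status "$(pacman -Q linux-firmware-amdgpu | awk '{print $2}')"
firmware_status 20251125-1
```

A "reported-bad" result does not mean a given machine is affected; at least one system in this thread runs 20251125-1 without problems.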

I can reproduce this with the following steps, whether run as a regular user, a privileged user, or root. For the gidmap to work, /etc/subgid needs lines like "${user}:${renderGID}:1" and "${user}:${videoGID}:1". You can skip this if you run the container with --privileged or share the devices some other way; the mapping only exists to share /dev/dri and /dev/kfd with the container.

podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma
toolbox create llama-rocm-6.4.4-rocwmma --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma -- --gidmap="+g39:@988" --gidmap="+g105:@984" --device /dev/dri --device /dev/kfd --group-add video --group-add render --security-opt seccomp=unconfined
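
The --gidmap options above depend on the /etc/subgid entries described earlier. A minimal sketch for generating those entries follows; subgid_line and group_gid are hypothetical helper names, and the group names render and video are assumed to exist on the host. The numbers 988 and 984 in the command above come from the poster's machine and will likely differ elsewhere.

```shell
# Hypothetical helper: print the /etc/subgid entry that delegates exactly
# one host GID (count = 1) to a user.
subgid_line() {
  printf '%s:%s:1\n' "$1" "$2"
}

# Look up a group's numeric GID; prints nothing if the lookup fails.
group_gid() {
  getent group "$1" | cut -d: -f3
}

# Print candidate entries for the current user's render and video groups.
for grp in render video; do
  gid=$(group_gid "$grp")
  if [ -n "$gid" ]; then
    subgid_line "${USER:-$(id -un)}" "$gid"
  fi
done
```

The output is only printed. Review it and add it to /etc/subgid as root if it looks right.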

set -o allexport
: "${LLAMA_ARG_CTX_SIZE:=32768}"
: "${LLAMA_ARG_FLASH_ATTN:=1}"
: "${LLAMA_ARG_NO_MMAP:=1}"
: "${LLAMA_ARG_DEVICE:=ROCm0}"
: "${LLAMA_ARG_N_GPU_LAYERS:=999}"
: "${LLAMA_LOG_COLORS:=1}"
: "${LLAMA_LOG_VERBOSITY:=1}"
: "${LLAMA_LOG_PREFIX:=llama.cpp}"
: "${LLAMA_LOG_TIMESTAMPS:=1}"

: "${LLAMA_ARG_CONTEXT_SHIFT:=1}"
: "${LLAMA_ARG_HOST:=0.0.0.0}"
: "${LLAMA_ARG_PORT:=8080}"
: "${LLAMA_ARG_NO_WEBUI:=1}"
: "${LLAMA_ARG_CACHE_REUSE:=512}"
: "${LLAMA_ARG_ENDPOINT_PROPS:=1}"
: "${LLAMA_ARG_ENDPOINT_SLOTS:=1}"
: "${LLAMA_ARG_JINJA:=1}"
set +o allexport

llama-server -kvu --cache-reuse 256 --model ~/.local/share/models/bartowski/TheDrummer_Behemoth-X-123B-v2-GGUF/TheDrummer_Behemoth-X-123B-v2-Q5_K_M/TheDrummer_Behemoth-X-123B-v2-Q5_K_M-00001-of-00003.gguf 
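
For anyone unfamiliar with the set -o allexport / : "${VAR:=default}" pattern used above, here is a small sketch of how it behaves. show_port is a made-up function for illustration; the assumption is that llama-server picks up the LLAMA_ARG_* variables from its environment.

```shell
# Sketch of the default-with-override idiom from the steps above.
show_port() {
  set -o allexport              # export every variable assigned from here on
  : "${LLAMA_ARG_PORT:=8080}"   # assign 8080 only if the variable is unset or empty
  set +o allexport
  # A child process sees the value because it carries the export attribute.
  sh -c 'printf "%s\n" "$LLAMA_ARG_PORT"'
}

( unset LLAMA_ARG_PORT; show_port )          # prints 8080 (default applies)
( export LLAMA_ARG_PORT=9090; show_port )    # prints 9090 (caller's value wins)
```

This means any of the settings can be overridden by exporting the variable before running the script, without editing the script itself.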

The issue appears when llama.cpp loads a large model. An error like the following is printed:

Memory access fault by GPU node-1 (Agent handle: 0x40761df0) on address 0x7fd5d2d94000. Reason: Page not present or supervisor privilege.

Here's a log snippet:

Nov 28 02:33:27 arch-framework-d systemd[955]: Started podman-1760.scope.
Nov 28 02:33:27 arch-framework-d podman[1760]: 2025-11-28 02:33:27.818011349 -0700 MST m=+0.042534024 container exec c0d16d1b261124291c8ce30360f5f2eaf53ee3ab9851190f0ba2463ea2868091 (image=docker.io/kyuz0/amd-strix-halo-toolboxes:rocm->
Nov 28 02:33:30 arch-framework-d kernel: amdxdna 0000:c4:00.1: [drm] *ERROR* amdxdna_drm_open: SVA bind device failed, ret -19
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32770)
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:  Process llama-server pid 1810 thread llama-server pid 1810
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:   in page starting at address 0x00007fa25ab81000 from client 10
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          Faulty UTCL2 client ID: CPF (0x4)
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          MORE_FAULTS: 0x0
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          WALKER_ERROR: 0x1
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          MAPPING_ERROR: 0x1
Nov 28 02:33:35 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          RW: 0x0
Nov 28 02:33:35 arch-framework-d systemd-coredump[1849]: Process 1810 (llama-server) of user 1000 terminated abnormally with signal 6/ABRT, processing...
Nov 28 02:33:35 arch-framework-d systemd[1]: Created slice Slice /system/systemd-coredump.
Nov 28 02:33:35 arch-framework-d systemd[1]: Started Process Core Dump (PID 1849/UID 0).
Nov 28 02:33:36 arch-framework-d (sd-parse-elf)[1852]: Could not parse number of program headers from core file: invalid `Elf' handle
Nov 28 02:33:36 arch-framework-d systemd-coredump[1850]: [?] Process 1810 (llama-server) of user 1000 dumped core.
                                                         
                                                         Module /usr/local/lib64/libggml-hip.so.0.9.4 without build-id.
                                                         Module /usr/local/lib64/libggml-hip.so.0.9.4
                                                         Stack trace of thread 1812:
                                                         #0  0x00007fa29861d3cc n/a (/usr/lib64/libc.so.6 + 0x743cc)
                                                         #1  0x00007fa2985c318e n/a (/usr/lib64/libc.so.6 + 0x1a18e)
                                                         #2  0x00007fa2985aa6d0 n/a (/usr/lib64/libc.so.6 + 0x16d0)
                                                         #3  0x00007fa25ae56f83 n/a (/opt/rocm-6.4.4/lib/libhsa-runtime64.so.1.15.60404 + 0x8ff83)
                                                         #4  0x00007fa25ae55d67 n/a (/opt/rocm-6.4.4/lib/libhsa-runtime64.so.1.15.60404 + 0x8ed67)
                                                         #5  0x00007fa25adf989d n/a (/opt/rocm-6.4.4/lib/libhsa-runtime64.so.1.15.60404 + 0x3289d)
                                                         #6  0x00007fa29869e5ac n/a (/usr/lib64/libc.so.6 + 0xf55ac)
                                                         ELF object binary architecture: AMD x86-64
Nov 28 02:33:36 arch-framework-d systemd[1]: systemd-coredump@0-1-1849_1850-0.service: Deactivated successfully.
Nov 28 02:33:36 arch-framework-d systemd[1]: systemd-coredump@0-1-1849_1850-0.service: Consumed 1.086s CPU time, 1.2G memory peak.

EDIT: Issue as root user with AUR llama.cpp-hip package on LTS:

Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:8 pasid:32770)
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:  in process llama-server pid 1038 thread llama-server pid 1038)
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:   in page starting at address 0x00007d5f3bf2c000 from client 10
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00800932
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          Faulty UTCL2 client ID: CPF (0x4)
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          MORE_FAULTS: 0x0
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          WALKER_ERROR: 0x1
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          MAPPING_ERROR: 0x1
Nov 28 21:32:16 arch-framework-d kernel: amdgpu 0000:c3:00.0: amdgpu:          RW: 0x0
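
Both logs report the same status word, 0x00800932. As a rough sketch, the fields the kernel prints can be recovered from it with shifts and masks. The bit positions below are an assumption: they reproduce every field shown in the logs above for this value, but they should be checked against the amdgpu register headers before being relied on. decode_fault_status is a made-up name.

```shell
# Decode a GCVM_L2_PROTECTION_FAULT_STATUS value into the fields amdgpu logs.
# ASSUMPTION: the bit layout is inferred, not taken from an authoritative source.
decode_fault_status() {
  s=$(( $1 ))
  printf 'MORE_FAULTS: 0x%x\n'       $((  s        & 0x1   ))
  printf 'WALKER_ERROR: 0x%x\n'      $(( (s >> 1)  & 0x7   ))
  printf 'PERMISSION_FAULTS: 0x%x\n' $(( (s >> 4)  & 0xf   ))
  printf 'MAPPING_ERROR: 0x%x\n'     $(( (s >> 8)  & 0x1   ))
  printf 'CID: 0x%x\n'               $(( (s >> 9)  & 0x1ff ))
  printf 'RW: 0x%x\n'                $(( (s >> 18) & 0x1   ))
}

decode_fault_status 0x00800932
```

With the value from the logs, CID comes out as 0x4, which the kernel labels CPF, and RW as 0x0, matching the lines above.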

I'm not sure where I would upload the backtrace, and I would rather not if this is already a known bug. I also don't know where to report it (if it's even a bug).

Is anyone else seeing this, or does anyone know what might have caused it?

Last edited by astronomical (Yesterday 04:39:05)

Offline

#2 Yesterday 07:53:34

noctavian
Member
Registered: 2013-07-11
Posts: 19

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

My main desktop has a Ryzen 9 7900 CPU, and its iGPU is the primary GPU on the system. After upgrading to linux-firmware-amdgpu 20251125-1, gnome-shell started crashing so often that I couldn't use the system. I tried downgrading the kernel to 6.17.8 and also tried the LTS kernel 6.12.59, but it still crashed. Downgrading to linux-firmware-amdgpu 20251111-1 fixed it.

My laptop with an older Ryzen 5 5600U CPU and its associated iGPU works fine with linux-firmware-amdgpu 20251125-1, no crashes whatsoever so far.

At first glance, based on your log, your issue seems different from mine.

There's also a bug report on the Freedesktop GitLab about some issue with linux-firmware-amdgpu 20251125, but the report is bare bones and seems to be a different issue.

Offline

#3 Yesterday 10:49:54

Mee
Member
Registered: Yesterday
Posts: 3

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

Got my Framework Desktop today and just ran into this myself. Downgrading linux-firmware did fix it.

Offline

#4 Yesterday 13:33:02

ptr1337
Member
Registered: 2024-09-24
Posts: 13

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

Hi,

Could you please bisect through the linux-firmware commits and then report it to the linux-firmware issue tracker?
Then AMD can look into this.

Offline

#5 Yesterday 18:11:00

astronomical
Member
Registered: Yesterday
Posts: 3

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

ptr1337 wrote:

Hi,

Could you please bisect through the linux-firmware commits and then report it to the linux-firmware issue tracker?
Then AMD can look into this.

I can try to do that. If you have any advice on bisecting it on Arch Linux, I could definitely use it. I'll probably just need to edit the existing PKGBUILD; hopefully it's easy to swap out commits.

Offline

#6 Yesterday 19:18:08

loqs
Member
Registered: 2014-03-06
Posts: 18,678

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

Either follow Bisecting bugs with Git, or use the variation below, which builds in a clean chroot (requires devtools):

$ pkgctl repo clone --protocol https linux-firmware # fetch linux-firmware PKGBUILD
$ cd linux-firmware/
$ makepkg -Codd --noprepare # fetch and extract source skipping prepare and dependency installation
$ cd src/linux-firmware/
$ git bisect start
$ git bisect bad 20251125
status: waiting for both good and bad commits
status: waiting for good commit(s), bad commit known
$ git bisect good 20251111
Bisecting: 51 revisions left to test after this (roughly 6 steps)
[7b7e771fb2f7db060b268f26d5392041ac205f07] amdgpu: update navi10 firmware
$ cd ../..

Replace:

source=("git+$url.git?signed#tag=${pkgver}")
b2sums=('690a2580d5f4fc8bafb1664c2ae39818a152a2dcdc34f345f8a1abe69ef473769da4556e80dfb50bbba05bc6a0b4186f2577b52020d1ad1bc1f4bcd1ae048453')

with:

source=("git+$url.git#commit=7b7e771fb2f7db060b268f26d5392041ac205f07")
b2sums=('SKIP')

and add a pkgver function:

pkgver() {
  cd "${pkgbase}"

  # Commit date (UTC) + short rev
  echo "$(TZ=UTC git show -s --pretty=%cd --date=format-local:%Y%m%d HEAD).$(git rev-parse --short HEAD)"
}

Then

$ pkgctl build # to build the package

Then install either just the amdgpu subpackage or all of the required ones, and test. Then cd back into src/linux-firmware (inside the linux-firmware directory cloned above) and report the result with git bisect good or git bisect bad to get the next commit ID. Copy that into the PKGBUILD and repeat from the build step until git has located the causal commit.
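
Copying the commit ID into the PKGBUILD by hand at every step is easy to get wrong. A minimal sketch of a helper for that step follows. set_bisect_commit is a made-up name, and it assumes the source line already uses the "#commit=<hash>" form shown above.

```shell
# Hypothetical helper: point the PKGBUILD's git source at a given commit.
# Usage: set_bisect_commit <path-to-PKGBUILD> <commit-hash>
set_bisect_commit() {
  pkgbuild=$1
  commit=$2
  tmp=$(mktemp) || return 1
  # Rewrite only the hash that follows "#commit=", then copy the result back.
  sed "s|#commit=[0-9a-f]*|#commit=${commit}|" "$pkgbuild" > "$tmp" &&
    cat "$tmp" > "$pkgbuild"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# From the directory containing the PKGBUILD, after `git bisect good` or
# `git bisect bad` has checked out the next candidate:
#   set_bisect_commit PKGBUILD "$(git -C src/linux-firmware rev-parse HEAD)"
```

After that, pkgctl build can be run again as in the step above.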

Last edited by loqs (Yesterday 19:21:46)

Offline

#7 Yesterday 19:30:12

wwmm
Member
From: Rio de Janeiro, Brazil
Registered: 2010-08-14
Posts: 29

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

I've opened an issue about this upstream: https://gitlab.freedesktop.org/drm/amd/-/issues/4737. I'm not sure I'll have time to prepare a bisect today, though.

Offline

#8 Yesterday 19:56:05

Ekim0789
Member
Registered: Yesterday
Posts: 1

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

I also have a Framework Desktop with an AMD Ryzen AI Max+ 395 and 128GB of memory. I ran into the same issue, and downgrading only linux-firmware-amdgpu to 20251111 fixed it for me.

Offline

#9 Yesterday 21:55:47

Mee
Member
Registered: Yesterday
Posts: 3

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

I've done a bisect and opened an issue: https://gitlab.freedesktop.org/drm/amd/-/issues/4738

Offline

#10 Yesterday 22:22:20

astronomical
Member
Registered: Yesterday
Posts: 3

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

I also bisected it to commit:

85173cf441ec93d8947ec7ddcdd1648a720dcf8d

The BISECT_LOG:

git bisect start '--' 'amdgpu'
# status: waiting for both good and bad commits
# good: [6fc940781a013ad837ed8fea326d2b897467bbc3] Merge branch 'robot/patch-0-1762826844' into 'main'
git bisect good 6fc940781a013ad837ed8fea326d2b897467bbc3
# status: waiting for bad commit, 1 good commit known
# bad: [4ee5122b3f58e4c07951746c4425e2f4f42e860f] Merge branch 'robot/pr-0-1764088604' into 'main'
git bisect bad 4ee5122b3f58e4c07951746c4425e2f4f42e860f
# bad: [7b7e771fb2f7db060b268f26d5392041ac205f07] amdgpu: update navi10 firmware
git bisect bad 7b7e771fb2f7db060b268f26d5392041ac205f07
# bad: [d9f867fa205b1fee4c408da66a4270f41a9eb5e1] amdgpu: update VCN 4.0.6 firmware
git bisect bad d9f867fa205b1fee4c408da66a4270f41a9eb5e1
# good: [cf102ce55053617d58e3c8aad7947990970f820c] amdgpu: update VCN 4.0.2 firmware
git bisect good cf102ce55053617d58e3c8aad7947990970f820c
# good: [e281c8286079630e558243b3ef4eedabfd3f6f8c] amdgpu: update GC 11.0.4 firmware
git bisect good e281c8286079630e558243b3ef4eedabfd3f6f8c
# bad: [85173cf441ec93d8947ec7ddcdd1648a720dcf8d] amdgpu: update GC 11.5.1 firmware
git bisect bad 85173cf441ec93d8947ec7ddcdd1648a720dcf8d
# good: [4f77f6c847761a6b21ead76b63945bdc24c085da] amdgpu: update PSP 13.0.11 firmware
git bisect good 4f77f6c847761a6b21ead76b63945bdc24c085da
# first bad commit: [85173cf441ec93d8947ec7ddcdd1648a720dcf8d] amdgpu: update GC 11.5.1 firmware
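
When filing the upstream report, the culprit hash can be pulled out of a log like the one above with a one-liner. This is just a sketch; first_bad_commit is a made-up helper name, and it relies on the "# first bad commit: [hash] subject" line format seen in the log.

```shell
# Hypothetical helper: print the hash from the "first bad commit" line of a
# git bisect log read on standard input.
first_bad_commit() {
  sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p'
}

# Example using the final line of the log above:
printf '%s\n' '# first bad commit: [85173cf441ec93d8947ec7ddcdd1648a720dcf8d] amdgpu: update GC 11.5.1 firmware' | first_bad_commit
```

In the bisected repository, the same function can be fed directly with git bisect log | first_bad_commit.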

EDIT: I used the following PKGBUILD and the following command-line to test it:

$ updpkgsums \
  && pkgctl build \
  && scp linux-firmware-amdgpu*.pkg.tar.zst "$FRAMEWORK":~/firmware
$ ssh "$FRAMEWORK"
$ sudo pacman -U ~/firmware/linux-firmware-amdgpu*.pkg.tar.zst \
  && sleep 5 \
  && systemctl reboot

Edits made to the PKGBUILD (via git diff --patch PKGBUILD):

* Added a pkgver() function so the version is computed correctly
* Added a _bisect variable to track the commit under test

diff --git a/PKGBUILD b/PKGBUILD
index ac7fedf..f0eb422 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -1,6 +1,7 @@
 # Maintainer: Tobias Powalowski <tpowa@archlinux.org>
 # Contributor: Thomas Bächler <thomas@archlinux.org>
 
+_bisect=4f77f6c847761a6b21ead76b63945bdc24c085da
 pkgbase=linux-firmware
 pkgname=(
   linux-firmware
@@ -25,7 +26,7 @@ pkgname=(
   linux-firmware-radeon
   linux-firmware-realtek
 )
-pkgver=20251125
+pkgver=20251111.r34.g4f77f6c
 pkgrel=1
 pkgdesc="Firmware files for Linux"
 url="https://gitlab.com/kernel-firmware/linux-firmware"
@@ -41,12 +42,17 @@ options=(
   !debug
   !strip
 )
-source=("git+$url.git?signed#tag=${pkgver}")
-b2sums=('690a2580d5f4fc8bafb1664c2ae39818a152a2dcdc34f345f8a1abe69ef473769da4556e80dfb50bbba05bc6a0b4186f2577b52020d1ad1bc1f4bcd1ae048453')
+source=("git+$url.git#commit=$_bisect")
+b2sums=('8a80c170c48c9a6cf0de876e32235ac23faa83c013e32a104f23d0af80528b23c4dc812e353a8594147371153260bece352618792176ba03f1a05fcef687d603')
 validpgpkeys=(
   4CDE8575E547BF835FE15807A31B6BD72486CFD6 # Josh Boyer <jwboyer@fedoraproject.org>
 )
 
+pkgver() {
+  cd "$pkgname"
+  git describe --long --abbrev=7 | sed 's/\([^-]*-g\)/r\1/;s/-/./g'
+}
+
 _backports=(
 )
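
For reference, here is what the sed expression in pkgver() does to the output of git describe. The input string below is an assumed example (tag 20251111, 34 commits later, abbreviated hash 4f77f6c), chosen to be consistent with the pkgver line in the diff. describe_to_pkgver is a made-up name.

```shell
# The sed expression from pkgver() above, applied to a sample
# `git describe --long` string instead of a live repository.
describe_to_pkgver() {
  # First substitution prefixes the commit count with "r";
  # the second turns every "-" into ".".
  sed 's/\([^-]*-g\)/r\1/;s/-/./g'
}

printf '%s\n' '20251111-34-g4f77f6c' | describe_to_pkgver
# prints 20251111.r34.g4f77f6c
```

The result sorts after the plain 20251111 tag and before 20251125, which keeps pacman's version comparison sensible while stepping through commits.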

Last edited by astronomical (Yesterday 22:55:25)

Offline

#11 Yesterday 23:44:55

loqs
Member
Registered: 2014-03-06
Posts: 18,678

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

As a cross check if you add 85173cf441ec93d8947ec7ddcdd1648a720dcf8d to the reverts array with the source set back to `source=("git+$url.git?signed#tag=${pkgver}")` can you reproduce the issue?

Offline

#12 Today 12:20:49

Mee
Member
Registered: Yesterday
Posts: 3

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

loqs wrote:

As a cross check if you add 85173cf441ec93d8947ec7ddcdd1648a720dcf8d to the reverts array with the source set back to `source=("git+$url.git?signed#tag=${pkgver}")` can you reproduce the issue?

Just tried it: a build with that commit reverted has no issues with ROCm.

Offline

#13 Today 14:44:45

loqs
Member
Registered: 2014-03-06
Posts: 18,678

Re: amdgpu rocm GCVM_L2_PROTECTION_FAULT_STATUS in linux-firmware-20251125

To help with the information requested in https://gitlab.freedesktop.org/drm/amd/ … te_3215452

6.17.9 contains https://github.com/torvalds/linux/commi … f9d9d98bd1 as https://git.kernel.org/pub/scm/linux/ke … e4e0efd5d8

Diff adding the ROCm patch:

diff --git a/PKGBUILD b/PKGBUILD
index bb148a5..63a6db7 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -14,14 +14,20 @@ depends=(
 )
 makedepends=('cmake')
 source=("rocm-$pkgver.tar.gz::https://github.com/ROCm/rocm-systems/archive/refs/tags/rocm-$pkgver.tar.gz"
+        "https://github.com/ROCm/rocm-systems/commit/770f30bc4c72d763742e39932e2c0583813d531f.patch"
         'rocm-ld.conf'
         'rocm-profile.sh')
 b2sums=('51df17b0724bb588a97cc7cd0fad51fb3e2173ecd53783bdec0f7e7134ab1a772431bcc41d467ba3b1085be6f6ef33e83af948a3c5879f0db243c67bc04e0d1e'
+        '2f7239efd8759bf75cd3293112d156323e15bf6671843a43a69c1668a2c53c7161e051461500acce10778f6f36a873bb25aa5b99bf4bb00bf4425ea9def3e0c0'
         'd045c357d8e7e8a4840ab137404f12cd08419444ffc478046c13ed3bd13a5d33358c1443bf76ee571a7a062454e2bdda1a5507a70edbd001bce004f18775e4b2'
         '4372bcbe97d7c95d4918ad4beacc4fe9bfc8bfb8cafcf08d9ebbcba7df3e3bf535ff51f90c2d0f653858b0ae03b108ac3cb32b61b4ecac3abb609acc06be3ee3')
 url='https://rocm.docs.amd.com/en/latest/'
 _projectDir="rocm-systems-rocm-$pkgver/projects/$pkgname"
 
+prepare() {
+  patch -Np1 -i ../770f30bc4c72d763742e39932e2c0583813d531f.patch -d rocm-systems-rocm-$pkgver
+}
+
 build() {
   local cmake_args=(
     -Wno-dev

Offline
