VFIO gpu passthrough kernel regression

SimonP · 2024-12-09 19:04:59

@gromit could you build a 6.12 kernel with the patch below for testing?

From 769baa867246228d89936fc99193b04b7227a7a8 Mon Sep 17 00:00:00 2001
From: SimonP <xxx@xxx.com>
Date: Mon, 9 Dec 2024 18:23:12 +0100
Subject: [PATCH] Revert "KVM: SVM: Disallow guest from changing userspace's
 MSR_AMD64_DE_CFG value"

This reverts commit 74a0e79df68a8042fb84fd7207e57b70722cf825.
---
 arch/x86/kvm/svm/svm.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dd15cc635655..c088d4241fff 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3201,13 +3201,8 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		if (data & ~supported_de_cfg)
 			return 1;
 
-		/*
-		 * Don't let the guest change the host-programmed value.  The
-		 * MSR is very model specific, i.e. contains multiple bits that
-		 * are completely unknown to KVM, and the one bit known to KVM
-		 * is simply a reflection of hardware capabilities.
-		 */
-		if (!msr->host_initiated && data != svm->msr_decfg)
+		/* Don't allow the guest to change a bit, #GP */
+		if (!msr->host_initiated && (data ^ supported_de_cfg))
 			return 1;
 
 		svm->msr_decfg = data;
-- 
2.45.2

Last edited by SimonP (2024-12-09 19:05:12)

kevinlpowell · 2024-12-09 20:04:55

a few possibly interesting updates:

Kernel 7056c4e turns out to be 'mostly bad' rather than always bad. For me, there is about a 1 in 20 chance that I will be able to start my Windows guest on 7056c4e. This seems to be decided when the Linux host boots.

I tried the kvm.enable_virt_at_load=0 kernel parameter with mixed results. On 7056c4e, that parameter significantly increases the chances that I'll be able to start the Windows guest. (3 out of 3 successful trials). However on later kernels (e.g. the arch 6.12 package), I'm still unable to start my Windows guest and kvm.enable_virt_at_load=0 does not seem to have any effect.

@physicsBTW when I encounter this problem, the VMs appear to be hard crashing. I'm checking for alive-ness with ssh rather than RDP (as I already have sshd set up on the Windows guest). As far as I can tell, the guest OS never actually boots.

Ranguvar · 2024-12-09 21:21:19

SimonP wrote:

@gromit could you build a 6.12 kernel with the patch below for testing?
</snip>

I started building this over 6.12.4-zen1-1. In a few hours I can test.
Please let me know if you'd like me to test any other built kernels while I have everything down.

Kevin, I have the same result on 6.12 or later: kvm.enable_virt_at_load=0 didn't help me.
Good idea checking ssh to Windows.

Thank you for all of your contributions.

Last edited by Ranguvar (2024-12-09 21:23:20)

physicsBTW · 2024-12-09 22:51:27

@loqs @kevinlpowell
1) I tried kvm.enable_virt_at_load=0 on a problem kernel and verified with /proc/cmdline that it was applied. It had no effect. I removed this parameter after testing it.

2) I reinstalled gromit's copy of 7056c4e2a13a and verified it with:

uname -r
6.11.0-rc7-1-mainline-00107-g7056c4e2a13a

I made three VM load attempts, shutting down my computer, unplugging it, and holding down the power button for 10 seconds between each one. The VM loaded correctly on all three attempts.
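
The two checks described above (confirming the parameter in /proc/cmdline and confirming the running build with uname -r) can be sketched as small shell helpers. This is only a sketch: the function names has_param and release_has_commit are hypothetical, and the whole-word match assumes the parameters are separated by spaces with no quoting.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not from the thread.

# has_param CMDLINE PARAM: succeed if PARAM appears as a whole
# space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# release_has_commit RELEASE HASH: succeed if the kernel release string
# embeds the abbreviated commit as "-g<HASH>", as in the uname -r output above.
release_has_commit() {
    case "$1" in
        *"-g$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on the running host:
#   has_param "$(cat /proc/cmdline)" kvm.enable_virt_at_load=0 && echo "parameter applied"
#   release_has_commit "$(uname -r)" 7056c4e2a13a && echo "running the expected build"
```

Note that the parameter has to be written without spaces around the equals sign for a check like this to match.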

Testing Simon's patch now.

Last edited by physicsBTW (2024-12-10 01:30:38)

kevinlpowell · 2024-12-10 01:45:28

@SimonP: good news from my machine. I applied your patch on v6.12, resulting in

uname -a
Linux orochi 6.12.0-1-git-dirty #15 SMP PREEMPT_DYNAMIC Tue, 10 Dec 2024 00:09:14 +0000 x86_64 GNU/Linux

and my Windows guest with GPU passthrough started right up. Thanks for that.

physicsBTW · 2024-12-10 01:59:00

@SimonP
commit 74a0e79df68a8042fb84fd7207e57b70722cf825 loads in for me without patching.
I'm thinking that we are having separate issues.

gromit · 2024-12-10 03:01:17

How did you revert/build the commit? When I revert it on top of 6.12.y I get the following compile error:

arch/x86/kvm/svm/svm.c: In function ‘svm_set_msr’:
arch/x86/kvm/svm/svm.c:3203:53: error: ‘msr_entry’ undeclared (first use in this function); did you mean ‘psc_entry’?
 3203 |                 if (!msr->host_initiated && (data ^ msr_entry.data))
      |                                                     ^~~~~~~~~
      |                                                     psc_entry

This is with the following revert patch:

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9df3e1e5ae81..759cd4326d2a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3199,13 +3199,8 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 		if (data & ~supported_de_cfg)
 			return 1;
 
-		/*
-		 * Don't let the guest change the host-programmed value.  The
-		 * MSR is very model specific, i.e. contains multiple bits that
-		 * are completely unknown to KVM, and the one bit known to KVM
-		 * is simply a reflection of hardware capabilities.
-		 */
-		if (!msr->host_initiated && data != svm->msr_decfg)
+		/* Don't allow the guest to change a bit, #GP */
+		if (!msr->host_initiated && (data ^ msr_entry.data))
 			return 1;
 
 		svm->msr_decfg = data;
-- 
2.47.1

physicsBTW · 2024-12-10 04:32:38

git bisect start
# status: waiting for both good and bad commits
# bad: [adc218676eef25575469234709c2d87185ca223a] Linux 6.12
git bisect bad adc218676eef25575469234709c2d87185ca223a
# status: waiting for good commit(s), bad commit known
# good: [98f7e32f20d28ec452afb208f9cffc08448a2652] Linux 6.11
git bisect good 98f7e32f20d28ec452afb208f9cffc08448a2652
# good: [509d2cd12a10d057fdf72f565b930f9a81140d59] Merge tag 'Smack-for-6.12' of https://github.com/cschaufler/smack-next
git bisect good 509d2cd12a10d057fdf72f565b930f9a81140d59
# bad: [356a0319456810f3a5618353f6ca3b0ef9965479] Merge tag 'tty-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad 356a0319456810f3a5618353f6ca3b0ef9965479
# bad: [3a37872316c2e3288e09a1322221c83e5929768d] Merge tag 'pci-v6.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
git bisect bad 3a37872316c2e3288e09a1322221c83e5929768d
# bad: [440b65232829fad69947b8de983c13a525cc8871] Merge tag 'bpf-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect bad 440b65232829fad69947b8de983c13a525cc8871
# bad: [617a814f14b8914271f7a70366d72c6196d17663] Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 617a814f14b8914271f7a70366d72c6196d17663
# good: [775d28fd45a2f5e58d8927ce77693398b1f074a6] mm: remove isolate_lru_page()
git bisect good 775d28fd45a2f5e58d8927ce77693398b1f074a6
# good: [7056c4e2a13a61f4e8a9e8ce27cd499f27e0e63b] Merge tag 'kvm-x86-generic-6.12' of https://github.com/kvm-x86/linux into HEAD
git bisect good 7056c4e2a13a61f4e8a9e8ce27cd499f27e0e63b
# good: [c09dd2bb5748075d995ae46c2d18423032230f9b] Merge branch 'kvm-redo-enable-virt' into HEAD
git bisect good c09dd2bb5748075d995ae46c2d18423032230f9b
# good: [55e6f8f29d6ac76126ad1c8000b4c3626cf4b176] Merge tag 'kvm-x86-svm-6.12' of https://github.com/kvm-x86/linux into HEAD
git bisect good 55e6f8f29d6ac76126ad1c8000b4c3626cf4b176
# good: [056f8c437dc33e9e8e64b9344e816d7d46c06c16] Merge tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
git bisect good 056f8c437dc33e9e8e64b9344e816d7d46c06c16
# good: [df7e1286b1dc3d6cff952cf3ef4f0c36831e2fbb] mm: care about shadow stack guard gap when getting an unmapped area
git bisect good df7e1286b1dc3d6cff952cf3ef4f0c36831e2fbb
# good: [5731aacd54a883dd2c1a5e8c85e1fe78fc728dc7] KVM: use follow_pfnmap API
git bisect good 5731aacd54a883dd2c1a5e8c85e1fe78fc728dc7
# good: [658be46520ce480a44fe405730a1725166298f27] mm: support poison recovery from copy_present_page()
git bisect good 658be46520ce480a44fe405730a1725166298f27
# bad: [325efb16da2c840e165d9b620fec8049d4d664cc] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
git bisect bad 325efb16da2c840e165d9b620fec8049d4d664cc
# bad: [659c55ef981bb63355a65ffc3b3b5cad562b806a] mm/vma: return the exact errno in vms_gather_munmap_vmas()
git bisect bad 659c55ef981bb63355a65ffc3b3b5cad562b806a
# bad: [f2c5101be43677c227974912a043da29a62743ef] memcg: cleanup with !CONFIG_MEMCG_V1
git bisect bad f2c5101be43677c227974912a043da29a62743ef
# bad: [fd00be9afa1d64c90ae20a1307da1bdb809b3d55] mm/show_mem.c: report alloc tags in human readable units
git bisect bad fd00be9afa1d64c90ae20a1307da1bdb809b3d55
# first bad commit: [fd00be9afa1d64c90ae20a1307da1bdb809b3d55] mm/show_mem.c: report alloc tags in human readable units

I'm checking whether reverting fd00be9afa1d64c90ae20a1307da1bdb809b3d55 does anything, but this result seems a little strange.
Edit: I think I made a wrong turn somewhere, or hit a non-clean build issue.

Last edited by physicsBTW (2024-12-10 04:47:10)

kevinlpowell · 2024-12-10 06:19:23

I don't know if this will help at all, but when I made my patched build of 6.12, the first thing I tried was simply:

git revert 74a0e79

I tried that based on an incomplete understanding of SimonP's post. Simply reverting 74a0e79 results in the build error that gromit posted. It was at that point I noticed that SimonP's patch is not what one gets from reverting 74a0e79, but is instead a fixup that actually compiles. So I went back to an unmodified 6.12 (at commit adc2186) and applied, verbatim, the patch from SimonP's post.

Last edited by kevinlpowell (2024-12-10 06:25:52)

SimonP · 2024-12-10 08:14:01

It doesn't revert cleanly. Use my patch above.

Ranguvar · 2024-12-10 08:43:16

physicsBTW wrote:

@SimonP
commit 74a0e79df68a8042fb84fd7207e57b70722cf825 loads in for me without patching.
I'm thinking that we are having separate issues.

My results seem to agree; there could be separate issues at play.

6.12.4-zen1-1 with Simon's patch did not work for me. I only had time to test twice. All VM cores were pegged at 100%, even when left for several minutes.

gromit's linux-mainline 7056c4e2a13a worked well immediately on first try.

Just let me know if there are any other specific built kernels I should test tomorrow.

Last edited by Ranguvar (2024-12-10 08:43:31)

SimonP · 2024-12-10 08:50:05

Well, this is unfortunate. I have a very similar setup to @Ranguvar (including the same CPU), so I expected success there at least. I'll keep following this thread, but as reverting 74a0e79df68a8042fb84fd7207e57b70722cf825 fixed the issue for me, I can't be of any more help.

loqs · 2024-12-10 17:03:19

SimonP's upstream report https://lore.kernel.org/kvm/52914da7-a9 … ilbox.org/
Edit:
6.12.4 with the diagnostic patch from https://lore.kernel.org/kvm/Z1hiiz40nUq … oogle.com/ applied, which adds two new warning messages that should trigger if you are affected:
linux-headers-6.12.4.arch1-1.1-x86_64.pkg.tar.zst/linux-6.12.4.arch1-1.1-x86_64.pkg.tar.zst.

Last edited by loqs (2024-12-10 20:56:40)

Ranguvar · 2024-12-10 22:04:05

My dmesg output with that kernel is similar to Simon's:

[   74.302915] virbr0: port 1(vnet0) entered blocking state
[   74.302924] virbr0: port 1(vnet0) entered disabled state
[   74.302931] vnet0: entered allmulticast mode
[   74.302969] vnet0: entered promiscuous mode
[   74.304700] virbr0: port 1(vnet0) entered blocking state
[   74.304703] virbr0: port 1(vnet0) entered listening state
[   76.316421] vfio-pci 0000:0e:00.0: enabling device (0002 -> 0003)
[   76.415850] virbr0: port 1(vnet0) entered learning state
[   76.445465] vfio-pci 0000:0e:00.1: enabling device (0000 -> 0002)
[   76.471323] pcieport 0000:03:06.0: unlocked secondary bus reset via: __pci_reset_function_locked+0x41/0x70
[   78.536043] virbr0: port 1(vnet0) entered forwarding state
[   78.536047] virbr0: topology change detected, propagating
[   85.973721] kvm_amd: DE_CFG current = 0, WRMSR = 2
[  329.196806] sched: DL replenish lagged too much
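
To make the DE_CFG line in the log above concrete, here is a rough sketch in shell arithmetic of the two guest-write comparisons from the diffs earlier in the thread, fed with the values from the log (current = 0, WRMSR = 2). The value 2 for supported_de_cfg is an assumption for illustration and is not taken from the log. The function names are hypothetical, and the sketch models only the comparison, not the rest of svm_set_msr().

```shell
#!/bin/sh
# Values from the dmesg line: DE_CFG current = 0, WRMSR = 2.
# ASSUMPTION: supported_de_cfg = 2 (illustrative, not from the log).

# check_6_12 DATA CURRENT: comparison added by 74a0e79,
# i.e. "data != svm->msr_decfg" for a guest-initiated write.
check_6_12() {
    if [ $(( $1 != $2 )) -ne 0 ]; then echo "#GP"; else echo "ok"; fi
}

# check_patch DATA SUPPORTED: comparison in SimonP's patch,
# i.e. "data ^ supported_de_cfg" for a guest-initiated write.
check_patch() {
    if [ $(( $1 ^ $2 )) -ne 0 ]; then echo "#GP"; else echo "ok"; fi
}

check_6_12 2 0    # guest writes 2 while the stored value is 0: prints "#GP"
check_patch 2 2   # same write against the assumed supported mask: prints "ok"
```

With these inputs the 6.12 comparison rejects the write while the patched comparison accepts it. As other posts in this thread show, that difference alone did not resolve every report.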

Ranguvar · 2024-12-11 07:14:32

Tested 6.12.4 with the if statement dropped entirely, as Sean suggests at https://lore.kernel.org/kvm/Z1jEDFpanEI … oogle.com/
Issue was unchanged for me.

loqs · 2024-12-11 13:34:06

Ranguvar wrote:

Issue was unchanged for me.

Were you able to bisect the cause of your issue?

Ranguvar · 2024-12-11 13:39:18

loqs wrote:

Ranguvar wrote:
Issue was unchanged for me.
Were you able to bisect the cause of your issue?

I haven't, but 7056c4e from gromit's package worked well for me.

I can test other builds if it'd help.

loqs · 2024-12-11 14:54:26

@Ranguvar I would suggest starting afresh, using gromit's builds, and seeing where that leads:

$ git bisect start
status: waiting for both good and bad commits
$ git bisect bad v6.12
status: waiting for good commit(s), bad commit known
$ git bisect good v6.11
Bisecting: 7334 revisions left to test after this (roughly 13 steps)
[509d2cd12a10d057fdf72f565b930f9a81140d59] Merge tag 'Smack-for-6.12' of https://github.com/cschaufler/smack-next
$ URL="https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-mainline-$(git describe --long --abbrev=7 | sed 's/\([^-]*-g\)/r\1/;s/-/./g')-1-x86_64.pkg.tar.zst"; curl --output /dev/null --silent --head --fail "${URL}" && echo "sudo pacman -U ${URL/\~/\~}" || echo "Not in cache"
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-mainline-v6.11.r7272.g509d2cd-1-x86_64.pkg.tar.zst
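
For reference, the sed expression in the one-liner above can be tried on its own. The sketch below wraps it in a hypothetical helper, describe_to_pkgver, and feeds it the describe string that corresponds to the package name shown above.

```shell
#!/bin/sh
# describe_to_pkgver DESCRIBE: rewrite `git describe --long --abbrev=7`
# output into the version string used in the package file names above.
# The helper name is hypothetical; the sed program is the one from the post.
describe_to_pkgver() {
    printf '%s\n' "$1" | sed 's/\([^-]*-g\)/r\1/;s/-/./g'
}

describe_to_pkgver v6.11-7272-g509d2cd   # prints v6.11.r7272.g509d2cd
```

The first substitution prefixes the commit count with "r", and the second turns every dash into a dot.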

Ranguvar · 2024-12-12 17:49:40

Found the commit, but I did not expect it to be in scheduler code.

[ranguvar@khufu linux-torvalds]$ git bisect start
status: waiting for both good and bad commits
[ranguvar@khufu linux-torvalds]$ git bisect bad v6.12
status: waiting for good commit(s), bad commit known
[ranguvar@khufu linux-torvalds]$ git bisect good v6.11
Bisecting: 7334 revisions left to test after this (roughly 13 steps)
[509d2cd12a10d057fdf72f565b930f9a81140d59] Merge tag 'Smack-for-6.12' of https://github.com/cschaufler/smack-next
[ranguvar@khufu linux-torvalds]$ git bisect bad 2004cef11ea0  # This commit is also halfway between, confirmed broken many times
Bisecting: 3701 revisions left to test after this (roughly 12 steps)
[7b17f5ebd5fc5e9275eaa5af3d0771f2a7b01bbf] Merge tag 'soc-dt-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-3635-g7b17f5ebd5fc
[ranguvar@khufu linux-torvalds]$ git bisect good 7b17f5ebd5fc
Bisecting: 1825 revisions left to test after this (roughly 11 steps)
[3a7101e9b27fe97240c2fd430c71e61262447dd1] Merge tag 'powerpc-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-5511-g3a7101e9b27f
[ranguvar@khufu linux-torvalds]$ git bisect good 3a7101e9b27f
Bisecting: 912 revisions left to test after this (roughly 10 steps)
[6dcc304f85898b099b35c63748c5e11ba56d0c8a] drm/amd/display: Resolve Coverity Issues
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc5-913-g6dcc304f8589
[ranguvar@khufu linux-torvalds]$ git bisect good 6dcc304f8589
Bisecting: 469 revisions left to test after this (roughly 9 steps)
[ae2c6d8b3b88c176dff92028941a4023f1b4cb91] Merge tag 'drm-xe-next-fixes-2024-09-12' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc7-1356-gae2c6d8b3b88
[ranguvar@khufu linux-torvalds]$ git bisect good ae2c6d8b3b88
Bisecting: 242 revisions left to test after this (roughly 8 steps)
[a65b3c3ed49a3b8068c002e98c90f8594927ff25] Merge tag 'hid-for-linus-2024091602' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-5738-ga65b3c3ed49a
[ranguvar@khufu linux-torvalds]$ git bisect good a65b3c3ed49a
Bisecting: 148 revisions left to test after this (roughly 7 steps)
[cff06a799dbe81f3a697ae7c805eaf88d30c2308] Merge patch series "smartpqi updates"
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-94-gcff06a799dbe
[ranguvar@khufu linux-torvalds]$ git bisect good cff06a799dbe
Bisecting: 74 revisions left to test after this (roughly 6 steps)
[839c4f596f898edc424070dc8b517381572f8502] Merge tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-7262-g839c4f596f89
[ranguvar@khufu linux-torvalds]$ git bisect good 839c4f596f89
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[152e11f6df293e816a6a37c69757033cdc72667d] sched/fair: Implement delayed dequeue
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-42-g152e11f6df29
[ranguvar@khufu linux-torvalds]$ git bisect skip  # Fails to compile
Bisecting: 37 revisions left to test after this (roughly 5 steps)
[54a58a78779169f9c92a51facf6de7ce94962328] sched/fair: Implement DELAY_ZERO
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-43-g54a58a787791
[ranguvar@khufu linux-torvalds]$ git bisect good 54a58a787791
Bisecting: 17 revisions left to test after this (roughly 4 steps)
[6b9ccbc033cf179956a37fef3ee415bdc3029d2f] kthread: Fix task state in kthread worker if being frozen
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-62-g6b9ccbc033cf
[ranguvar@khufu linux-torvalds]$ git bisect bad 6b9ccbc033cf
Bisecting: 8 revisions left to test after this (roughly 3 steps)
[4686cc598f669dea1b50dde1568e6c65c355bc67] sched: Clean up DL server vs core sched
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-53-g4686cc598f66
[ranguvar@khufu linux-torvalds]$ git bisect good 4686cc598f66
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[b2d70222dbf2a2ff7a972a685d249a5d75afa87f] sched: Add put_prev_task(.next)
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-58-gb2d70222dbf2
[ranguvar@khufu linux-torvalds]$ git bisect bad b2d70222dbf2
Bisecting: 1 revision left to test after this (roughly 1 step)
[436f3eed5c69c1048a5754df6e3dbb291e5cccbd] sched: Combine the last put_prev_task() and the first set_next_task()
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-56-g436f3eed5c69
[ranguvar@khufu linux-torvalds]$ git bisect good 436f3eed5c69
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[bd9bbc96e8356886971317f57994247ca491dbf1] sched: Rework dl_server
[ranguvar@khufu linux-torvalds]$ git describe
v6.11-rc1-57-gbd9bbc96e835
[ranguvar@khufu linux-torvalds]$ git bisect bad bd9bbc96e835
bd9bbc96e8356886971317f57994247ca491dbf1 is the first bad commit
commit bd9bbc96e8356886971317f57994247ca491dbf1 (HEAD)
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Aug 14 00:25:55 2024 +0200

    sched: Rework dl_server

    When a task is selected through a dl_server, it will have p->dl_server
    set, such that it can account runtime to the dl_server, see
    update_curr_task().

    Currently p->dl_server is set in pick*task() whenever it goes through
    the dl_server, clearing it is a bit of a mess though. The trivial
    solution is clearing it on the final put (now that we have this
    location).

    However, this gives a problem when:

            p = pick_task(rq);
            if (p)
                    put_prev_set_next_task(rq, prev, next);

    picks the same task but through a different path, notably when it goes
    from picking through the dl_server to a direct pick or vice-versa. In
    that case we cannot readily determine wether we should clear or
    preserve p->dl_server.

    An additional complication is pick_*task() setting p->dl_server for a
    remote pick, it might still need to update runtime before it schedules
    the core_pick.

    Close all these holes and remove all the random clearing of
    p->dl_server by:

     - having pick_*task() manage rq->dl_server

     - having the final put_prev_task() clear p->dl_server

     - having the first set_next_task() set p->dl_server = rq->dl_server

     - complicate the core_sched code to save/restore rq->dl_server where
       appropriate.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org

 kernel/sched/core.c     | 40 +++++++++++++++-------------------------
 kernel/sched/deadline.c |  2 +-
 kernel/sched/fair.c     | 10 ++--------
 kernel/sched/sched.h    | 14 ++++++++++++++
 4 files changed, 32 insertions(+), 34 deletions(-)

I don't use any special config with regard to scheduling. Sometimes I use Arch's packaged linux and sometimes linux-zen. This setup has worked for me since 5.10, before the advent of EEVDF, and likely earlier.
I use

iommu=pt isolcpus=1-7,17-23 rcu_nocbs=1-7,17-23 nohz_full=1-7,17-23

in kernel cmdline. Those cores are allocated to the VM, and some or all will show 100% usage while the guest is locked and some tasks stall on the host. Removing iommu=pt had no effect. Removing the core isolation reliably locks up the host as soon as the guest starts and pins its threads.
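
As a small aid for reading CPU lists like the one above, this sketch expands a list such as 1-7,17-23 into individual CPU numbers. The helper name expand_cpus is hypothetical, and it assumes only plain numbers and simple ranges (no stride syntax).

```shell
#!/bin/sh
# expand_cpus LIST: print each CPU in a list such as "1-7,17-23",
# one number per line. Hypothetical helper; handles only N and N-M entries.
expand_cpus() {
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case "$part" in
            *-*) first=${part%-*}; last=${part#*-} ;;
            *)   first=$part; last=$part ;;
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            echo "$cpu"
            cpu=$((cpu + 1))
        done
    done
    IFS=$old_ifs
}

expand_cpus 1-7,17-23   # prints 1..7 and 17..23, i.e. 14 CPUs
```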

*Edited above for clarity

This appears to block kernel tasks for many minutes (indefinitely).
dmesg output, beginning when the VM starts. No new unusual messages appear when the VM is forcibly stopped.

[  565.381219] vfio-pci 0000:0e:00.0: enabling device (0002 -> 0003)
[  565.512119] vfio-pci 0000:0e:00.1: enabling device (0000 -> 0002)
[  565.538102] pcieport 0000:03:06.0: unlocked secondary bus reset via: __pci_reset_function_locked+0x41/0x70
[  642.153635] hrtimer: interrupt took 3370 ns
[  681.155366] sched: DL replenish lagged too much
[ 1720.226930] INFO: task khugepaged:263 blocked for more than 122 seconds.
[ 1720.226934]       Tainted: P           OE      6.11.0-rc1-1-git-00057-gbd9bbc96e835 #12
[ 1720.226935] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.226937] task:khugepaged      state:D stack:0     pid:263   tgid:263   ppid:2      flags:0x00004000
[ 1720.226941] Call Trace:
[ 1720.226942]  <TASK>
[ 1720.226945]  __schedule+0x382/0x1430
[ 1720.226951]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1720.226955]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1720.226959]  schedule+0x27/0xf0
[ 1720.226962]  schedule_timeout+0x12f/0x160
[ 1720.226967]  wait_for_completion+0x86/0x170
[ 1720.226971]  __flush_work+0x1bf/0x2c0
[ 1720.226975]  ? __pfx_wq_barrier_func+0x10/0x10
[ 1720.226978]  __lru_add_drain_all+0x145/0x1f0
[ 1720.226982]  khugepaged+0x65/0x940
[ 1720.226987]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 1720.226991]  ? __pfx_khugepaged+0x10/0x10
[ 1720.226994]  kthread+0xd2/0x100
[ 1720.226998]  ? __pfx_kthread+0x10/0x10
[ 1720.227002]  ret_from_fork+0x34/0x50
[ 1720.227004]  ? __pfx_kthread+0x10/0x10
[ 1720.227007]  ret_from_fork_asm+0x1a/0x30
[ 1720.227014]  </TASK>
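
To pull just the hung-task reports out of a log like this one, a sed filter along these lines may help. The helper name hung_tasks is hypothetical, and the pattern assumes the message wording shown in the trace above.

```shell
#!/bin/sh
# hung_tasks: read kernel log text on stdin and print "name:pid seconds"
# for each hung-task report. Hypothetical helper for illustration.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than \([0-9]*\) seconds\..*/\1 \2/p'
}

# Typical use on the host (reading the kernel log may need root):
#   dmesg | hung_tasks
```

For the trace above, this would reduce the report to the task name, its PID, and the timeout that was exceeded.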

Last edited by Ranguvar (2024-12-12 20:58:39)

loqs · 2024-12-13 17:03:38

I would suggest contacting Peter Zijlstra as the author of the causal commit if you have not already done so. There is no dedicated list for the scheduler so I suggest using either the main linux-kernel or regressions mailing lists.

gromit · 2024-12-13 19:20:51

$ ./scripts/get_maintainer.pl --no-rolestats -f kernel/sched/core.c
Ingo Molnar <mingo@redhat.com>
Peter Zijlstra <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Vincent Guittot <vincent.guittot@linaro.org>
Dietmar Eggemann <dietmar.eggemann@arm.com>
Steven Rostedt <rostedt@goodmis.org>
Ben Segall <bsegall@google.com>
Mel Gorman <mgorman@suse.de>
Valentin Schneider <vschneid@redhat.com>
linux-kernel@vger.kernel.org

Ranguvar · 2024-12-14 06:36:16

New to this process, but reported: https://lore.kernel.org/regressions/jGQ … ar.io/T/#u

Thank you both for the assistance.

Precific · 2024-12-22 00:34:31

I'm not sure if it's related to Ranguvar's problem, or if it's the third 6.12 regression in this thread ^^

In my case, the symptoms match the original post. Linux and Windows guests both fail to initialize the GPU. I also had it work once (first boot with 6.12) but, other than that, it fails reliably.
Do note that I'm using Manjaro, but since we're talking about a mainline kernel-related regression and this is the only discussion I could find, I thought I'd chime in.
Specs: 7950X3D, ASRock X670E Steel Legend, 6700XT for the guest.

dmesg of a Linux guest failing to initialize the GPU (one of two error variations I've seen):

[   10.245100] [drm] amdgpu kernel modesetting enabled.
[   10.245173] amdgpu: Virtual CRAT table created for CPU
[   10.245182] amdgpu: Topology: Add CPU node
[   10.245480] [drm] initializing kernel modesetting (NAVY_FLOUNDER 0x1002:0x73DF 0x1002:0x0E36 0xC1).
[   10.245492] [drm] register mmio base: 0x81A00000
[   10.245493] [drm] register mmio size: 1048576
[   10.248861] [drm] add ip block number 0 <nv_common>
[   10.248862] [drm] add ip block number 1 <gmc_v10_0>
[   10.248863] [drm] add ip block number 2 <navi10_ih>
[   10.248864] [drm] add ip block number 3 <psp>
[   10.248864] [drm] add ip block number 4 <smu>
[   10.248865] [drm] add ip block number 5 <dm>
[   10.248866] [drm] add ip block number 6 <gfx_v10_0>
[   10.248867] [drm] add ip block number 7 <sdma_v5_2>
[   10.248867] [drm] add ip block number 8 <vcn_v3_0>
[   10.248868] [drm] add ip block number 9 <jpeg_v3_0>
[   10.248877] amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from VFCT
[   10.248878] amdgpu: ATOM BIOS: 113-D5121100-101
[   10.270097] [drm] VCN(0) decode is enabled in VM mode
[   10.270099] [drm] VCN(0) encode is enabled in VM mode
[   10.284318] [drm] JPEG decode is enabled in VM mode
[   10.284320] amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[   10.284359] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   10.284365] amdgpu 0000:05:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 - 0x00000082FEFFFFFF (12272M used)
[   10.284367] amdgpu 0000:05:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   10.284375] [drm] Detected VRAM RAM=12272M, BAR=16384M
[   10.284376] [drm] RAM width 192bits GDDR6
[   10.284495] [drm] amdgpu: 12272M of VRAM memory ready
[   10.284496] [drm] amdgpu: 16042M of GTT memory ready.
[   10.284505] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   10.284626] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[   12.218276] amdgpu 0000:05:00.0: amdgpu: STB initialized to 2048 entries
[   12.218333] [drm] Loading DMUB firmware via PSP: version=0x02020020
[   12.218647] [drm] use_doorbell being set to: [true]
[   12.218658] [drm] use_doorbell being set to: [true]
[   12.218667] [drm] Found VCN firmware Version ENC: 1.30 DEC: 3 VEP: 0 Revision: 4
[   12.218672] amdgpu 0000:05:00.0: amdgpu: Will use PSP to load VCN firmware
[   14.390991] [drm] psp gfx command ID_LOAD_TOC(0x20) failed and response status is (0x0)
[   14.390994] [drm:psp_hw_start [amdgpu]] *ERROR* Failed to load toc
[   14.391223] [drm:psp_hw_start [amdgpu]] *ERROR* PSP tmr init failed!
[   14.411423] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[   14.411604] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[   14.411784] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
[   14.411785] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
[   14.411786] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[   14.411928] ------------[ cut here ]------------
[   14.411929] WARNING: CPU: 6 PID: 507 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:622 amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412114] Modules linked in: amdgpu(+) video wmi amdxcp i2c_algo_bit drm_ttm_helper crct10dif_pclmul ttm crc32_pclmul crc32c_intel polyval_clmulni drm_exec polyval_generic ghash_clmulni_intel gpu_sched nvme sha512_ssse3 drm_suballoc_helper drm_buddy sha256_ssse3 drm_display_helper nvme_core sha1_ssse3 virtio_net cec nvme_auth virtio_console net_failover virtio_blk failover qemu_fw_cfg serio_raw ip6_tables ip_tables fuse
[   14.412133] CPU: 6 PID: 507 Comm: (udev-worker) Not tainted 6.8.5-201.fc39.x86_64 #1
[   14.412134] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   14.412135] RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412305] Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 6a 30 bc e3 e9 5a fd ff ff <0f> 0b b8 ea ff ff ff e9 59 30 bc e3 b8 ea ff ff ff e9 4f 30 bc e3
[   14.412306] RSP: 0018:ffffaae50112ba60 EFLAGS: 00010246
[   14.412308] RAX: ffff8bbcca3ed100 RBX: ffff8bbcd19987a8 RCX: 0000000000000000
[   14.412309] RDX: 0000000000000000 RSI: ffff8bbcd19a4db8 RDI: ffff8bbcd1980000
[   14.412310] RBP: ffff8bbcd19901e8 R08: 0000000000000000 R09: ffffaae50112b878
[   14.412311] R10: ffffaae50112b870 R11: 0000000000000003 R12: ffff8bbcd19905c8
[   14.412311] R13: ffff8bbcd1980010 R14: ffff8bbcd1980000 R15: ffff8bbcd19a4db8
[   14.412313] FS:  00007f5fde03e980(0000) GS:ffff8bc41fb80000(0000) knlGS:0000000000000000
[   14.412315] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.412316] CR2: 00005623742f1000 CR3: 000000010c1fa000 CR4: 0000000000750ef0
[   14.412318] PKRU: 55555554
[   14.412319] Call Trace:
[   14.412320]  <TASK>
[   14.412321]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412493]  ? __warn+0x81/0x130
[   14.412497]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412677]  ? report_bug+0x171/0x1a0
[   14.412681]  ? handle_bug+0x3c/0x80
[   14.412683]  ? exc_invalid_op+0x17/0x70
[   14.412685]  ? asm_exc_invalid_op+0x1a/0x20
[   14.412688]  ? amdgpu_irq_put+0x46/0x70 [amdgpu]
[   14.412857]  amdgpu_fence_driver_hw_fini+0xfe/0x130 [amdgpu]
[   14.413049]  amdgpu_device_fini_hw+0xa6/0x400 [amdgpu]
[   14.413233]  ? blocking_notifier_chain_unregister+0x36/0x50
[   14.413236]  amdgpu_driver_load_kms+0xec/0x190 [amdgpu]
[   14.413411]  amdgpu_pci_probe+0x18b/0x510 [amdgpu]
[   14.413586]  local_pci_probe+0x42/0xa0
[   14.413589]  pci_device_probe+0xc7/0x240
[   14.413592]  really_probe+0x19b/0x3e0
[   14.413595]  ? __pfx___driver_attach+0x10/0x10
[   14.413597]  __driver_probe_device+0x78/0x160
[   14.413599]  driver_probe_device+0x1f/0x90
[   14.413601]  __driver_attach+0xd2/0x1c0
[   14.413603]  bus_for_each_dev+0x85/0xd0
[   14.413605]  bus_add_driver+0x116/0x220
[   14.413607]  driver_register+0x59/0x100
[   14.413609]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[   14.413768]  do_one_initcall+0x58/0x320
[   14.413772]  do_init_module+0x60/0x240
[   14.413775]  __do_sys_init_module+0x17f/0x1b0
[   14.413776]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413782]  do_syscall_64+0x83/0x170
[   14.413784]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413786]  ? __count_memcg_events+0x4d/0xc0
[   14.413788]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413790]  ? count_memcg_events.constprop.0+0x1a/0x30
[   14.413792]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413793]  ? handle_mm_fault+0xa2/0x360
[   14.413795]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413797]  ? do_user_addr_fault+0x304/0x670
[   14.413800]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413801]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.413803]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   14.413805] RIP: 0033:0x7f5fdea2cb9e
[   14.413808] Code: 48 8b 0d 95 12 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 62 12 0c 00 f7 d8 64 89 01 48
[   14.413809] RSP: 002b:00007ffc13be8998 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[   14.413811] RAX: ffffffffffffffda RBX: 00005623741c55a0 RCX: 00007f5fdea2cb9e
[   14.413812] RDX: 00005623741be530 RSI: 00000000019d58ce RDI: 00007f5fdb000010
[   14.413813] RBP: 00007ffc13be8a50 R08: 0000562374199010 R09: 0000000000000007
[   14.413814] R10: 0000000000000001 R11: 0000000000000246 R12: 00005623741be530
[   14.413814] R13: 0000000000020000 R14: 00005623741c0030 R15: 00005623741c9120
[   14.413817]  </TASK>
[   14.413818] ---[ end trace 0000000000000000 ]---

My kernel bisect yielded the following commit:

# first bad commit: [f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101]
    vfio/pci: implement huge_fault support
    
    With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can
    take advantage of PMD and PUD faults to PCI BAR mmaps and create more
    efficient mappings.  PCI BARs are always a power of two and will typically
    get at least PMD alignment without userspace even trying.  Userspace
    alignment for PUD mappings is also not too difficult.
    
    Consolidate faults through a single handler with a new wrapper for
    standard single page faults.  The pre-faulting behavior of commit
    d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed
    in this refactoring since huge_fault will cover the bulk of the faults and
    results in more efficient page table usage.  We also want to avoid that
    pre-faulted single page mappings preempt huge page mappings.
    
    Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Cc: Alexander Gordeev <agordeev@linux.ibm.com>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Gavin Shan <gshan@redhat.com>
    Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
    Cc: Heiko Carstens <hca@linux.ibm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Niklas Schnelle <schnelle@linux.ibm.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vasily Gorbik <gor@linux.ibm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: Zi Yan <ziy@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Reverting commit f9e54c3a2f5b79ecc57c7bc7d0d3521e461a2101 and rebuilding the vfio-pci-core module fixes the issue for me on v6.12.4 and v6.13-rc2.
For reference, the svm_set_msr regression does not appear to affect my system.
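
To illustrate the alignment argument in the commit message above, here is a rough sketch of when a BAR mmap could be served by PMD or PUD faults. It is a simplified model, not the kernel's actual check in vfio-pci-core: the function name `largest_mapping`, the constants (x86-64 with 4 KiB base pages, 2 MiB PMD, 1 GiB PUD) and the example addresses are all assumptions made for illustration.

```python
# Simplified model of huge-mapping eligibility for a PCI BAR mmap.
# Assumption: x86-64 with 4 KiB base pages, so PMD = 2 MiB and PUD = 1 GiB.
# This is NOT the kernel's real logic, only the alignment arithmetic behind it.

PMD_SIZE = 2 * 1024 * 1024        # 2 MiB
PUD_SIZE = 1024 * 1024 * 1024     # 1 GiB


def largest_mapping(phys_addr: int, virt_addr: int, size: int) -> str:
    """Return the largest page-table level ("PUD", "PMD" or "PTE") that
    could map the start of the region in this simplified model.

    A level qualifies when the region is at least that large and both the
    physical and the userspace virtual address are aligned to that size.
    """
    for name, level_size in (("PUD", PUD_SIZE), ("PMD", PMD_SIZE)):
        if (size >= level_size
                and phys_addr % level_size == 0
                and virt_addr % level_size == 0):
            return name
    return "PTE"


if __name__ == "__main__":
    # Hypothetical 256 MiB BAR, naturally aligned, mapped at a 2 MiB aligned
    # virtual address: large enough for PMD faults, too small for PUD.
    print(largest_mapping(0xE0000000, 0x7F0000000000, 256 * 1024 * 1024))

    # Hypothetical 16 GiB BAR: PUD only if userspace also aligns the mmap.
    print(largest_mapping(0x400000000, 0x7F0000000000, 16 * 1024**3))
    print(largest_mapping(0x400000000, 0x7F0000200000, 16 * 1024**3))
```

In this model a large BAR only reaches PUD mappings when userspace places the mmap on a 1 GiB boundary, which matches the commit message's remark that PUD alignment needs some cooperation from userspace while PMD alignment usually comes for free.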

Last edited by Precific (2024-12-22 00:36:26)

loqs · 2024-12-25 17:34:31

For reference, Precific's upstream reports:
https://bugzilla.kernel.org/show_bug.cgi?id=219619
https://lore.kernel.org/regressions/202 … @bhelgaas/
