Gromit, could you add the bisect output that tells how many more revisions are left to check?
It would make following this thread much easier.
Offline
Please try:
sudo pacman -U https://pkgbuild.com/\~gromit/linux-bisection-kernels/linux-mainline-v6.10.rc6.r2687.g4b6377f-1-x86_64.pkg.tar.zst
$ uname -r
6.10.0-rc6-1-mainline-02687-g4b6377f0e960
The bug is present.
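When testing bisection builds like this one, it is easy to boot a different kernel than the one just installed. Here is a minimal sketch of a sanity check, assuming the package file names and `uname -r` strings follow the pattern seen in this thread (`rN.gHASH` in the file name, `-N-gHASH` at the end of the release string). The function names are made up for illustration.

```python
import re


def commit_from_package(filename: str) -> tuple[int, str]:
    """Return (revision count, abbreviated hash) from a bisection package file name."""
    m = re.search(r"\.r(\d+)\.g([0-9a-f]+)-", filename)
    if m is None:
        raise ValueError(f"no rN.gHASH part in {filename!r}")
    return int(m.group(1)), m.group(2)


def commit_from_uname(release: str) -> tuple[int, str]:
    """Return (revision count, abbreviated hash) from a `uname -r` string."""
    m = re.search(r"-(\d+)-g([0-9a-f]+)$", release)
    if m is None:
        raise ValueError(f"no -N-gHASH suffix in {release!r}")
    return int(m.group(1)), m.group(2)


def is_same_build(filename: str, release: str) -> bool:
    """True if the running kernel was built from the commit named in the package."""
    pkg_rev, pkg_hash = commit_from_package(filename)
    run_rev, run_hash = commit_from_uname(release)
    # The file name carries a shorter hash than uname -r, so compare by prefix.
    return pkg_rev == run_rev and run_hash.startswith(pkg_hash)


print(is_same_build(
    "linux-mainline-v6.10.rc6.r2687.g4b6377f-1-x86_64.pkg.tar.zst",
    "6.10.0-rc6-1-mainline-02687-g4b6377f0e960",
))
```

A stable kernel release string such as `6.12.4-arch1-1.1` has no commit suffix, so `commit_from_uname` raises `ValueError` for it rather than guessing.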
Offline
Please test:
sudo pacman -U https://pkgbuild.com/\~gromit/linux-bisection-kernels/linux-mainline-v6.10.rc6.r2686.gcba7fec-1-x86_64.pkg.tar.zst
Also since Lone_Wolf asked for a progress indication:
Bisecting: 0 revisions left to test after this (roughly 1 step)
[cba7fec864172dadd953daefdd26e01742b71a6a] drm/amd/display: Add NULL check for clk_mgr and clk_mgr->funcs in dcn30_init_hw
Offline
$ uname -r
6.10.0-rc6-1-mainline-02686-gcba7fec86417
The bug is present.
Offline
Please try:
sudo pacman -U https://pkgbuild.com/\~gromit/linux-bisection-kernels/linux-mainline-v6.10.rc6.r2685.g68e599d-1-x86_64.pkg.tar.zst
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[68e599db7a549f010a329515f3508d8a8c3467a4] drm/amdkfd: Validate user queue buffers
Offline
$ uname -r
6.10.0-rc6-1-mainline-02685-g68e599db7a54
The bug is present.
Offline
$ git bisect bad
68e599db7a549f010a329515f3508d8a8c3467a4 is the first bad commit
commit 68e599db7a549f010a329515f3508d8a8c3467a4
Author: Philip Yang <Philip.Yang@amd.com>
Date: Thu Jun 20 12:21:57 2024 -0400
drm/amdkfd: Validate user queue buffers
Find user queue rptr, ring buf, eop buffer and cwsr area BOs, and
check BOs are mapped on the GPU with correct size and take the BO
reference.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 4 ++++
drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 40 insertions(+), 2 deletions(-)
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [adc218676eef25575469234709c2d87185ca223a] Linux 6.12
git bisect bad adc218676eef25575469234709c2d87185ca223a
# good: [98f7e32f20d28ec452afb208f9cffc08448a2652] Linux 6.11
git bisect good 98f7e32f20d28ec452afb208f9cffc08448a2652
# bad: [509d2cd12a10d057fdf72f565b930f9a81140d59] Merge tag 'Smack-for-6.12' of https://github.com/cschaufler/smack-next
git bisect bad 509d2cd12a10d057fdf72f565b930f9a81140d59
# good: [7b17f5ebd5fc5e9275eaa5af3d0771f2a7b01bbf] Merge tag 'soc-dt-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 7b17f5ebd5fc5e9275eaa5af3d0771f2a7b01bbf
# good: [54450af662369efbd4cb438ce7b553dfffa00f07] Merge tag 'parisc-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
git bisect good 54450af662369efbd4cb438ce7b553dfffa00f07
# bad: [f0b7dcf25834afd17df316367dfe5d4c890c713c] drm/amd/display: Wait for all pending cleared before full update
git bisect bad f0b7dcf25834afd17df316367dfe5d4c890c713c
# good: [3f53d7e442197b7e7d56b470b02dfd37a8bc5c46] Merge tag 'drm-intel-gt-next-2024-08-23' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
git bisect good 3f53d7e442197b7e7d56b470b02dfd37a8bc5c46
# bad: [fd69ef05029f9beb7b031ef96e7a36970806a670] drm/radeon: use GEM references instead of TTMs
git bisect bad fd69ef05029f9beb7b031ef96e7a36970806a670
# good: [9c7e69d2e1245fdd5fa5c65cd022530b2a5ef1b7] drm/amdgpu/gfx9: enable wave kill for compute queues
git bisect good 9c7e69d2e1245fdd5fa5c65cd022530b2a5ef1b7
# bad: [722e96c99f1d7532fdfbb557f50a399f6cc57d82] drm/amd/display: Check null pointers before using them
git bisect bad 722e96c99f1d7532fdfbb557f50a399f6cc57d82
# bad: [47c0388b0589cb481c294dcb857d25a214c46eb3] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
git bisect bad 47c0388b0589cb481c294dcb857d25a214c46eb3
# bad: [2662b7d9d8bc1dda1f89f0dd33422e069f2f861c] drm/amdgpu/gfx11: properly handle error ints on all pipes
git bisect bad 2662b7d9d8bc1dda1f89f0dd33422e069f2f861c
# bad: [8284951a6e79c6806c675e5f68a4cd425dd56bc4] drm/amdgpu: fix ras UE error injection failure issue
git bisect bad 8284951a6e79c6806c675e5f68a4cd425dd56bc4
# bad: [4b6377f0e96085cbec96eb7f0b282430ccdd3d75] drm/amd/display: Add NULL check for clk_mgr and clk_mgr->funcs in dcn401_init_hw
git bisect bad 4b6377f0e96085cbec96eb7f0b282430ccdd3d75
# bad: [cba7fec864172dadd953daefdd26e01742b71a6a] drm/amd/display: Add NULL check for clk_mgr and clk_mgr->funcs in dcn30_init_hw
git bisect bad cba7fec864172dadd953daefdd26e01742b71a6a
# bad: [68e599db7a549f010a329515f3508d8a8c3467a4] drm/amdkfd: Validate user queue buffers
git bisect bad 68e599db7a549f010a329515f3508d8a8c3467a4
# first bad commit: [68e599db7a549f010a329515f3508d8a8c3467a4] drm/amdkfd: Validate user queue buffers
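For anyone who wants to summarise a log like the one above without reading every line, here is a rough sketch. It assumes the log uses the `git bisect good|bad <hash>` and `# first bad commit: [<hash>] <subject>` line shapes shown in this post. `summarize_bisect_log` and `SAMPLE` are hypothetical names, and the sample is a shortened excerpt of the log, not the whole thing.

```python
import re
from collections import Counter

# Line shapes as they appear in the log posted above.
MARK = re.compile(r"^git bisect (good|bad) ([0-9a-f]{7,40})$")
FIRST_BAD = re.compile(r"^# first bad commit: \[([0-9a-f]{7,40})\] (.*)$")


def summarize_bisect_log(text: str):
    """Count good/bad marks and pick out the first bad commit, if the log has one."""
    counts = Counter()
    first_bad = None
    for raw in text.splitlines():
        line = raw.strip()
        m = MARK.match(line)
        if m:
            counts[m.group(1)] += 1
            continue
        m = FIRST_BAD.match(line)
        if m:
            first_bad = (m.group(1), m.group(2))
    return counts, first_bad


SAMPLE = """\
git bisect start
# bad: [adc218676eef25575469234709c2d87185ca223a] Linux 6.12
git bisect bad adc218676eef25575469234709c2d87185ca223a
# good: [98f7e32f20d28ec452afb208f9cffc08448a2652] Linux 6.11
git bisect good 98f7e32f20d28ec452afb208f9cffc08448a2652
# first bad commit: [68e599db7a549f010a329515f3508d8a8c3467a4] drm/amdkfd: Validate user queue buffers
"""

counts, first_bad = summarize_bisect_log(SAMPLE)
print(dict(counts), first_bad)
```

Note that the counts include the two starting endpoints (the initial good and bad tags), so the number of kernels actually tested is two less than the total.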
So the kernel considers the buffers DaVinci Resolve is passing in to be invalid?
Offline
It seems like torsten (the rocm maintainer in Arch) has found a similar problem with ollama rocm: https://gitlab.freedesktop.org/drm/amd/-/issues/3821
At least the commit is the same.
Offline
Thanks for linking these two issues, gromit. I agree with loqs: With the recently added buffer validation, some workloads on older or officially unsupported AMD GPUs cause a segfault. Note that the linked llama example works fine on officially supported hardware with any kernel version I tested.
Offline
Would it help to know which call is failing:
--- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
@@ -247,17 +247,23 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 		return err;
 
 	err = kfd_queue_buffer_get(vm, properties->write_ptr, &properties->wptr_bo, PAGE_SIZE);
-	if (err)
+	if (err) {
+		pr_warn("kfd_queue_buffer_get failed for write_ptr\n");
 		goto out_err_unreserve;
+	}
 
 	err = kfd_queue_buffer_get(vm, properties->read_ptr, &properties->rptr_bo, PAGE_SIZE);
-	if (err)
+	if (err) {
+		pr_warn("kfd_queue_buffer_get failed for read_ptr\n");
 		goto out_err_unreserve;
+	}
 
 	err = kfd_queue_buffer_get(vm, (void *)properties->queue_address,
 				   &properties->ring_bo, properties->queue_size);
-	if (err)
+	if (err) {
+		pr_warn("kfd_queue_buffer_get failed for queue_address\n");
 		goto out_err_unreserve;
+	}
 
 	/* only compute queue requires EOP buffer and CWSR area */
 	if (properties->type != KFD_QUEUE_TYPE_COMPUTE)
@@ -275,8 +281,10 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 		err = kfd_queue_buffer_get(vm, (void *)properties->eop_ring_buffer_address,
 					   &properties->eop_buf_bo,
 					   properties->eop_ring_buffer_size);
-		if (err)
+		if (err) {
+			pr_warn("kfd_queue_buffer_get failed for eop_ring_buffer_address\n");
 			goto out_err_unreserve;
+		}
 	}
 
 	if (properties->ctl_stack_size != topo_dev->node_props.ctl_stack_size) {
@@ -301,8 +309,10 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
 
 	err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address,
 				   &properties->cwsr_bo, total_cwsr_size);
-	if (!err)
+	if (!err) {
+		pr_warn("kfd_queue_buffer_get failed for ctx_save_restore_area_address\n");
 		goto out_unreserve;
+	}
 
 	amdgpu_bo_unreserve(vm->root.bo);
 
Offline
Here is a build that has the above patch from @loqs included on top of stable:
sudo pacman -U https://pkgbuild.com/\~gromit/linux-bisection-kernels/linux-6.12.4.arch1-1.1-x86_64.pkg.tar.zst
Offline
$ uname -r
6.12.4-arch1-1.1
The bug remains.
Offline
Yes, that is expected, but which output did you get in "dmesg"? The output should help us further classify what actually goes wrong.
Offline
Here's what I get in dmesg after attempting to load a project in DR:
[ 67.484928] process 'opt/resolve/bin/resolve' started with executable stack
[ 72.014516] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.019401] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.024083] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.028133] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.030921] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.033369] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.035695] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.038035] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 72.040342] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.977967] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.980306] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.982951] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.985348] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.987880] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.990365] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.992677] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.994912] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 75.997083] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.035510] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.042406] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.046844] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.050682] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.053626] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.058706] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.062679] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.067234] amdgpu: kfd_queue_buffer_get failed for queue_address
[ 76.070642] amdgpu: kfd_queue_buffer_get failed for queue_address
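With this many repeated lines, a quick tally makes it easier to see whether any buffer other than `queue_address` ever fails. This is a small sketch that assumes the messages have exactly the wording of the `pr_warn` strings in the debug patch. `tally_failures` is an illustrative name, and the sample text is three lines copied from the output above.

```python
import re
from collections import Counter

# Matches the pr_warn wording from the debug patch and captures the buffer name.
FAIL = re.compile(r"kfd_queue_buffer_get failed for (\w+)")


def tally_failures(dmesg_text: str) -> Counter:
    """Count how often each buffer name shows up in the failure messages."""
    return Counter(m.group(1) for m in FAIL.finditer(dmesg_text))


sample = """\
[   72.014516] amdgpu: kfd_queue_buffer_get failed for queue_address
[   72.019401] amdgpu: kfd_queue_buffer_get failed for queue_address
[   75.977967] amdgpu: kfd_queue_buffer_get failed for queue_address
"""

for name, count in tally_failures(sample).items():
    print(f"{name}: {count}")
```

In practice the text could come from saving `dmesg` output to a file and reading that file in.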
Offline
@Vladar can you please supply the information requested in https://gitlab.freedesktop.org/drm/amd/ … te_2704553
echo "file kfd_queue.c +pfl" | sudo tee /sys/kernel/debug/dynamic_debug/control
Then run Davinci Resolve to trigger the issue and post the output of `dmesg` after doing so.
Offline
$ uname -r
6.12.4-arch1-1
[ 322.807818] process 'opt/resolve/bin/resolve' started with executable stack
[ 327.357101] kfd_queue_buffer_get:212: amdgpu: expected size 0x2000 not equal to mapping addr 0x2981a000 size 0x1000
(the last line is repeated multiple times)
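To make the numbers in that debug line easier to read, here is a small sketch that pulls out the hex values and converts them to pages. It assumes the sizes are byte counts and that the page size is 4 KiB (the usual value on x86_64). `parse_mismatch` and `PAGE_SIZE` are names chosen for this example, not kernel symbols.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is usual on x86_64

MISMATCH = re.compile(
    r"expected size (0x[0-9a-f]+) not equal to mapping addr (0x[0-9a-f]+) size (0x[0-9a-f]+)"
)


def parse_mismatch(line: str):
    """Return the expected and mapped sizes from a kfd_queue_buffer_get debug line."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    # int(..., 16) accepts the 0x prefix.
    expected, addr, mapped = (int(g, 16) for g in m.groups())
    return {
        "addr": addr,
        "expected_bytes": expected,
        "mapped_bytes": mapped,
        "expected_pages": expected // PAGE_SIZE,
        "mapped_pages": mapped // PAGE_SIZE,
    }


line = ("[  327.357101] kfd_queue_buffer_get:212: amdgpu: expected size 0x2000 "
        "not equal to mapping addr 0x2981a000 size 0x1000")
print(parse_mismatch(line))
```

Under those assumptions, the kernel expected an 8 KiB (two-page) mapping at that address but found one of only 4 KiB (one page). The sketch does not say why the sizes differ, only what they are.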
Offline