After a recent update, my monitors suddenly black out at random intervals. After each blackout, the monitors
refuse to wake back up and I'm forced to do a hard reboot. At first, it only occurred when I left my desktop
unattended for ~20 minutes, leading me to believe that I had DPMS or a screensaver enabled. However, xset
showed that I had disabled DPMS and the screensaver (relevant output shown below):
Screen Saver:
prefer blanking: yes allow exposures: yes
timeout: 0 cycle: 600
DPMS (Energy Star):
Standby: 600 Suspend: 600 Off: 600
DPMS is Disabled
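For anyone wanting to repeat this check from a script, here is a minimal sketch that reads the two relevant values out of `xset q` text. The function name `blanking_status` is made up for illustration, and the sample string is just the output quoted above with its whitespace flattened.

```python
import re

# Output of `xset q` as quoted in the post (indentation flattened).
XSET_OUTPUT = """\
Screen Saver:
prefer blanking: yes allow exposures: yes
timeout: 0 cycle: 600
DPMS (Energy Star):
Standby: 600 Suspend: 600 Off: 600
DPMS is Disabled
"""


def blanking_status(xset_text):
    """Return (screensaver_timeout_seconds, dpms_enabled) from `xset q` text."""
    timeout = re.search(r"timeout:\s*(\d+)", xset_text)
    dpms = re.search(r"DPMS is (Enabled|Disabled)", xset_text)
    if timeout is None or dpms is None:
        raise ValueError("unrecognised xset output")
    return int(timeout.group(1)), dpms.group(1) == "Enabled"


timeout, dpms_enabled = blanking_status(XSET_OUTPUT)
print(f"screensaver timeout: {timeout}s, DPMS enabled: {dpms_enabled}")
```

A timeout of 0 together with DPMS disabled means X itself should not be blanking the screen. To run it against a live session, the text could come from `subprocess.run(["xset", "q"], capture_output=True, text=True).stdout` instead of the sample.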
The next time my monitor blacked out, I was writing a document in LaTeX, so I've ruled out DPMS and the screensaver as
possibilities. After rebooting, I reviewed the journal and found some kernel messages for the graphics card:
Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: ring 0 stalled for more than 10330msec
Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000005e2b1 last fence id 0x000000000005e2cc on ring 0)
Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: failed to get a new IB (-35)
Jan 12 20:59:06 NageArch kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: failed to get a new IB (-35)
Jan 12 20:59:06 NageArch kernel: [drm:radeon_cs_ioctl [radeon]] *ERROR* Failed to get ib !
Jan 12 20:59:06 NageArch kernel: BUG: unable to handle page fault for address: ffffb3cac0a65ffc
Jan 12 20:59:06 NageArch kernel: #PF: supervisor read access in kernel mode
Jan 12 20:59:06 NageArch kernel: #PF: error_code(0x0000) - not-present page
followed by a dump of the kernel state and a call trace, which looks like this:
Jan 12 20:59:06 NageArch kernel: Call Trace:
Jan 12 20:59:06 NageArch kernel: radeon_gpu_reset+0xc7/0x2f0 [radeon]
Jan 12 20:59:06 NageArch kernel: radeon_cs_ioctl+0x28d/0x7d0 [radeon]
Jan 12 20:59:06 NageArch kernel: ? __switch_to_asm+0x34/0x70
Jan 12 20:59:06 NageArch kernel: ? radeon_cs_parser_init+0x500/0x500 [radeon]
Jan 12 20:59:06 NageArch kernel: drm_ioctl_kernel+0xb2/0x100 [drm]
Jan 12 20:59:06 NageArch kernel: drm_ioctl+0x209/0x360 [drm]
Jan 12 20:59:06 NageArch kernel: ? radeon_cs_parser_init+0x500/0x500 [radeon]
Jan 12 20:59:06 NageArch kernel: radeon_drm_ioctl+0x49/0x80 [radeon]
Jan 12 20:59:06 NageArch kernel: do_vfs_ioctl+0x43d/0x6c0
Jan 12 20:59:06 NageArch kernel: ksys_ioctl+0x5e/0x90
Jan 12 20:59:06 NageArch kernel: __x64_sys_ioctl+0x16/0x20
Jan 12 20:59:06 NageArch kernel: do_syscall_64+0x4e/0x140
Jan 12 20:59:06 NageArch kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
After reading this, I had narrowed it down to a driver error, so I reviewed the details of my provider (xf86-video-amdgpu),
but I noticed that its last update was in October 2019 and this bug had only just started, so I'm not sure
if it's the behavior of the driver or something else. I haven't been able to reproduce the error on purpose, but I believe
that it may be linked to glrnvim, since I was running it before the bug occurred both times. Any help is welcome.
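To make it easier to spot repeats of this in a saved journal, here is a rough sketch that pulls the "GPU lockup" lines out of journal text. The helper name `find_lockups` and the `fence_gap` field are my own; the gap is simply the difference between the two fence ids in the message, which I am assuming is a fair measure of how much work was still outstanding on the ring.

```python
import re

# Matches the radeon lockup message quoted above (hypothetical helper pattern).
LOCKUP = re.compile(
    r"radeon (?P<dev>[0-9a-f:.]+): GPU lockup \(current fence id "
    r"(?P<cur>0x[0-9a-f]+) last fence id (?P<last>0x[0-9a-f]+) "
    r"on ring (?P<ring>\d+)\)"
)

# Three of the journal lines from the post.
SAMPLE = (
    "Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: ring 0 stalled for more than 10330msec\n"
    "Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: GPU lockup (current fence id "
    "0x000000000005e2b1 last fence id 0x000000000005e2cc on ring 0)\n"
    "Jan 12 20:59:06 NageArch kernel: radeon 0000:03:00.0: failed to get a new IB (-35)\n"
)


def find_lockups(journal_text):
    """Return one dict per 'GPU lockup' line found in the journal text."""
    found = []
    for line in journal_text.splitlines():
        match = LOCKUP.search(line)
        if match is None:
            continue
        found.append({
            "timestamp": line[:15],  # syslog-style "Mon DD HH:MM:SS" prefix
            "device": match["dev"],
            "ring": int(match["ring"]),
            "fence_gap": int(match["last"], 16) - int(match["cur"], 16),
        })
    return found


for lockup in find_lockups(SAMPLE):
    print(lockup)
```

The input could be the text produced by `journalctl -k -b -1` (kernel messages from the previous boot), saved to a file or captured from a subprocess.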
Offline
It's more likely the kernel module than the X11 driver.
Try passing
radeon.audio=0 radeon.dpm=0 radeon.aspm=0 radeon.runpm=0 radeon.bapm=0 radeon.backlight=0
to the kernel.
https://wiki.archlinux.org/index.php/Kernel_parameters
This will disable audio, a bunch of power management features and backlight handling - see if the problem remains and, if not, try to isolate the crucial one (probably one of the PM features)
There's also a chance that this is induced by the xf86 driver, but xf86-video-amdgpu doesn't control the radeon module anyway (it goes with the newer amdgpu module, for Southern Islands chips and newer)
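If it helps with the isolating step, here is a small sketch that splits a kernel command line into its radeon options and lists them one at a time for separate test boots. Both function names are invented for this example; nothing here changes any settings, it only works on strings.

```python
# The parameter set suggested above.
SUGGESTED = (
    "radeon.audio=0 radeon.dpm=0 radeon.aspm=0 "
    "radeon.runpm=0 radeon.bapm=0 radeon.backlight=0"
)


def module_params(cmdline, module="radeon"):
    """Collect `module.name=value` options from a kernel command line."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


def single_param_trials(cmdline, module="radeon"):
    """Yield each option on its own, for booting with one option at a time."""
    for name, value in module_params(cmdline, module).items():
        yield f"{module}.{name}={value}"


for trial in single_param_trials(SUGGESTED):
    print(trial)
```

Passing the contents of `/proc/cmdline` to `module_params` shows which of the options the running kernel was actually booted with, which is worth confirming before drawing conclusions from a test.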
Offline
I rebooted and passed the parameters to the kernel, but this seemed to make the bug worse. The monitor
blacked out after ~7 minutes both times that I used these parameters, with almost identical error logs. Notably,
I was watching YouTube both times, and the blackout happened within seconds of the video beginning.
Previously, this bug did not occur while I was watching videos, so I can only conclude that one of the parameters
caused this behavior. I will be testing to isolate which one.
Offline
So, I did a little more digging and it seems that AMD GPUs have been having trouble with the kernel since version 5.1.14. I looked all the way back through my logs to the day the GPU error first
occurred (2020-01-12) and found that the module is causing a kernel oops. I'm not sure what is causing it, but logs of a similar error with other AMD GPUs show that
the others attempt a soft reboot. Advice from here suggests passing the kernel parameters
idle=nowait rcu_nocbs=0-3
I've rebooted and passed the parameters to the kernel, and will add any further results to the thread.
On another note, I found the first kernel oops:
Jan 12 16:13:52 NageArch kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Jan 12 16:13:52 NageArch kernel: CPU: 8 PID: 575 Comm: Xorg:rcs0 Not tainted 5.4.10-arch1-1 #1
Jan 12 16:13:52 NageArch kernel: Hardware name: Dell Inc. Precision T3600/08HPGT, BIOS A14 09/29/2014
Jan 12 16:13:52 NageArch kernel: RIP: 0010:radeon_ring_backup+0xc0/0x140 [radeon]
Jan 12 16:13:52 NageArch kernel: Code: db 49 89 06 48 85 c0 74 7b 41 8d 7c 24 ff 31 d2 48 c1 e7 02 eb 07 49 8b 06 48 83 c2 04 48 8b 75 08 8d 4b 01 89 db 48 8d 34 9e <8b> 36 89 34 10 23 4d 54 89 cb 48 39 d7 75 dd 4c 89 ef e8 e9 ef 34
Jan 12 16:13:52 NageArch kernel: RSP: 0018:ffffa085c08079d8 EFLAGS: 00010206
Jan 12 16:13:52 NageArch kernel: RAX: ffff8ba6b1700000 RBX: 00000000ffffffff RCX: 0000000000000000
Jan 12 16:13:52 NageArch kernel: RDX: 0000000000000000 RSI: ffffa089c16d4ffc RDI: 00000000000d4b00
Jan 12 16:13:52 NageArch kernel: RBP: ffff8ba7f13b94c8 R08: 00000000000301c7 R09: 00000000001eb000
Jan 12 16:13:52 NageArch kernel: R10: 00000000000301c0 R11: 0000000000000000 R12: 00000000000352c1
Jan 12 16:13:52 NageArch kernel: R13: ffff8ba7f13b94a8 R14: ffffa085c0807a40 R15: ffff8ba7f13b8000
Jan 12 16:13:52 NageArch kernel: FS: 00007fa46fdcd700(0000) GS:ffff8ba7ff600000(0000) knlGS:0000000000000000
Jan 12 16:13:52 NageArch kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 12 16:13:52 NageArch kernel: CR2: ffffa089c16d4ffc CR3: 0000000438d5c004 CR4: 00000000000606e0
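One thing that can be read straight off this dump: on x86, CR2 holds the address that triggered the page fault, so comparing it with the general-purpose registers hints at which pointer was bad. Below is a quick sketch of that comparison; the helper name `registers` is made up, and the sample is the register lines above with the journal prefix trimmed.

```python
import re

# Register lines from the oops above, journal prefix removed.
OOPS = """\
RAX: ffff8ba6b1700000 RBX: 00000000ffffffff RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffa089c16d4ffc RDI: 00000000000d4b00
RBP: ffff8ba7f13b94c8 R08: 00000000000301c7 R09: 00000000001eb000
R10: 00000000000301c0 R11: 0000000000000000 R12: 00000000000352c1
R13: ffff8ba7f13b94a8 R14: ffffa085c0807a40 R15: ffff8ba7f13b8000
CR2: ffffa089c16d4ffc CR3: 0000000438d5c004 CR4: 00000000000606e0
"""


def registers(oops_text):
    """Map register names to integer values for `NAME: <16 hex digits>` pairs."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", oops_text)
    return {name: int(value, 16) for name, value in pairs}


regs = registers(OOPS)
fault_address = regs["CR2"]
# Registers (other than CR2 itself) that hold the faulting address.
holders = [name for name, value in regs.items()
           if value == fault_address and name != "CR2"]
print(f"fault address {fault_address:#x} is held in: {holders}")
```

Here the only match is RSI, so the read that faulted in radeon_ring_backup appears to have gone through the pointer in RSI. That does not say why the pointer was invalid, only where to look.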
Reviewing my pacman.log file, I found that I had upgraded to the latest version of mesa just prior to this event, which seems to be the most likely candidate for the crash.
The log shows that libpulse, shaderc, and pulseaudio were also upgraded, although it seems less likely that one of these would cause screen blackouts.
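For reference, a sketch of how the upgrades for a given day can be listed from pacman.log text. It assumes the `[timestamp] [ALPM] upgraded name (old -> new)` line format; the helper name `upgrades_on` is invented, and the timestamps and version numbers in the sample are placeholders, not values from my actual log. The sample date assumes 12 January 2020, going by the 5.4.10 kernel shown in the oops.

```python
import re

# One "upgraded" entry per line; accepts either "T" or a space in the timestamp.
UPGRADE = re.compile(
    r"^\[(?P<date>\d{4}-\d{2}-\d{2})[T ][^\]]*\] \[ALPM\] upgraded "
    r"(?P<pkg>\S+) \((?P<old>\S+) -> (?P<new>\S+)\)$"
)

# Placeholder log text: package names are from the post, everything else is made up.
SAMPLE_LOG = """\
[2020-01-11T09:00:00-0500] [ALPM] upgraded examplepkg (1.0-1 -> 1.0-2)
[2020-01-12T15:00:00-0500] [ALPM] transaction started
[2020-01-12T15:00:01-0500] [ALPM] upgraded mesa (1.0-1 -> 1.0-2)
[2020-01-12T15:00:02-0500] [ALPM] upgraded libpulse (1.0-1 -> 1.0-2)
[2020-01-12T15:00:03-0500] [ALPM] upgraded shaderc (1.0-1 -> 1.0-2)
[2020-01-12T15:00:04-0500] [ALPM] upgraded pulseaudio (1.0-1 -> 1.0-2)
[2020-01-12T15:00:05-0500] [ALPM] transaction completed
"""


def upgrades_on(log_text, date):
    """Return the names of packages upgraded on `date` (YYYY-MM-DD), in log order."""
    packages = []
    for line in log_text.splitlines():
        match = UPGRADE.match(line)
        if match is not None and match["date"] == date:
            packages.append(match["pkg"])
    return packages


print(upgrades_on(SAMPLE_LOG, "2020-01-12"))
```

On a real system the text would come from /var/log/pacman.log, and the result gives the list of candidates to consider downgrading one at a time.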
Offline
After performing the reboot, a screen blackout occurred shortly after I opened LibreOffice. I rebooted again and passed the parameter
iommu=pt
to the kernel. So far I've opened LibreOffice again without problems.
Offline