Hi!
Seemingly at random, but always when I've left my computer idling (never while it is actually in use), I've had some AMDGPU-related hangs since last Friday.
I checked, and it seems the problems started after upgrading to kernel 5.3.1 (zen branch). Has anyone had similar problems?
Most often there are kernel messages similar to this:
syys 30 22:33:08 ArkkiVille kernel: ------------[ cut here ]------------
syys 30 22:33:08 ArkkiVille kernel: Missing get_user_page_done
syys 30 22:33:08 ArkkiVille kernel: WARNING: CPU: 3 PID: 2760 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:992 amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: Modules linked in: rfcomm ip6table_filter ip6_tables iptable_filter cmac algif_hash algif_skcipher af_alg bnep msr uinput nct6775 hwmon_vid intel_rapl_msr intel_rapl_common fuse x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nls_iso8859_1 irqbypass nls_cp437 vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_uncore intel_rapl_perf btusb btrtl btbcm btintel hid_steam wmi_bmof bluetooth snd_hda_codec_realtek snd_hda_codec_generic ecdh_generic ledtrig_audio rfkill tda18271c2dd snd_hda_codec_hdmi snd_hda_intel snd_hda_codec input_leds snd_hda_core joydev mousedev snd_hwdep ecc snd_pcm snd_timer drxk snd mei_hdcp iTCO_wdt mxm_wmi iTCO_vendor_support gpio_ich soundcore ddbridge evdev mei_me dvb_core mei mac_hid videobuf2_vmalloc videobuf2_memops videobuf2_common videodev mc e1000e lpc_ich i2c_i801 wmi vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) overlay sg nfsd auth_rpcgss vhba
syys 30 22:33:08 ArkkiVille kernel: nfs_acl lockd grace sunrpc crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sr_mod cdrom hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid sd_mod ahci libahci libata xhci_pci crc32c_intel scsi_mod xhci_hcd ehci_pci ehci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart bcache crc64
syys 30 22:33:08 ArkkiVille kernel: CPU: 3 PID: 2760 Comm: FahCore_21 Tainted: G OE 5.3.1-zen1-1-zen #1
syys 30 22:33:08 ArkkiVille kernel: Hardware name: ASUS All Series/MAXIMUS VII GENE, BIOS 3503 04/18/2018
syys 30 22:33:08 ArkkiVille kernel: RIP: 0010:amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: Code: d7 48 39 c6 0f 85 f8 fe ff ff 80 3d b1 44 43 00 00 0f 85 eb fe ff ff 48 c7 c7 43 68 68 c0 c6 05 9d 44 43 00 01 e8 8f 0e d8 d5 <0f> 0b e9 d1 fe ff ff 31 c0 eb cc 0f 1f 44 00 00 0f 1f 44 00 00 41
syys 30 22:33:08 ArkkiVille kernel: RSP: 0018:ffff983ecb127980 EFLAGS: 00010282
syys 30 22:33:08 ArkkiVille kernel: RAX: 0000000000000000 RBX: ffff909df6c21c00 RCX: 0000000000000000
syys 30 22:33:08 ArkkiVille kernel: RDX: 0000000000000001 RSI: 0000000000000096 RDI: 00000000ffffffff
syys 30 22:33:08 ArkkiVille kernel: RBP: ffff909ec6784088 R08: 00000000000004e8 R09: 0000000000000001
syys 30 22:33:08 ArkkiVille kernel: R10: 0000000000000001 R11: 0000000000002f5c R12: 0000000000000000
syys 30 22:33:08 ArkkiVille kernel: R13: ffff909df6c21c00 R14: ffff983ecb127ac0 R15: 0000000000000000
syys 30 22:33:08 ArkkiVille kernel: FS: 00007fdfa0283700(0000) GS:ffff909eceac0000(0000) knlGS:0000000000000000
syys 30 22:33:08 ArkkiVille kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
syys 30 22:33:08 ArkkiVille kernel: CR2: 00007f1324ec9ffc CR3: 0000000346e38005 CR4: 00000000001606e0
syys 30 22:33:08 ArkkiVille kernel: Call Trace:
syys 30 22:33:08 ArkkiVille kernel: ttm_tt_unbind+0x1d/0x30 [ttm]
syys 30 22:33:08 ArkkiVille kernel: ttm_bo_move_ttm+0x43/0x120 [ttm]
syys 30 22:33:08 ArkkiVille kernel: ttm_bo_handle_move_mem+0x55a/0x580 [ttm]
syys 30 22:33:08 ArkkiVille kernel: ttm_bo_validate+0x26b/0x2d0 [ttm]
syys 30 22:33:08 ArkkiVille kernel: amdgpu_cs_list_validate+0xd9/0x180 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: amdgpu_cs_ioctl+0xc83/0x1e50 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: ? _raw_spin_lock_irqsave+0x26/0x50
syys 30 22:33:08 ArkkiVille kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: drm_ioctl_kernel+0xb8/0x100 [drm]
syys 30 22:33:08 ArkkiVille kernel: drm_ioctl+0x253/0x3f0 [drm]
syys 30 22:33:08 ArkkiVille kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: do_vfs_ioctl+0x43d/0x7a0
syys 30 22:33:08 ArkkiVille kernel: ? syscall_trace_enter+0x1f2/0x2d0
syys 30 22:33:08 ArkkiVille kernel: __x64_sys_ioctl+0x62/0x90
syys 30 22:33:08 ArkkiVille kernel: do_syscall_64+0x5f/0x1c0
syys 30 22:33:08 ArkkiVille kernel: ? prepare_exit_to_usermode+0x85/0xb0
syys 30 22:33:08 ArkkiVille kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
syys 30 22:33:08 ArkkiVille kernel: RIP: 0033:0x7fdf9fda921b
syys 30 22:33:08 ArkkiVille kernel: Code: 0f 1e fa 48 8b 05 75 8c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 8c 0c 00 f7 d8 64 89 01 48
syys 30 22:33:08 ArkkiVille kernel: RSP: 002b:00007fdfa0282598 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
syys 30 22:33:08 ArkkiVille kernel: RAX: ffffffffffffffda RBX: 00007fdfa0282600 RCX: 00007fdf9fda921b
syys 30 22:33:08 ArkkiVille kernel: RDX: 00007fdfa0282600 RSI: 00000000c0186444 RDI: 0000000000000010
syys 30 22:33:08 ArkkiVille kernel: RBP: 00000000c0186444 R08: 00007fdfa02826b0 R09: 0000000000000010
syys 30 22:33:08 ArkkiVille kernel: R10: 00007fdfa02826b0 R11: 0000000000000246 R12: 00000000011a6a20
syys 30 22:33:08 ArkkiVille kernel: R13: 0000000000000010 R14: 00007fdf88242630 R15: 000000002c5551e0
syys 30 22:33:08 ArkkiVille kernel: ---[ end trace db7062d8128318e3 ]---
These are present every time the hang happens, except once (and I've had around five or six hangs so far).
The hang: the screen stays black and cannot wake up. The UI is otherwise "hung", too: I cannot do anything via keyboard (or mouse), not even blindly (switching to a VT and logging in does not work, and neither does rebooting via CTRL+ALT+DEL), and there is no reaction from the Caps/Scroll/Num Lock LEDs. (However, I'm running ckb-next, which can sometimes block keys if the system is hung in a way that prevents the ckb-next process from executing properly. I can still see some color LED animation on the keyboard, though, so it seems ckb-next is running properly and should not be blocking input.)
I can SSH into the computer and shut it down (and do other stuff, too), but the shutdown takes ages (around 15 minutes). In the log I can see a delay of that length from "Journal stopped" to the actual reboot. I'm not sure what is going on in the meantime, since logging has stopped by then, and the monitor remains black, so I cannot see whether there are any kernel messages! Many times I've synced and rebooted via SysRq (more specifically, SysRQ+SSSSSSREISU [wait...] B) since I didn't want to wait for the actual reboot, but just once I waited, since I figured it would reboot eventually ... and it did.
One thing worth pointing out is that I'm folding (Folding@Home) in the background all the time (unless I'm doing something else GPU-intensive, in which case I pause folding temporarily so it does not disrupt my gaming session, er, business session)!
Thanks for any input!
EDIT: OOPS! Yet again I may have chosen the wrong section. Kernel&HW would have been more appropriate!
Last edited by Wild Penguin (2019-11-10 17:45:23)
Ok, this is weird. Problems might not be with amdgpu after all. Instead there seems to be a kernel memory leak!
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi        14Gi       567Mi       0,0Ki       276Mi       542Mi
Swap:          32Gi       215Mi        32Gi
I can see these problems started around last Friday (after the kernel upgrade). I also have earlyoom installed (so that Firefox and certain beta software I've been playing with won't occasionally grind the whole system to a near-halt). Because of earlyoom, I can see from the logs that it is trying to kill processes when system memory runs low, and from its status output I can see that something is slowly eating RAM! As earlyoom tries to kill foldingathome, I believe this causes the amdgpu driver to crash (I need to check previous logs to be sure). However, from top (or htop) I cannot see any process that is eating this memory! [EDIT]: Also, earlyoom is killing the wrong culprit and solving nothing: after it kills processes, memory usage is still high!
Now I'm trying to:
find out via Google how to diagnose low memory issues that are not caused by a user-space process, and
learn how to use /proc/slabinfo and slabtop.
It is difficult to find information via Google about low memory issues, since most hits point to newbie(ish) users who mistake cache/buffer usage for low memory (which is not the case here; memory is actually low). There is probably useful information out there, but it is difficult to find among those hits.
My plan: switch back to the mainline (non-zen) kernel. Zen has debugging features, such as kernel memleak testing, disabled (to increase responsiveness for desktop users). I'll see if the memory leaks go away with the switch; if they do, I know the zen kernel is the actual culprit. If not, I at least have something new to learn (how to diagnose actual kernel memory leaks)...
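For reference, on a kernel built with CONFIG_DEBUG_KMEMLEAK the leak report is exposed through debugfs. Below is a minimal sketch of summarising such a report by allocation size. It assumes the report text has already been saved (for example from /sys/kernel/debug/kmemleak, as root) and that each entry begins with a line of the form "unreferenced object 0x... (size N):". The helper names are made up for illustration.

```python
import re
from collections import Counter

# Assumed header format of one kmemleak entry:
#   unreferenced object 0xffff... (size 4096):
_ENTRY = re.compile(
    r"^unreferenced object 0x[0-9a-fA-F]+ \(size (\d+)\):", re.MULTILINE
)


def summarise_kmemleak(report_text):
    """Count suspected leaks per allocation size (hypothetical helper)."""
    return Counter(int(size) for size in _ENTRY.findall(report_text))


def total_leaked_bytes(summary):
    """Sum the sizes of all reported objects, in bytes."""
    return sum(size * count for size, count in summary.items())
```

If the report turns out to be dominated by a few object sizes, those sizes can be compared against the kmalloc-* caches in /proc/slabinfo to see whether they line up.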
This is just FYI in case someone else is having a similar problem.
Also, any input on how to proceed to find the actual culprit is welcome.
Also, I realised I put rather little information in my OP, so here is some:
$ sudo inxi -mFxz
System: Host: ArkkiVille Kernel: 5.3.1-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 9.1.0 Console: tty 4 Distro: Arch Linux
Machine: Type: Desktop System: ASUS product: All Series v: N/A serial: N/A
Mobo: ASUSTeK model: MAXIMUS VII GENE v: Rev 1.xx serial: <filter> UEFI: American Megatrends v: 3503
date: 04/18/2018
Memory: RAM: total: 15.58 GiB used: 14.86 GiB (95.4%)
Array-1: capacity: 32 GiB slots: 4 EC: None max module size: 8 GiB note: est.
Device-1: DIMM_A1 size: No Module Installed
Device-2: DIMM_A2 size: 8 GiB speed: 1333 MT/s type: DDR3
Device-3: DIMM_B1 size: No Module Installed
Device-4: DIMM_B2 size: 8 GiB speed: 1333 MT/s type: DDR3
CPU: Topology: Quad Core model: Intel Core i7-4790K bits: 64 type: MT MCP arch: Haswell rev: 3 L2 cache: 8192 KiB
flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 64111
Speed: 4008 MHz min/max: 800/4400 MHz Core speeds (MHz): 1: 4008 2: 4008 3: 4009 4: 4008 5: 4012 6: 4007 7: 4009
8: 4013
Graphics: Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel
bus ID: 03:00.0
Display: server: X.org 1.20.5 driver: amdgpu unloaded: modesetting tty: 157x50
Message: Advanced graphics data unavailable in console for root.
Audio: Device-1: Intel 9 Series Family HD Audio vendor: ASUSTeK driver: snd_hda_intel v: kernel bus ID: 00:1b.0
Device-2: Advanced Micro Devices [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] driver: snd_hda_intel v: kernel
bus ID: 03:00.1
Device-3: Digital Devices Octopus DVB Adapter driver: ddbridge v: 0.9.33-integrated bus ID: 05:00.0
Sound Server: ALSA v: k5.3.1-zen1-1-zen
Network: Device-1: Intel Ethernet I218-V vendor: ASUSTeK driver: e1000e v: 3.2.6-k port: f040 bus ID: 00:19.0
IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives: Local Storage: total: 17.74 TiB used: 14.32 TiB (80.7%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 960 EVO 500GB size: 465.76 GiB
ID-2: /dev/sda vendor: Western Digital model: WD30EFRX-68AX9N0 size: 2.73 TiB temp: 38 C
ID-3: /dev/sdb vendor: Western Digital model: WD40EFRX-68WT0N0 size: 3.64 TiB
ID-4: /dev/sdc vendor: Western Digital model: WD80EFAX-68KNBN0 size: 7.28 TiB temp: 50 C
ID-5: /dev/sdd vendor: Western Digital model: WD40EFRX-68WT0N0 size: 3.64 TiB temp: 40 C
Partition: ID-1: / size: 3.65 TiB used: 632.71 GiB (16.9%) fs: ext4 dev: /dev/bcache2
ID-2: /boot size: 547.9 MiB used: 121.1 MiB (22.1%) fs: vfat dev: /dev/nvme0n1p1
ID-3: swap-1 size: 32.46 GiB used: 212.2 MiB (0.6%) fs: swap dev: /dev/nvme0n1p2
Sensors: System Temperatures: cpu: 38.0 C mobo: 37.0 C gpu: amdgpu temp: 29 C
Fan Speeds (RPM): N/A gpu: amdgpu fan: 0
Voltages: 12v: 12.10 5v: 5.04 3.3v: 3.31 vbat: 3.28
Info: Processes: 298 Uptime: 22h 58m Init: systemd Compilers: gcc: 9.1.0 clang: 8.0.1 Shell: bash v: 5.0.11 inxi: 3.0.36
cat /proc/slabinfo (while mem usage is high):
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_inode 1053 1209 832 39 8 : tunables 0 0 0 : slabdata 31 31 0
fat_inode_cache 88 88 728 22 4 : tunables 0 0 0 : slabdata 4 4 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_async_pf 0 0 136 30 1 : tunables 0 0 0 : slabdata 0 0 0
kvm_vcpu 0 0 16256 2 8 : tunables 0 0 0 : slabdata 0 0 0
kvm_mmu_page_header 0 0 160 25 1 : tunables 0 0 0 : slabdata 0 0 0
x86_fpu 0 0 4160 7 8 : tunables 0 0 0 : slabdata 0 0 0
ovl_inode 0 0 672 24 4 : tunables 0 0 0 : slabdata 0 0 0
nfs4_layout_stateid 0 0 280 29 2 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_delegations 0 0 232 35 2 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_stateids 0 0 168 24 1 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_files 0 0 288 28 2 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_lockowners 160 180 392 20 2 : tunables 0 0 0 : slabdata 9 9 0
nfsd4_openowners 0 0 432 37 4 : tunables 0 0 0 : slabdata 0 0 0
nfsd4_clients 0 0 1296 25 8 : tunables 0 0 0 : slabdata 0 0 0
rpc_inode_cache 51 75 640 25 4 : tunables 0 0 0 : slabdata 3 3 0
ext4_groupinfo_4k 141624 141624 144 28 1 : tunables 0 0 0 : slabdata 5058 5058 0
ext4_inode_cache 7526 12492 1080 30 8 : tunables 0 0 0 : slabdata 429 429 0
ext4_allocation_context 256 256 128 32 1 : tunables 0 0 0 : slabdata 8 8 0
ext4_extent_status 862 1224 40 102 1 : tunables 0 0 0 : slabdata 12 12 0
mbcache 584 876 56 73 1 : tunables 0 0 0 : slabdata 12 12 0
jbd2_journal_head 272 408 120 34 1 : tunables 0 0 0 : slabdata 12 12 0
jbd2_revoke_table_s 1024 1024 16 256 1 : tunables 0 0 0 : slabdata 4 4 0
jbd2_revoke_record_s 1024 1536 32 128 1 : tunables 0 0 0 : slabdata 12 12 0
scsi_sense_cache 736 736 128 32 1 : tunables 0 0 0 : slabdata 23 23 0
search 248 351 592 27 4 : tunables 0 0 0 : slabdata 13 13 0
ip6-frags 0 0 184 22 1 : tunables 0 0 0 : slabdata 0 0 0
PINGv6 0 0 1216 26 8 : tunables 0 0 0 : slabdata 0 0 0
RAWv6 390 442 1216 26 8 : tunables 0 0 0 : slabdata 17 17 0
UDPv6 1251 1275 1280 25 8 : tunables 0 0 0 : slabdata 51 51 0
tw_sock_TCPv6 272 272 240 34 2 : tunables 0 0 0 : slabdata 8 8 0
request_sock_TCPv6 208 208 304 26 2 : tunables 0 0 0 : slabdata 8 8 0
TCPv6 482 520 2368 13 8 : tunables 0 0 0 : slabdata 40 40 0
bfq_io_cq 508 900 160 25 1 : tunables 0 0 0 : slabdata 36 36 0
mqueue_inode_cache 340 340 960 34 8 : tunables 0 0 0 : slabdata 10 10 0
fscrypt_info 512 768 64 64 1 : tunables 0 0 0 : slabdata 12 12 0
fscrypt_ctx 680 680 48 85 1 : tunables 0 0 0 : slabdata 8 8 0
dnotify_struct 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
pid_namespace 195 195 208 39 2 : tunables 0 0 0 : slabdata 5 5 0
posix_timers_cache 0 0 240 34 2 : tunables 0 0 0 : slabdata 0 0 0
UNIX 6082 6244 1024 32 8 : tunables 0 0 0 : slabdata 196 196 0
ip4-frags 20 20 200 20 1 : tunables 0 0 0 : slabdata 1 1 0
xfrm_state 0 0 704 23 4 : tunables 0 0 0 : slabdata 0 0 0
PING 0 0 960 34 8 : tunables 0 0 0 : slabdata 0 0 0
RAW 480 544 1024 32 8 : tunables 0 0 0 : slabdata 17 17 0
tw_sock_TCP 272 340 240 34 2 : tunables 0 0 0 : slabdata 10 10 0
request_sock_TCP 208 208 304 26 2 : tunables 0 0 0 : slabdata 8 8 0
TCP 478 574 2240 14 8 : tunables 0 0 0 : slabdata 41 41 0
hugetlbfs_inode_cache 52 52 616 26 4 : tunables 0 0 0 : slabdata 2 2 0
dquot 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0
eventpoll_pwq 3920 4144 72 56 1 : tunables 0 0 0 : slabdata 74 74 0
dax_cache 21 21 768 21 4 : tunables 0 0 0 : slabdata 1 1 0
request_queue 105 105 2088 15 8 : tunables 0 0 0 : slabdata 7 7 0
biovec-max 106 129 4096 8 8 : tunables 0 0 0 : slabdata 17 17 0
biovec-128 136 144 2048 16 8 : tunables 0 0 0 : slabdata 9 9 0
biovec-64 256 256 1024 32 8 : tunables 0 0 0 : slabdata 8 8 0
user_namespace 210 210 536 30 4 : tunables 0 0 0 : slabdata 7 7 0
dmaengine-unmap-256 15 15 2112 15 8 : tunables 0 0 0 : slabdata 1 1 0
dmaengine-unmap-128 30 30 1088 30 8 : tunables 0 0 0 : slabdata 1 1 0
dmaengine-unmap-16 301 420 192 21 1 : tunables 0 0 0 : slabdata 20 20 0
dmaengine-unmap-2 178466 247744 64 64 1 : tunables 0 0 0 : slabdata 3871 3871 0
sock_inode_cache 8126 8350 832 39 8 : tunables 0 0 0 : slabdata 215 215 0
skbuff_ext_cache 993 1216 128 32 1 : tunables 0 0 0 : slabdata 38 38 0
skbuff_fclone_cache 443 1088 512 32 4 : tunables 0 0 0 : slabdata 34 34 0
skbuff_head_cache 6957 7296 256 32 2 : tunables 0 0 0 : slabdata 228 228 0
configfs_dir_cache 42 42 96 42 1 : tunables 0 0 0 : slabdata 1 1 0
file_lock_cache 296 296 216 37 2 : tunables 0 0 0 : slabdata 8 8 0
fsnotify_mark_connector 1140 1920 32 128 1 : tunables 0 0 0 : slabdata 15 15 0
net_namespace 72 72 4864 6 8 : tunables 0 0 0 : slabdata 12 12 0
task_delay_info 7789 8211 80 51 1 : tunables 0 0 0 : slabdata 161 161 0
taskstats 184 184 344 23 2 : tunables 0 0 0 : slabdata 8 8 0
proc_dir_entry 995 1050 192 21 1 : tunables 0 0 0 : slabdata 50 50 0
pde_opener 6222 6222 40 102 1 : tunables 0 0 0 : slabdata 61 61 0
proc_inode_cache 3954 4824 664 24 4 : tunables 0 0 0 : slabdata 201 201 0
bdev_cache 312 312 832 39 8 : tunables 0 0 0 : slabdata 8 8 0
shmem_inode_cache 3043 3519 704 23 4 : tunables 0 0 0 : slabdata 153 153 0
kernfs_node_cache 43077 43230 136 30 1 : tunables 0 0 0 : slabdata 1441 1441 0
mnt_cache 905 975 320 25 2 : tunables 0 0 0 : slabdata 39 39 0
filp 10426 14112 256 32 2 : tunables 0 0 0 : slabdata 441 441 0
inode_cache 18132 20682 592 27 4 : tunables 0 0 0 : slabdata 766 766 0
dentry 23852 38010 192 21 1 : tunables 0 0 0 : slabdata 1810 1810 0
names_cache 64 64 4096 8 8 : tunables 0 0 0 : slabdata 8 8 0
buffer_head 13867 16614 104 39 1 : tunables 0 0 0 : slabdata 426 426 0
uts_namespace 296 296 440 37 4 : tunables 0 0 0 : slabdata 8 8 0
nsproxy 29480 173521 56 73 1 : tunables 0 0 0 : slabdata 2377 2377 0
mm_struct 3930 3960 1088 30 8 : tunables 0 0 0 : slabdata 132 132 0
files_cache 2248 2277 704 23 4 : tunables 0 0 0 : slabdata 99 99 0
signal_cache 3760 3813 1088 30 8 : tunables 0 0 0 : slabdata 128 128 0
sighand_cache 1771 1831 2112 15 8 : tunables 0 0 0 : slabdata 123 123 0
task_struct 978 1073 7744 4 8 : tunables 0 0 0 : slabdata 269 269 0
cred_jar 14389 17472 192 21 1 : tunables 0 0 0 : slabdata 832 832 0
anon_vma_chain 17504 21696 64 64 1 : tunables 0 0 0 : slabdata 339 339 0
anon_vma 11414 13570 88 46 1 : tunables 0 0 0 : slabdata 295 295 0
pid 7451 7904 128 32 1 : tunables 0 0 0 : slabdata 247 247 0
Acpi-Operand 4758 4760 72 56 1 : tunables 0 0 0 : slabdata 85 85 0
Acpi-ParseExt 396 663 104 39 1 : tunables 0 0 0 : slabdata 17 17 0
Acpi-State 627 663 80 51 1 : tunables 0 0 0 : slabdata 13 13 0
Acpi-Namespace 12428 12444 40 102 1 : tunables 0 0 0 : slabdata 122 122 0
numa_policy 1360 1360 24 170 1 : tunables 0 0 0 : slabdata 8 8 0
ftrace_event_field 4590 4590 48 85 1 : tunables 0 0 0 : slabdata 54 54 0
pool_workqueue 1106 1376 256 32 2 : tunables 0 0 0 : slabdata 43 43 0
radix_tree_node 11160 15176 584 28 4 : tunables 0 0 0 : slabdata 542 542 0
task_group 207 225 640 25 4 : tunables 0 0 0 : slabdata 9 9 0
vmap_area 3472 5106 88 46 1 : tunables 0 0 0 : slabdata 111 111 0
dma-kmalloc-8k 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-4k 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-2k 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-1k 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
dma-kmalloc-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-128 0 0 128 32 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-64 0 0 64 64 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-192 0 0 192 21 1 : tunables 0 0 0 : slabdata 0 0 0
dma-kmalloc-96 0 0 96 42 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-8k 0 0 8192 4 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-4k 0 0 4096 8 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-2k 0 0 2048 16 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-1k 0 0 1024 32 8 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-512 0 0 512 32 4 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-256 0 0 256 32 2 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-192 462 462 192 21 1 : tunables 0 0 0 : slabdata 22 22 0
kmalloc-rcl-128 1344 1472 128 32 1 : tunables 0 0 0 : slabdata 46 46 0
kmalloc-rcl-96 2312 2646 96 42 1 : tunables 0 0 0 : slabdata 63 63 0
kmalloc-rcl-64 4662 5504 64 64 1 : tunables 0 0 0 : slabdata 86 86 0
kmalloc-rcl-32 0 0 32 128 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-16 0 0 16 256 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-rcl-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
kmalloc-8k 211275 211297 8192 4 8 : tunables 0 0 0 : slabdata 52987 52987 0
kmalloc-4k 998777 998849 4096 8 8 : tunables 0 0 0 : slabdata 432521 432521 0
kmalloc-2k 449013 449548 2048 16 8 : tunables 0 0 0 : slabdata 80413 80413 0
kmalloc-1k 7057 7748 1024 32 8 : tunables 0 0 0 : slabdata 243 243 0
kmalloc-512 10689 11360 512 32 4 : tunables 0 0 0 : slabdata 355 355 0
kmalloc-256 1928 2016 256 32 2 : tunables 0 0 0 : slabdata 63 63 0
kmalloc-192 6159 6888 192 21 1 : tunables 0 0 0 : slabdata 328 328 0
kmalloc-128 2026 2592 128 32 1 : tunables 0 0 0 : slabdata 81 81 0
kmalloc-96 4351 4704 96 42 1 : tunables 0 0 0 : slabdata 112 112 0
kmalloc-64 17837 20032 64 64 1 : tunables 0 0 0 : slabdata 313 313 0
kmalloc-32 30075 33536 32 128 1 : tunables 0 0 0 : slabdata 262 262 0
kmalloc-16 17834 19712 16 256 1 : tunables 0 0 0 : slabdata 77 77 0
kmalloc-8 18813 18944 8 512 1 : tunables 0 0 0 : slabdata 37 37 0
kmem_cache_node 1646 1728 64 64 1 : tunables 0 0 0 : slabdata 27 27 0
kmem_cache 1673 1764 448 36 4 : tunables 0 0 0 : slabdata 49 49 0
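A dump like the one above is easier to read when the caches are ranked by size. The following is a minimal sketch, assuming the slabinfo 2.1 column layout shown in the header line and approximating each cache's footprint as num_objs × objsize (which ignores per-slab overhead). The function name rank_slab_caches is made up for illustration; the sample lines are copied from the dump.

```python
def rank_slab_caches(slabinfo_text, top=5):
    """Return (cache name, approximate bytes) pairs, largest first."""
    usage = []
    for line in slabinfo_text.splitlines():
        # Skip the version banner, the column-header comment and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        num_objs = int(fields[2])  # <num_objs> column
        objsize = int(fields[3])   # <objsize> column, in bytes
        usage.append((name, num_objs * objsize))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:top]


# A few lines taken from the dump above.
sample = """slabinfo - version: 2.1
ext4_groupinfo_4k 141624 141624 144 28 1 : tunables 0 0 0 : slabdata 5058 5058 0
kmalloc-8k 211275 211297 8192 4 8 : tunables 0 0 0 : slabdata 52987 52987 0
kmalloc-4k 998777 998849 4096 8 8 : tunables 0 0 0 : slabdata 432521 432521 0
kmalloc-2k 449013 449548 2048 16 8 : tunables 0 0 0 : slabdata 80413 80413 0
"""

for cache, size in rank_slab_caches(sample):
    print(f"{cache:20s} {size / 2**20:10.1f} MiB")
```

On a live system the text would come from /proc/slabinfo (usually readable only by root). In this dump the kmalloc-4k, kmalloc-8k and kmalloc-2k caches clearly dominate.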
cat /proc/meminfo:
$ cat /proc/meminfo
MemTotal: 16337164 kB
MemFree: 657940 kB
MemAvailable: 588720 kB
Buffers: 26324 kB
Cached: 93976 kB
SwapCached: 20448 kB
Active: 71728 kB
Inactive: 95076 kB
Active(anon): 15712 kB
Inactive(anon): 31328 kB
Active(file): 56016 kB
Inactive(file): 63748 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 34038772 kB
SwapFree: 33819892 kB
Dirty: 492 kB
Writeback: 0 kB
AnonPages: 34396 kB
Mapped: 30472 kB
Shmem: 536 kB
KReclaimable: 75884 kB
Slab: 6767732 kB
SReclaimable: 75884 kB
SUnreclaim: 6691848 kB
KernelStack: 6552 kB
PageTables: 4796 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 42207352 kB
Committed_AS: 980000 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 40596 kB
VmallocChunk: 0 kB
Percpu: 4960 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 9117084 kB
DirectMap2M: 7600128 kB
DirectMap1G: 1048576 kB
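To tell "memory is just cache" apart from "memory is held by the kernel", the relevant meminfo fields can be compared against MemTotal. This is a rough sketch under the assumption that Buffers + Cached + SReclaimable approximates reclaimable cache; the function names are hypothetical, and the sample values are copied from the output above.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_breakdown(info):
    """Express cache, unreclaimable slab and available memory as % of total."""
    total = info["MemTotal"]
    cache = info["Buffers"] + info["Cached"] + info["SReclaimable"]
    return {
        "reclaimable_cache_pct": 100.0 * cache / total,
        "unreclaimable_slab_pct": 100.0 * info["SUnreclaim"] / total,
        "available_pct": 100.0 * info["MemAvailable"] / total,
    }


# Values taken from the meminfo output above.
sample = """MemTotal: 16337164 kB
MemAvailable: 588720 kB
Buffers: 26324 kB
Cached: 93976 kB
SReclaimable: 75884 kB
SUnreclaim: 6691848 kB
"""

for name, pct in memory_breakdown(parse_meminfo(sample)).items():
    print(f"{name:24s} {pct:5.1f} %")
```

A small cache share together with a large unreclaimable slab share is what separates this situation from the usual "Linux ate my RAM" confusion.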
Last edited by Wild Penguin (2019-10-05 11:06:17)
Ok, time for a recap - in case someone is interested or in the same boat.
First, a nice (or not so nice) picture:
Here one can see quite a peculiar memory usage rise pattern. I bet no one has ever seen stairs like that??? Well, me neither. To be honest, I just had a hunch and noticed the memory leak does *not* happen if I don't run Folding@Home. Here I've been running a script which toggles (pauses/resumes) folding every hour while the computer was idle. This explains the peculiar pattern. Also, the toggling is already happening at the beginning of the graph, but it seems folding does not always trigger a leak.
I can reproduce the memory leak consistently with folding, and it does not happen if I'm not folding.
About the "Missing get_user_page_done" log snippets I posted above: they still happen, but I do not get a "hang" when they do; apparently they have no ill effect that I can notice as a user. It also seems they have nothing to do with the memory leak or the low memory (caused by the leak). However, I never got them before upgrading to 5.3.1 (they also occur on 5.3.5 and beyond). Also, the first hangs happened when memory was not low, and I haven't had any crashes of that kind since the first few, whereas the leak is consistent. EDIT: There also seem to be a few peculiar kernel errors in the logs, which can be found here (extracted with journalctl | sed -n -e '/cut\ here/,/end\ trace/ p').
So, to recap: a few hangs / near-lockups, "Missing get_user_page_done" warnings and a consistent memory leak with folding, all since upgrading to 5.3.1 (and beyond, so far). Of these, the hangs actually seem to be the least common issue, although they were the reason I posted in the first place.
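A staircase like this can also be captured in numbers by sampling SUnreclaim from /proc/meminfo at regular intervals and flagging the jumps. This is only a sketch, not the script that produced the graph; the function names and the 100000 kB step threshold are assumptions chosen for illustration.

```python
import re
import time


def sunreclaim_kb(meminfo_text):
    """Extract the SUnreclaim value (in kB) from /proc/meminfo text."""
    match = re.search(r"^SUnreclaim:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("SUnreclaim not found")
    return int(match.group(1))


def growth_steps(samples_kb, threshold_kb=100_000):
    """Indices of samples that grew by more than threshold_kb over the previous one."""
    return [
        i
        for i in range(1, len(samples_kb))
        if samples_kb[i] - samples_kb[i - 1] > threshold_kb
    ]


def sample(count, interval_s, path="/proc/meminfo"):
    """Read SUnreclaim `count` times, sleeping `interval_s` seconds between reads."""
    readings = []
    for _ in range(count):
        with open(path) as f:
            readings.append(sunreclaim_kb(f.read()))
        time.sleep(interval_s)
    return readings
```

Comparing the step indices from growth_steps(sample(...)) with the times at which folding was resumed would show whether each stair coincides with a folding period.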
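The sed range expression in the EDIT prints every line from one containing "cut here" through the next one containing "end trace". A rough Python equivalent that also groups the lines into separate blocks could look like this; the function name extract_traces is made up, and the sample lines are shortened from the first log in this thread.

```python
def extract_traces(lines):
    """Group lines into blocks running from 'cut here' to 'end trace' inclusive."""
    blocks = []
    current = None
    for line in lines:
        if current is None:
            if "cut here" in line:
                current = [line]
        else:
            current.append(line)
            if "end trace" in line:
                blocks.append(current)
                current = None
    if current:
        # Like sed, keep a range that never reached its end marker.
        blocks.append(current)
    return blocks


log = [
    "kernel: some unrelated message",
    "kernel: ------------[ cut here ]------------",
    "kernel: Missing get_user_page_done",
    "kernel: ---[ end trace db7062d8128318e3 ]---",
    "kernel: another unrelated message",
]

for block in extract_traces(log):
    print(f"{len(block)} lines, first message: {block[1]}")
```

Having one list per trace makes it simple to count the warnings or to compare the message line of each.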
I've opened a bug report at kernel.org bugzilla.
Last edited by Wild Penguin (2019-10-09 16:30:26)
Is the issue still present in linux 5.4-rc2 or linux-amd-staging-drm-next-git?
AMDGPU bugs should be filed at https://bugs.freedesktop.org/ product DRI component DRM/AMDgpu
Hi loqs,
Yes, it seems the bug manifests also on linux-amd-staging-drm-next. I'm not at the OOM stage yet, but definitely something has eaten ~8GB of RAM while I was away for ~8-10 hours (and left the computer folding during that time).
Also, got a single kernel error in the log:
loka 09 17:48:23 ArkkiVille kernel: ---[ end trace 29917bfeb29e58ed ]---
loka 10 10:34:28 ArkkiVille kernel: ------------[ cut here ]------------
loka 10 10:34:28 ArkkiVille kernel: Missing get_user_page_done
loka 10 10:34:28 ArkkiVille kernel: WARNING: CPU: 0 PID: 44225 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:1009 amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: Modules linked in: vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) vhba(OE) rfcomm ip6table_filter ip6_tables iptable_filter cmac algif_hash algif_skcipher af_alg bnep nct6775 hwmon_vid uinput msr fuse nls_iso8859_1 nls_cp437 vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp tda18271c2dd coretemp amdgpu kvm_intel kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio amd_iommu_v2 snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul snd_hda_intel snd_hda_codec btusb snd_hda_core btrtl gpu_sched btbcm snd_hwdep btintel i2c_algo_bit ghash_clmulni_intel bluetooth ttm snd_pcm wmi_bmof mxm_wmi drxk aesni_intel mei_hdcp snd_timer drm_kms_helper gpio_ich iTCO_wdt aes_x86_64 ddbridge iTCO_vendor_support ecdh_generic drm crypto_simd rfkill snd cryptd glue_helper agpgart intel_cstate dvb_core ecc soundcore input_leds mousedev joydev videobuf2_vmalloc videobuf2_memops videobuf2_common syscopyarea sysfillrect sysimgblt fb_sys_fops
loka 10 10:34:28 ArkkiVille kernel: intel_uncore videodev intel_rapl_perf mc i2c_i801 e1000e lpc_ich evdev mei_me mei mac_hid wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc overlay sg crypto_user ip_tables x_tables ext4 crc32c_generic hid_steam crc16 mbcache jbd2 sd_mod hid_logitech_dj sr_mod cdrom hid_generic usbhid hid ahci libahci libata xhci_pci crc32c_intel scsi_mod ehci_pci xhci_hcd ehci_hcd bcache crc64
loka 10 10:34:28 ArkkiVille kernel: CPU: 0 PID: 44225 Comm: FahCore_21 Tainted: G OE 5.3.0-rc3-amd-staging-drm-next-git-00852-g2d9d0e0dc643 #1
loka 10 10:34:28 ArkkiVille kernel: Hardware name: ASUS All Series/MAXIMUS VII GENE, BIOS 3503 04/18/2018
loka 10 10:34:28 ArkkiVille kernel: RIP: 0010:amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: Code: ef 48 39 c6 0f 85 f8 fe ff ff 80 3d 31 1b 41 00 00 0f 85 eb fe ff ff 48 c7 c7 a4 15 86 c1 c6 05 1d 1b 41 00 01 e8 2f 42 56 ee <0f> 0b e9 d1 fe ff ff 31 c0 eb cc 0f 1f 44 00 00 0f 1f 44 00 00 41
loka 10 10:34:28 ArkkiVille kernel: RSP: 0018:ffff9e800bf63978 EFLAGS: 00010286
loka 10 10:34:28 ArkkiVille kernel: RAX: 0000000000000000 RBX: ffff93d4c2b7c780 RCX: 0000000000000000
loka 10 10:34:28 ArkkiVille kernel: RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff
loka 10 10:34:28 ArkkiVille kernel: RBP: ffff93d4b3384f28 R08: 00000000000004db R09: 0000000000000001
loka 10 10:34:28 ArkkiVille kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
loka 10 10:34:28 ArkkiVille kernel: R13: ffff93d4c2b7c780 R14: ffff9e800bf63ac0 R15: ffff9e800bf63ac0
loka 10 10:34:28 ArkkiVille kernel: FS: 00007f0357f46700(0000) GS:ffff93d4cea00000(0000) knlGS:0000000000000000
loka 10 10:34:28 ArkkiVille kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
loka 10 10:34:28 ArkkiVille kernel: CR2: 00007f317d381000 CR3: 00000002d07d4004 CR4: 00000000001606f0
loka 10 10:34:28 ArkkiVille kernel: Call Trace:
loka 10 10:34:28 ArkkiVille kernel: ttm_tt_unbind+0x1d/0x30 [ttm]
loka 10 10:34:28 ArkkiVille kernel: ttm_bo_move_ttm+0x46/0x150 [ttm]
loka 10 10:34:28 ArkkiVille kernel: ttm_bo_handle_move_mem+0x583/0x5a0 [ttm]
loka 10 10:34:28 ArkkiVille kernel: ttm_bo_validate+0x13d/0x150 [ttm]
loka 10 10:34:28 ArkkiVille kernel: amdgpu_cs_list_validate+0x8e/0x150 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: amdgpu_cs_ioctl+0xd68/0x1d20 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: drm_ioctl_kernel+0xb3/0x100 [drm]
loka 10 10:34:28 ArkkiVille kernel: drm_ioctl+0x23d/0x3d0 [drm]
loka 10 10:34:28 ArkkiVille kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
loka 10 10:34:28 ArkkiVille kernel: do_vfs_ioctl+0x43d/0x6c0
loka 10 10:34:28 ArkkiVille kernel: ? syscall_trace_enter+0x1f2/0x2e0
loka 10 10:34:28 ArkkiVille kernel: ksys_ioctl+0x5e/0x90
loka 10 10:34:28 ArkkiVille kernel: __x64_sys_ioctl+0x16/0x20
loka 10 10:34:28 ArkkiVille kernel: do_syscall_64+0x5f/0x1c0
loka 10 10:34:28 ArkkiVille kernel: ? prepare_exit_to_usermode+0x85/0xb0
loka 10 10:34:28 ArkkiVille kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
loka 10 10:34:28 ArkkiVille kernel: RIP: 0033:0x7f0357a6a25b
loka 10 10:34:28 ArkkiVille kernel: Code: 0f 1e fa 48 8b 05 25 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 9b 0c 00 f7 d8 64 89 01 48
loka 10 10:34:28 ArkkiVille kernel: RSP: 002b:00007f0357f45258 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
loka 10 10:34:28 ArkkiVille kernel: RAX: ffffffffffffffda RBX: 00007f0357f452c0 RCX: 00007f0357a6a25b
loka 10 10:34:28 ArkkiVille kernel: RDX: 00007f0357f452c0 RSI: 00000000c0186444 RDI: 0000000000000014
loka 10 10:34:28 ArkkiVille kernel: RBP: 00000000c0186444 R08: 00007f0357f45370 R09: 0000000000000010
loka 10 10:34:28 ArkkiVille kernel: R10: 00007f0357f45370 R11: 0000000000000246 R12: 00000000029504e0
loka 10 10:34:28 ArkkiVille kernel: R13: 0000000000000014 R14: 00007f0340185890 R15: 000000000410f080
loka 10 10:34:28 ArkkiVille kernel: ---[ end trace 2fceae28fe8e6841 ]---
But just to make sure, as I'm a bit confused: have I understood correctly that this PKGBUILD currently results in a 5.3.0-series kernel, but the amdgpu code is newer (and the pkgver reflects the target version of that code)?
$ uname -r
5.3.0-rc3-amd-staging-drm-next-git-00852-g2d9d0e0dc643
$ pacman -Qs linux-amd-staging-drm-next-git
local/linux-amd-staging-drm-next-git 5.4.858456.2d9d0e0dc643-1
The Linux-amd-staging-drm-next-git kernel and modules
local/linux-amd-staging-drm-next-git-headers 5.4.858456.2d9d0e0dc643-1
Header files and scripts for building modules for Linux-amd-staging-drm-next-git kernel
If this is not correct, then I'm a bit confused as to how one is supposed to use that PKGBUILD. Should I edit the PKGBUILD somehow, and/or the Makefile?
(I've read the comments by pam-mroku and yurikoles in AUR...)
Last edited by Wild Penguin (2019-10-10 14:56:08)
You are understanding the pkgver correctly.
Ok, one (or two) more questions: what would you recommend, should I file a bug report at freedesktop.org?
I was also thinking about doing a git bisect on the Linux kernel. I've done it before (~10 years ago; I don't remember what the bug was, but it was fixed and I helped with the bisect IIRC :-) ). In this case, however, the bug manifests only after a few hours, which makes things more time-consuming, unless the unreferenced objects in kmemleak are indicative of the bug and don't appear on a correctly working kernel. I believe those appeared as soon as I started folding (though I haven't confirmed this, nor tested on an older kernel yet). So, would a git bisect help?
In case you think a git bisect is a good idea, which would be more helpful: should I use the main kernel tree or https://cgit.freedesktop.org/~agd5f/linux/ for the bisect? I believe I could use either one: from https://cgit.freedesktop.org/~agd5f/linux/refs/tags I presume 5.2 should work, and in the main kernel tree I'd start somewhere before 5.3.1 ...
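To put the time cost into perspective: a bisect halves the candidate range with every test, so the number of kernels to build and test grows only with the base-2 logarithm of the commit count. The sketch below works that out; the commit count and the hours per test are hypothetical numbers chosen for illustration, not measurements from this thread.

```python
def bisect_steps(commit_count):
    """Roughly how many test rounds a bisect over `commit_count` commits needs."""
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    # ceil(log2(n)) computed with integers to avoid float rounding.
    return (commit_count - 1).bit_length()


def bisect_hours(commit_count, hours_per_test):
    """Total testing time, ignoring build time and skipped commits."""
    return bisect_steps(commit_count) * hours_per_test


# Hypothetical: 10000 commits between the last good and first bad kernel.
print(bisect_steps(10_000))       # 14
print(bisect_hours(10_000, 8))    # 112, if each test needs ~8 hours of folding
print(bisect_hours(10_000, 0.5))  # 7.0, if kmemleak shows the bug within ~30 minutes
```

This is why a quick indicator such as the kmemleak report matters so much here: it changes the bisect from days of waiting to something that fits into an evening or two.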
Hi!
For some reason the leaks just don't happen anymore. I was planning on doing a bisect, but never got the time to do it, and it seems whatever it was is now fixed.
Perhaps there were some problems (caused by some work / refactoring being done in the driver???), but they were fixed in a more recent kernel (currently running 5.3.8).
Marking as [SOLVED]! (I will reopen if the leaks resurface and I see additional discussion here as appropriate; I will also update / close the kernel bugzilla report soon, but will wait a few days to make sure, so I don't generate unnecessary noise there because of reopens.)
Last edited by Wild Penguin (2019-11-10 17:49:01)