#1 2019-10-04 14:37:54

Wild Penguin
Member
Registered: 2015-03-19
Posts: 320

[SOLVED] AMDGPU Kernel module hangs when idle since (2019-09-27)

Hi!

Seemingly randomly, but always when I've left my computer idling (never when it is actually being used), I've had some AMDGPU-related hangs since last Friday.

I checked, and it seems the problems started after upgrading to kernel 5.3.1 (zen branch). Is anyone else seeing similar problems?

Most often there are kernel messages similar to this:

syys 30 22:33:08 ArkkiVille kernel: ------------[ cut here ]------------
syys 30 22:33:08 ArkkiVille kernel: Missing get_user_page_done
syys 30 22:33:08 ArkkiVille kernel: WARNING: CPU: 3 PID: 2760 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:992 amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: Modules linked in: rfcomm ip6table_filter ip6_tables iptable_filter cmac algif_hash algif_skcipher af_alg bnep msr uinput nct6775 hwmon_vid intel_rapl_msr intel_rapl_common fuse x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nls_iso8859_1 irqbypass nls_cp437 vfat fat crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_uncore intel_rapl_perf btusb btrtl btbcm btintel hid_steam wmi_bmof bluetooth snd_hda_codec_realtek snd_hda_codec_generic ecdh_generic ledtrig_audio rfkill tda18271c2dd snd_hda_codec_hdmi snd_hda_intel snd_hda_codec input_leds snd_hda_core joydev mousedev snd_hwdep ecc snd_pcm snd_timer drxk snd mei_hdcp iTCO_wdt mxm_wmi iTCO_vendor_support gpio_ich soundcore ddbridge evdev mei_me dvb_core mei mac_hid videobuf2_vmalloc videobuf2_memops videobuf2_common videodev mc e1000e lpc_ich i2c_i801 wmi vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) overlay sg nfsd auth_rpcgss vhba
syys 30 22:33:08 ArkkiVille kernel:  nfs_acl lockd grace sunrpc crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 sr_mod cdrom hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid sd_mod ahci libahci libata xhci_pci crc32c_intel scsi_mod xhci_hcd ehci_pci ehci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart bcache crc64
syys 30 22:33:08 ArkkiVille kernel: CPU: 3 PID: 2760 Comm: FahCore_21 Tainted: G           OE     5.3.1-zen1-1-zen #1
syys 30 22:33:08 ArkkiVille kernel: Hardware name: ASUS All Series/MAXIMUS VII GENE, BIOS 3503 04/18/2018
syys 30 22:33:08 ArkkiVille kernel: RIP: 0010:amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel: Code: d7 48 39 c6 0f 85 f8 fe ff ff 80 3d b1 44 43 00 00 0f 85 eb fe ff ff 48 c7 c7 43 68 68 c0 c6 05 9d 44 43 00 01 e8 8f 0e d8 d5 <0f> 0b e9 d1 fe ff ff 31 c0 eb cc 0f 1f 44 00 00 0f 1f 44 00 00 41
syys 30 22:33:08 ArkkiVille kernel: RSP: 0018:ffff983ecb127980 EFLAGS: 00010282
syys 30 22:33:08 ArkkiVille kernel: RAX: 0000000000000000 RBX: ffff909df6c21c00 RCX: 0000000000000000
syys 30 22:33:08 ArkkiVille kernel: RDX: 0000000000000001 RSI: 0000000000000096 RDI: 00000000ffffffff
syys 30 22:33:08 ArkkiVille kernel: RBP: ffff909ec6784088 R08: 00000000000004e8 R09: 0000000000000001
syys 30 22:33:08 ArkkiVille kernel: R10: 0000000000000001 R11: 0000000000002f5c R12: 0000000000000000
syys 30 22:33:08 ArkkiVille kernel: R13: ffff909df6c21c00 R14: ffff983ecb127ac0 R15: 0000000000000000
syys 30 22:33:08 ArkkiVille kernel: FS:  00007fdfa0283700(0000) GS:ffff909eceac0000(0000) knlGS:0000000000000000
syys 30 22:33:08 ArkkiVille kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
syys 30 22:33:08 ArkkiVille kernel: CR2: 00007f1324ec9ffc CR3: 0000000346e38005 CR4: 00000000001606e0
syys 30 22:33:08 ArkkiVille kernel: Call Trace:
syys 30 22:33:08 ArkkiVille kernel:  ttm_tt_unbind+0x1d/0x30 [ttm]
syys 30 22:33:08 ArkkiVille kernel:  ttm_bo_move_ttm+0x43/0x120 [ttm]
syys 30 22:33:08 ArkkiVille kernel:  ttm_bo_handle_move_mem+0x55a/0x580 [ttm]
syys 30 22:33:08 ArkkiVille kernel:  ttm_bo_validate+0x26b/0x2d0 [ttm]
syys 30 22:33:08 ArkkiVille kernel:  amdgpu_cs_list_validate+0xd9/0x180 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel:  amdgpu_cs_ioctl+0xc83/0x1e50 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
syys 30 22:33:08 ArkkiVille kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel:  drm_ioctl_kernel+0xb8/0x100 [drm]
syys 30 22:33:08 ArkkiVille kernel:  drm_ioctl+0x253/0x3f0 [drm]
syys 30 22:33:08 ArkkiVille kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
syys 30 22:33:08 ArkkiVille kernel:  do_vfs_ioctl+0x43d/0x7a0
syys 30 22:33:08 ArkkiVille kernel:  ? syscall_trace_enter+0x1f2/0x2d0
syys 30 22:33:08 ArkkiVille kernel:  __x64_sys_ioctl+0x62/0x90
syys 30 22:33:08 ArkkiVille kernel:  do_syscall_64+0x5f/0x1c0
syys 30 22:33:08 ArkkiVille kernel:  ? prepare_exit_to_usermode+0x85/0xb0
syys 30 22:33:08 ArkkiVille kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
syys 30 22:33:08 ArkkiVille kernel: RIP: 0033:0x7fdf9fda921b
syys 30 22:33:08 ArkkiVille kernel: Code: 0f 1e fa 48 8b 05 75 8c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 8c 0c 00 f7 d8 64 89 01 48
syys 30 22:33:08 ArkkiVille kernel: RSP: 002b:00007fdfa0282598 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
syys 30 22:33:08 ArkkiVille kernel: RAX: ffffffffffffffda RBX: 00007fdfa0282600 RCX: 00007fdf9fda921b
syys 30 22:33:08 ArkkiVille kernel: RDX: 00007fdfa0282600 RSI: 00000000c0186444 RDI: 0000000000000010
syys 30 22:33:08 ArkkiVille kernel: RBP: 00000000c0186444 R08: 00007fdfa02826b0 R09: 0000000000000010
syys 30 22:33:08 ArkkiVille kernel: R10: 00007fdfa02826b0 R11: 0000000000000246 R12: 00000000011a6a20
syys 30 22:33:08 ArkkiVille kernel: R13: 0000000000000010 R14: 00007fdf88242630 R15: 000000002c5551e0
syys 30 22:33:08 ArkkiVille kernel: ---[ end trace db7062d8128318e3 ]---

These are present every time the hang happens, except once (I've had around five or six hangs so far).

The hang: the screen stays black and cannot wake up. The UI is otherwise "hung" too: I cannot do anything via keyboard or mouse, not even blindly (switching to a VT and logging in does not work, and neither does rebooting via CTRL+ALT+DEL), and there is no reaction from the Caps/Scroll/Num Lock LEDs. (I'm running ckb-next, which can sometimes block keys if the system is hung in a way that prevents the ckb-next process from executing properly. However, I can still see some color LED animation on the keyboard, so ckb-next seems to be running properly and should not be blocking input.)

I can SSH into the computer and shut it down (and do other things, too), but the shutdown takes ages (around 15 minutes). The log shows a delay of about that length between "Journal stopped" and the actual reboot. I'm not sure what is going on in the meantime, since logging has stopped by then, and the monitor stays black, so I cannot see whether there are any kernel messages! Most times I've synced and rebooted via SysRq instead (more specifically, SysRQ+SSSSSSREISU [wait...] B), since I didn't want to wait for the actual reboot, but once I waited, since I figured it would reboot eventually ... and it did.

One thing worth pointing out is that I'm folding in the background all the time (unless I'm doing something else that is GPU-intensive, in which case I pause folding temporarily so it does not disrupt my gaming session, er, business session)!

Thanks for any input!

EDIT: OOPS! Yet again I may have chosen the wrong section. Kernel&HW would have been more appropriate!

Last edited by Wild Penguin (2019-11-10 17:45:23)


#2 2019-10-05 11:04:57

Wild Penguin
Member
Registered: 2015-03-19
Posts: 320

Re: [SOLVED] AMDGPU Kernel module hangs when idle since (2019-09-27)

Ok, this is weird. Problems might not be with amdgpu after all. Instead there seems to be a kernel memory leak!

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi        14Gi       567Mi       0,0Ki       276Mi       542Mi
Swap:          32Gi       215Mi        32Gi

I can see these problems started around last Friday (after the kernel upgrade). I also run earlyoom (which I installed so that Firefox and certain beta software I've been playing with won't occasionally grind the whole system to a near-halt). Thanks to earlyoom, I can see from the logs that it tries to kill processes when system memory runs low, and from its status output I can see that something is slowly eating RAM! However, when earlyoom tries to kill foldingathome, I believe this causes the amdgpu driver to crash (I need to check previous logs to be sure). Meanwhile, top (or htop) shows no process that is eating this memory! [EDIT]: Also, earlyoom is killing the wrong culprits and solving nothing - after it kills processes, memory usage is still high!

Now I'm trying to:

  1. find out via Google how to diagnose low-memory issues that are not caused by a user-space process, and

  2. learn how to use /proc/slabinfo and slabtop.

It is difficult to find information via Google about low-memory issues, since most hits point to newbie(ish) users who mistake cache/buffer usage for low memory (which is not the case here; memory is actually low). There is probably useful information out there, but it is difficult to find among those hits.

My plan: switch back to the mainline (non-zen) kernel. Zen has debugging features, such as kernel memleak testing, disabled (to increase responsiveness for desktop users). I'll see whether the memory leak goes away after the switch; if it does, I'll know the zen kernel is the actual culprit. If not, I at least have something new to learn (how to diagnose actual kernel memory leaks)...

This is just FYI in case someone else is having a similar problem.

Also, any input on how to proceed to find the actual culprit is welcome.

Also, I realised I put rather little information in my OP, so here is some:

$ sudo inxi -mFxz
System:    Host: ArkkiVille Kernel: 5.3.1-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 9.1.0 Console: tty 4 Distro: Arch Linux 
Machine:   Type: Desktop System: ASUS product: All Series v: N/A serial: N/A 
           Mobo: ASUSTeK model: MAXIMUS VII GENE v: Rev 1.xx serial: <filter> UEFI: American Megatrends v: 3503 
           date: 04/18/2018 
Memory:    RAM: total: 15.58 GiB used: 14.86 GiB (95.4%) 
           Array-1: capacity: 32 GiB slots: 4 EC: None max module size: 8 GiB note: est. 
           Device-1: DIMM_A1 size: No Module Installed 
           Device-2: DIMM_A2 size: 8 GiB speed: 1333 MT/s type: DDR3 
           Device-3: DIMM_B1 size: No Module Installed 
           Device-4: DIMM_B2 size: 8 GiB speed: 1333 MT/s type: DDR3 
CPU:       Topology: Quad Core model: Intel Core i7-4790K bits: 64 type: MT MCP arch: Haswell rev: 3 L2 cache: 8192 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 64111 
           Speed: 4008 MHz min/max: 800/4400 MHz Core speeds (MHz): 1: 4008 2: 4008 3: 4009 4: 4008 5: 4012 6: 4007 7: 4009 
           8: 4013 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel 
           bus ID: 03:00.0 
           Display: server: X.org 1.20.5 driver: amdgpu unloaded: modesetting tty: 157x50 
           Message: Advanced graphics data unavailable in console for root. 
Audio:     Device-1: Intel 9 Series Family HD Audio vendor: ASUSTeK driver: snd_hda_intel v: kernel bus ID: 00:1b.0 
           Device-2: Advanced Micro Devices [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] driver: snd_hda_intel v: kernel 
           bus ID: 03:00.1 
           Device-3: Digital Devices Octopus DVB Adapter driver: ddbridge v: 0.9.33-integrated bus ID: 05:00.0 
           Sound Server: ALSA v: k5.3.1-zen1-1-zen 
Network:   Device-1: Intel Ethernet I218-V vendor: ASUSTeK driver: e1000e v: 3.2.6-k port: f040 bus ID: 00:19.0 
           IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter> 
Drives:    Local Storage: total: 17.74 TiB used: 14.32 TiB (80.7%) 
           ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 960 EVO 500GB size: 465.76 GiB 
           ID-2: /dev/sda vendor: Western Digital model: WD30EFRX-68AX9N0 size: 2.73 TiB temp: 38 C 
           ID-3: /dev/sdb vendor: Western Digital model: WD40EFRX-68WT0N0 size: 3.64 TiB 
           ID-4: /dev/sdc vendor: Western Digital model: WD80EFAX-68KNBN0 size: 7.28 TiB temp: 50 C 
           ID-5: /dev/sdd vendor: Western Digital model: WD40EFRX-68WT0N0 size: 3.64 TiB temp: 40 C 
Partition: ID-1: / size: 3.65 TiB used: 632.71 GiB (16.9%) fs: ext4 dev: /dev/bcache2 
           ID-2: /boot size: 547.9 MiB used: 121.1 MiB (22.1%) fs: vfat dev: /dev/nvme0n1p1 
           ID-3: swap-1 size: 32.46 GiB used: 212.2 MiB (0.6%) fs: swap dev: /dev/nvme0n1p2 
Sensors:   System Temperatures: cpu: 38.0 C mobo: 37.0 C gpu: amdgpu temp: 29 C 
           Fan Speeds (RPM): N/A gpu: amdgpu fan: 0 
           Voltages: 12v: 12.10 5v: 5.04 3.3v: 3.31 vbat: 3.28 
Info:      Processes: 298 Uptime: 22h 58m Init: systemd Compilers: gcc: 9.1.0 clang: 8.0.1 Shell: bash v: 5.0.11 inxi: 3.0.36 

cat /proc/slabinfo (while mem usage is high):

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
fuse_inode          1053   1209    832   39    8 : tunables    0    0    0 : slabdata     31     31      0
fat_inode_cache       88     88    728   22    4 : tunables    0    0    0 : slabdata      4      4      0
fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
kvm_async_pf           0      0    136   30    1 : tunables    0    0    0 : slabdata      0      0      0
kvm_vcpu               0      0  16256    2    8 : tunables    0    0    0 : slabdata      0      0      0
kvm_mmu_page_header      0      0    160   25    1 : tunables    0    0    0 : slabdata      0      0      0
x86_fpu                0      0   4160    7    8 : tunables    0    0    0 : slabdata      0      0      0
ovl_inode              0      0    672   24    4 : tunables    0    0    0 : slabdata      0      0      0
nfs4_layout_stateid      0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
nfsd4_delegations      0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
nfsd4_stateids         0      0    168   24    1 : tunables    0    0    0 : slabdata      0      0      0
nfsd4_files            0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
nfsd4_lockowners     160    180    392   20    2 : tunables    0    0    0 : slabdata      9      9      0
nfsd4_openowners       0      0    432   37    4 : tunables    0    0    0 : slabdata      0      0      0
nfsd4_clients          0      0   1296   25    8 : tunables    0    0    0 : slabdata      0      0      0
rpc_inode_cache       51     75    640   25    4 : tunables    0    0    0 : slabdata      3      3      0
ext4_groupinfo_4k 141624 141624    144   28    1 : tunables    0    0    0 : slabdata   5058   5058      0
ext4_inode_cache    7526  12492   1080   30    8 : tunables    0    0    0 : slabdata    429    429      0
ext4_allocation_context    256    256    128   32    1 : tunables    0    0    0 : slabdata      8      8      0
ext4_extent_status    862   1224     40  102    1 : tunables    0    0    0 : slabdata     12     12      0
mbcache              584    876     56   73    1 : tunables    0    0    0 : slabdata     12     12      0
jbd2_journal_head    272    408    120   34    1 : tunables    0    0    0 : slabdata     12     12      0
jbd2_revoke_table_s   1024   1024     16  256    1 : tunables    0    0    0 : slabdata      4      4      0
jbd2_revoke_record_s   1024   1536     32  128    1 : tunables    0    0    0 : slabdata     12     12      0
scsi_sense_cache     736    736    128   32    1 : tunables    0    0    0 : slabdata     23     23      0
search               248    351    592   27    4 : tunables    0    0    0 : slabdata     13     13      0
ip6-frags              0      0    184   22    1 : tunables    0    0    0 : slabdata      0      0      0
PINGv6                 0      0   1216   26    8 : tunables    0    0    0 : slabdata      0      0      0
RAWv6                390    442   1216   26    8 : tunables    0    0    0 : slabdata     17     17      0
UDPv6               1251   1275   1280   25    8 : tunables    0    0    0 : slabdata     51     51      0
tw_sock_TCPv6        272    272    240   34    2 : tunables    0    0    0 : slabdata      8      8      0
request_sock_TCPv6    208    208    304   26    2 : tunables    0    0    0 : slabdata      8      8      0
TCPv6                482    520   2368   13    8 : tunables    0    0    0 : slabdata     40     40      0
bfq_io_cq            508    900    160   25    1 : tunables    0    0    0 : slabdata     36     36      0
mqueue_inode_cache    340    340    960   34    8 : tunables    0    0    0 : slabdata     10     10      0
fscrypt_info         512    768     64   64    1 : tunables    0    0    0 : slabdata     12     12      0
fscrypt_ctx          680    680     48   85    1 : tunables    0    0    0 : slabdata      8      8      0
dnotify_struct         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
pid_namespace        195    195    208   39    2 : tunables    0    0    0 : slabdata      5      5      0
posix_timers_cache      0      0    240   34    2 : tunables    0    0    0 : slabdata      0      0      0
UNIX                6082   6244   1024   32    8 : tunables    0    0    0 : slabdata    196    196      0
ip4-frags             20     20    200   20    1 : tunables    0    0    0 : slabdata      1      1      0
xfrm_state             0      0    704   23    4 : tunables    0    0    0 : slabdata      0      0      0
PING                   0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
RAW                  480    544   1024   32    8 : tunables    0    0    0 : slabdata     17     17      0
tw_sock_TCP          272    340    240   34    2 : tunables    0    0    0 : slabdata     10     10      0
request_sock_TCP     208    208    304   26    2 : tunables    0    0    0 : slabdata      8      8      0
TCP                  478    574   2240   14    8 : tunables    0    0    0 : slabdata     41     41      0
hugetlbfs_inode_cache     52     52    616   26    4 : tunables    0    0    0 : slabdata      2      2      0
dquot                256    256    256   32    2 : tunables    0    0    0 : slabdata      8      8      0
eventpoll_pwq       3920   4144     72   56    1 : tunables    0    0    0 : slabdata     74     74      0
dax_cache             21     21    768   21    4 : tunables    0    0    0 : slabdata      1      1      0
request_queue        105    105   2088   15    8 : tunables    0    0    0 : slabdata      7      7      0
biovec-max           106    129   4096    8    8 : tunables    0    0    0 : slabdata     17     17      0
biovec-128           136    144   2048   16    8 : tunables    0    0    0 : slabdata      9      9      0
biovec-64            256    256   1024   32    8 : tunables    0    0    0 : slabdata      8      8      0
user_namespace       210    210    536   30    4 : tunables    0    0    0 : slabdata      7      7      0
dmaengine-unmap-256     15     15   2112   15    8 : tunables    0    0    0 : slabdata      1      1      0
dmaengine-unmap-128     30     30   1088   30    8 : tunables    0    0    0 : slabdata      1      1      0
dmaengine-unmap-16    301    420    192   21    1 : tunables    0    0    0 : slabdata     20     20      0
dmaengine-unmap-2 178466 247744     64   64    1 : tunables    0    0    0 : slabdata   3871   3871      0
sock_inode_cache    8126   8350    832   39    8 : tunables    0    0    0 : slabdata    215    215      0
skbuff_ext_cache     993   1216    128   32    1 : tunables    0    0    0 : slabdata     38     38      0
skbuff_fclone_cache    443   1088    512   32    4 : tunables    0    0    0 : slabdata     34     34      0
skbuff_head_cache   6957   7296    256   32    2 : tunables    0    0    0 : slabdata    228    228      0
configfs_dir_cache     42     42     96   42    1 : tunables    0    0    0 : slabdata      1      1      0
file_lock_cache      296    296    216   37    2 : tunables    0    0    0 : slabdata      8      8      0
fsnotify_mark_connector   1140   1920     32  128    1 : tunables    0    0    0 : slabdata     15     15      0
net_namespace         72     72   4864    6    8 : tunables    0    0    0 : slabdata     12     12      0
task_delay_info     7789   8211     80   51    1 : tunables    0    0    0 : slabdata    161    161      0
taskstats            184    184    344   23    2 : tunables    0    0    0 : slabdata      8      8      0
proc_dir_entry       995   1050    192   21    1 : tunables    0    0    0 : slabdata     50     50      0
pde_opener          6222   6222     40  102    1 : tunables    0    0    0 : slabdata     61     61      0
proc_inode_cache    3954   4824    664   24    4 : tunables    0    0    0 : slabdata    201    201      0
bdev_cache           312    312    832   39    8 : tunables    0    0    0 : slabdata      8      8      0
shmem_inode_cache   3043   3519    704   23    4 : tunables    0    0    0 : slabdata    153    153      0
kernfs_node_cache  43077  43230    136   30    1 : tunables    0    0    0 : slabdata   1441   1441      0
mnt_cache            905    975    320   25    2 : tunables    0    0    0 : slabdata     39     39      0
filp               10426  14112    256   32    2 : tunables    0    0    0 : slabdata    441    441      0
inode_cache        18132  20682    592   27    4 : tunables    0    0    0 : slabdata    766    766      0
dentry             23852  38010    192   21    1 : tunables    0    0    0 : slabdata   1810   1810      0
names_cache           64     64   4096    8    8 : tunables    0    0    0 : slabdata      8      8      0
buffer_head        13867  16614    104   39    1 : tunables    0    0    0 : slabdata    426    426      0
uts_namespace        296    296    440   37    4 : tunables    0    0    0 : slabdata      8      8      0
nsproxy            29480 173521     56   73    1 : tunables    0    0    0 : slabdata   2377   2377      0
mm_struct           3930   3960   1088   30    8 : tunables    0    0    0 : slabdata    132    132      0
files_cache         2248   2277    704   23    4 : tunables    0    0    0 : slabdata     99     99      0
signal_cache        3760   3813   1088   30    8 : tunables    0    0    0 : slabdata    128    128      0
sighand_cache       1771   1831   2112   15    8 : tunables    0    0    0 : slabdata    123    123      0
task_struct          978   1073   7744    4    8 : tunables    0    0    0 : slabdata    269    269      0
cred_jar           14389  17472    192   21    1 : tunables    0    0    0 : slabdata    832    832      0
anon_vma_chain     17504  21696     64   64    1 : tunables    0    0    0 : slabdata    339    339      0
anon_vma           11414  13570     88   46    1 : tunables    0    0    0 : slabdata    295    295      0
pid                 7451   7904    128   32    1 : tunables    0    0    0 : slabdata    247    247      0
Acpi-Operand        4758   4760     72   56    1 : tunables    0    0    0 : slabdata     85     85      0
Acpi-ParseExt        396    663    104   39    1 : tunables    0    0    0 : slabdata     17     17      0
Acpi-State           627    663     80   51    1 : tunables    0    0    0 : slabdata     13     13      0
Acpi-Namespace     12428  12444     40  102    1 : tunables    0    0    0 : slabdata    122    122      0
numa_policy         1360   1360     24  170    1 : tunables    0    0    0 : slabdata      8      8      0
ftrace_event_field   4590   4590     48   85    1 : tunables    0    0    0 : slabdata     54     54      0
pool_workqueue      1106   1376    256   32    2 : tunables    0    0    0 : slabdata     43     43      0
radix_tree_node    11160  15176    584   28    4 : tunables    0    0    0 : slabdata    542    542      0
task_group           207    225    640   25    4 : tunables    0    0    0 : slabdata      9      9      0
vmap_area           3472   5106     88   46    1 : tunables    0    0    0 : slabdata    111    111      0
dma-kmalloc-8k         0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4k         0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2k         0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1k         0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512       32     32    512   32    4 : tunables    0    0    0 : slabdata      1      1      0
dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-8k         0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-4k         0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-2k         0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-1k         0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-192      462    462    192   21    1 : tunables    0    0    0 : slabdata     22     22      0
kmalloc-rcl-128     1344   1472    128   32    1 : tunables    0    0    0 : slabdata     46     46      0
kmalloc-rcl-96      2312   2646     96   42    1 : tunables    0    0    0 : slabdata     63     63      0
kmalloc-rcl-64      4662   5504     64   64    1 : tunables    0    0    0 : slabdata     86     86      0
kmalloc-rcl-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-rcl-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8k        211275 211297   8192    4    8 : tunables    0    0    0 : slabdata  52987  52987      0
kmalloc-4k        998777 998849   4096    8    8 : tunables    0    0    0 : slabdata 432521 432521      0
kmalloc-2k        449013 449548   2048   16    8 : tunables    0    0    0 : slabdata  80413  80413      0
kmalloc-1k          7057   7748   1024   32    8 : tunables    0    0    0 : slabdata    243    243      0
kmalloc-512        10689  11360    512   32    4 : tunables    0    0    0 : slabdata    355    355      0
kmalloc-256         1928   2016    256   32    2 : tunables    0    0    0 : slabdata     63     63      0
kmalloc-192         6159   6888    192   21    1 : tunables    0    0    0 : slabdata    328    328      0
kmalloc-128         2026   2592    128   32    1 : tunables    0    0    0 : slabdata     81     81      0
kmalloc-96          4351   4704     96   42    1 : tunables    0    0    0 : slabdata    112    112      0
kmalloc-64         17837  20032     64   64    1 : tunables    0    0    0 : slabdata    313    313      0
kmalloc-32         30075  33536     32  128    1 : tunables    0    0    0 : slabdata    262    262      0
kmalloc-16         17834  19712     16  256    1 : tunables    0    0    0 : slabdata     77     77      0
kmalloc-8          18813  18944      8  512    1 : tunables    0    0    0 : slabdata     37     37      0
kmem_cache_node     1646   1728     64   64    1 : tunables    0    0    0 : slabdata     27     27      0
kmem_cache          1673   1764    448   36    4 : tunables    0    0    0 : slabdata     49     49      0
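
To make a dump like the one above easier to read, here is a minimal Python sketch that ranks slab caches by approximate size. It assumes the slabinfo 2.1 column order shown in the header line and estimates each cache as `<num_objs>` times `<objsize>`, which ignores per-slab overhead; the function name `parse_slabinfo` and the `SAMPLE` excerpt are only illustrative, with the sample rows copied from the output above.

```python
def parse_slabinfo(text):
    """Return a list of (cache_name, approx_bytes), largest first."""
    caches = []
    for line in text.splitlines():
        # Skip blank lines, the version banner and the column-header comment.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        name = fields[0]
        num_objs = int(fields[2])  # <num_objs> column
        objsize = int(fields[3])   # <objsize> column, in bytes
        caches.append((name, num_objs * objsize))
    return sorted(caches, key=lambda item: item[1], reverse=True)


# A few rows copied from the dump above (header shortened for brevity).
SAMPLE = """\
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> ...
kmalloc-8k        211275 211297   8192    4    8 : tunables    0    0    0 : slabdata  52987  52987      0
kmalloc-4k        998777 998849   4096    8    8 : tunables    0    0    0 : slabdata 432521 432521      0
kmalloc-2k        449013 449548   2048   16    8 : tunables    0    0    0 : slabdata  80413  80413      0
dentry             23852  38010    192   21    1 : tunables    0    0    0 : slabdata   1810   1810      0
"""

if __name__ == "__main__":
    for name, size in parse_slabinfo(SAMPLE):
        print(f"{name:<12} {size / 2**30:6.2f} GiB")
```

On the real system the text would come from /proc/slabinfo, which usually needs root to read. By this estimate, the three large kmalloc caches (kmalloc-4k, kmalloc-8k and kmalloc-2k) together hold roughly 6.3 GiB.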

cat /proc/meminfo:

$ cat /proc/meminfo 
MemTotal:       16337164 kB
MemFree:          657940 kB
MemAvailable:     588720 kB
Buffers:           26324 kB
Cached:            93976 kB
SwapCached:        20448 kB
Active:            71728 kB
Inactive:          95076 kB
Active(anon):      15712 kB
Inactive(anon):    31328 kB
Active(file):      56016 kB
Inactive(file):    63748 kB
Unevictable:          32 kB
Mlocked:              32 kB
SwapTotal:      34038772 kB
SwapFree:       33819892 kB
Dirty:               492 kB
Writeback:             0 kB
AnonPages:         34396 kB
Mapped:            30472 kB
Shmem:               536 kB
KReclaimable:      75884 kB
Slab:            6767732 kB
SReclaimable:      75884 kB
SUnreclaim:      6691848 kB
KernelStack:        6552 kB
PageTables:         4796 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    42207352 kB
Committed_AS:     980000 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       40596 kB
VmallocChunk:          0 kB
Percpu:             4960 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     9117084 kB
DirectMap2M:     7600128 kB
DirectMap1G:     1048576 kB
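
As a rough way to read the numbers above, this Python sketch parses /proc/meminfo-style text and reports what share of total RAM sits in unreclaimable slab memory. The helper names `parse_meminfo` and `unreclaimable_slab_share` are made up for illustration, and a high share is only a hint that kernel allocations are growing, not proof of a leak.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def unreclaimable_slab_share(meminfo):
    """Fraction of MemTotal held by unreclaimable slab memory (SUnreclaim)."""
    return meminfo["SUnreclaim"] / meminfo["MemTotal"]


# Values copied from the /proc/meminfo output above.
SAMPLE = """\
MemTotal:       16337164 kB
MemFree:          657940 kB
Slab:            6767732 kB
SReclaimable:      75884 kB
SUnreclaim:      6691848 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"SUnreclaim share of RAM: {unreclaimable_slab_share(info):.1%}")
```

With the figures posted above, SUnreclaim comes to about 41% of MemTotal. Running the same calculation on snapshots taken some time apart would show whether that share keeps growing.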

Last edited by Wild Penguin (2019-10-05 11:06:17)


#3 2019-10-09 15:59:38

Wild Penguin
Member
Registered: 2015-03-19
Posts: 320

Re: [SOLVED] AMDGPU Kernel module hangs when idle since (2019-09-27)

Ok, time for a recap - in case someone is interested or in the same boat.

First, a nice (or not so nice) picture:

[Image: Monitorix view of a memory leak]

Here one can see quite a peculiar memory usage rise pattern - I bet no-one has ever seen stairs like that??? Well, me neither. To be honest, I just had a hunch and noticed the memory leak does *not* happen if I don't run Folding@Home. Here I've been running a script which toggles (pause/resume) folding every hour, when the computer was idle. This explains the peculiar pattern smile. Also, the toggling is happening at the beginning, but seems folding does not always trigger a leak.

I've noticed I can get the memory leak consistenly with folding, and it does not happen if I'm not folding.

About the "Missing get_user_page_done" -log snippets I've posted above: they do still happend, but I do not get a "hang" when it happends - apparently they have no ill effect on the computer I can notice as a user. Also, it seems they have nothing to do with the memory leak nor low memory usage (because of the leak). However I can see I never got them before upgrading to 5.3.1 (and 5.3.5 and beyond). Also, the first hangs which happened, happened when the memory was not low - and I haven't had any that kind of crashes since the first few, but this leak is consistent. EDIT: Also there seem to be a few peculiar kernel errors in the logs, which can be found here (got with journalctl | sed -n -e '/cut\ here/,/end\ trace/ p').
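
For anyone who prefers to post-process the journal in a script, here is a rough Python counterpart of the sed range filter mentioned above. It is only a sketch: the function name `extract_traces` is made up, and unlike sed it drops a block that never reaches an "end trace" line instead of printing it to the end of the input.

```python
def extract_traces(lines):
    """Group log lines into blocks running from 'cut here' to the next 'end trace'."""
    blocks, current = [], None
    for line in lines:
        # Open a new block when a 'cut here' marker appears outside a block.
        if current is None and "cut here" in line:
            current = []
        if current is not None:
            current.append(line)
            # Close the block on the 'end trace' marker.
            if "end trace" in line:
                blocks.append(current)
                current = None
    return blocks


# Shortened lines modelled on the kernel warning quoted earlier in the thread.
SAMPLE = [
    "kernel: some unrelated message",
    "kernel: ------------[ cut here ]------------",
    "kernel: Missing get_user_page_done",
    "kernel: ---[ end trace db7062d8128318e3 ]---",
    "kernel: another unrelated message",
]

if __name__ == "__main__":
    for block in extract_traces(SAMPLE):
        print("\n".join(block))
        print()
```

Keeping each warning as its own list makes it easy to count them or to compare the first lines of each block across kernel versions.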

So, to recap: a few hangs / near-lockups, "Missing get_user_page_done" warnings, and a consistent memory leak with folding, all since upgrading to 5.3.1 (and beyond, so far). Of these, the hangs actually seem to be the least common issue, although they were the reason I posted in the first place.

I've opened a bug report at kernel.org bugzilla.

Last edited by Wild Penguin (2019-10-09 16:30:26)

#4 2019-10-09 16:06:21

loqs
Member
Registered: 2014-03-06
Posts: 17,375

Is the issue still present in linux 5.4-rc2 or linux-amd-staging-drm-next-git?
AMDGPU bugs should be filed at https://bugs.freedesktop.org/ (product DRI, component DRM/AMDgpu).

#5 2019-10-10 14:13:50

Wild Penguin

Hi loqs,

Yes, it seems the bug manifests also on linux-amd-staging-drm-next. I'm not at the OOM stage yet, but definitely something has eaten ~8GB of RAM while I was away for ~8-10 hours (and left the computer folding during that time).

Also, got a single kernel error in the log:

loka 09 17:48:23 ArkkiVille kernel: ---[ end trace 29917bfeb29e58ed ]---
loka 10 10:34:28 ArkkiVille kernel: ------------[ cut here ]------------
loka 10 10:34:28 ArkkiVille kernel: Missing get_user_page_done 
loka 10 10:34:28 ArkkiVille kernel: WARNING: CPU: 0 PID: 44225 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:1009 amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel: Modules linked in: vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE) vboxdrv(OE) vhba(OE) rfcomm ip6table_filter ip6_tables iptable_filter cmac algif_hash algif_skcipher af_alg bnep nct6775 hwmon_vid uinput msr fuse nls_iso8859_1 nls_cp437 vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp tda18271c2dd coretemp amdgpu kvm_intel kvm irqbypass snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio amd_iommu_v2 snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul snd_hda_intel snd_hda_codec btusb snd_hda_core btrtl gpu_sched btbcm snd_hwdep btintel i2c_algo_bit ghash_clmulni_intel bluetooth ttm snd_pcm wmi_bmof mxm_wmi drxk aesni_intel mei_hdcp snd_timer drm_kms_helper gpio_ich iTCO_wdt aes_x86_64 ddbridge iTCO_vendor_support ecdh_generic drm crypto_simd rfkill snd cryptd glue_helper agpgart intel_cstate dvb_core ecc soundcore input_leds mousedev joydev videobuf2_vmalloc videobuf2_memops videobuf2_common syscopyarea sysfillrect sysimgblt fb_sys_fops 
loka 10 10:34:28 ArkkiVille kernel:  intel_uncore videodev intel_rapl_perf mc i2c_i801 e1000e lpc_ich evdev mei_me mei mac_hid wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc overlay sg crypto_user ip_tables x_tables ext4 crc32c_generic hid_steam crc16 mbcache jbd2 sd_mod hid_logitech_dj sr_mod cdrom hid_generic usbhid hid ahci libahci libata xhci_pci crc32c_intel scsi_mod ehci_pci xhci_hcd ehci_hcd bcache crc64 
loka 10 10:34:28 ArkkiVille kernel: CPU: 0 PID: 44225 Comm: FahCore_21 Tainted: G           OE     5.3.0-rc3-amd-staging-drm-next-git-00852-g2d9d0e0dc643 #1 
loka 10 10:34:28 ArkkiVille kernel: Hardware name: ASUS All Series/MAXIMUS VII GENE, BIOS 3503 04/18/2018 
loka 10 10:34:28 ArkkiVille kernel: RIP: 0010:amdgpu_ttm_backend_unbind+0x140/0x150 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel: Code: ef 48 39 c6 0f 85 f8 fe ff ff 80 3d 31 1b 41 00 00 0f 85 eb fe ff ff 48 c7 c7 a4 15 86 c1 c6 05 1d 1b 41 00 01 e8 2f 42 56 ee <0f> 0b e9 d1 fe ff ff 31 c0 eb cc 0f 1f 44 00 00 0f 1f 44 00 00 41 
loka 10 10:34:28 ArkkiVille kernel: RSP: 0018:ffff9e800bf63978 EFLAGS: 00010286 
loka 10 10:34:28 ArkkiVille kernel: RAX: 0000000000000000 RBX: ffff93d4c2b7c780 RCX: 0000000000000000 
loka 10 10:34:28 ArkkiVille kernel: RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff 
loka 10 10:34:28 ArkkiVille kernel: RBP: ffff93d4b3384f28 R08: 00000000000004db R09: 0000000000000001 
loka 10 10:34:28 ArkkiVille kernel: R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 
loka 10 10:34:28 ArkkiVille kernel: R13: ffff93d4c2b7c780 R14: ffff9e800bf63ac0 R15: ffff9e800bf63ac0 
loka 10 10:34:28 ArkkiVille kernel: FS:  00007f0357f46700(0000) GS:ffff93d4cea00000(0000) knlGS:0000000000000000 
loka 10 10:34:28 ArkkiVille kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
loka 10 10:34:28 ArkkiVille kernel: CR2: 00007f317d381000 CR3: 00000002d07d4004 CR4: 00000000001606f0 
loka 10 10:34:28 ArkkiVille kernel: Call Trace: 
loka 10 10:34:28 ArkkiVille kernel:  ttm_tt_unbind+0x1d/0x30 [ttm] 
loka 10 10:34:28 ArkkiVille kernel:  ttm_bo_move_ttm+0x46/0x150 [ttm] 
loka 10 10:34:28 ArkkiVille kernel:  ttm_bo_handle_move_mem+0x583/0x5a0 [ttm] 
loka 10 10:34:28 ArkkiVille kernel:  ttm_bo_validate+0x13d/0x150 [ttm] 
loka 10 10:34:28 ArkkiVille kernel:  amdgpu_cs_list_validate+0x8e/0x150 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel:  amdgpu_cs_ioctl+0xd68/0x1d20 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel:  drm_ioctl_kernel+0xb3/0x100 [drm] 
loka 10 10:34:28 ArkkiVille kernel:  drm_ioctl+0x23d/0x3d0 [drm] 
loka 10 10:34:28 ArkkiVille kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu] 
loka 10 10:34:28 ArkkiVille kernel:  do_vfs_ioctl+0x43d/0x6c0 
loka 10 10:34:28 ArkkiVille kernel:  ? syscall_trace_enter+0x1f2/0x2e0 
loka 10 10:34:28 ArkkiVille kernel:  ksys_ioctl+0x5e/0x90 
loka 10 10:34:28 ArkkiVille kernel:  __x64_sys_ioctl+0x16/0x20 
loka 10 10:34:28 ArkkiVille kernel:  do_syscall_64+0x5f/0x1c0 
loka 10 10:34:28 ArkkiVille kernel:  ? prepare_exit_to_usermode+0x85/0xb0 
loka 10 10:34:28 ArkkiVille kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9 
loka 10 10:34:28 ArkkiVille kernel: RIP: 0033:0x7f0357a6a25b 
loka 10 10:34:28 ArkkiVille kernel: Code: 0f 1e fa 48 8b 05 25 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 9b 0c 00 f7 d8 64 89 01 48 
loka 10 10:34:28 ArkkiVille kernel: RSP: 002b:00007f0357f45258 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 
loka 10 10:34:28 ArkkiVille kernel: RAX: ffffffffffffffda RBX: 00007f0357f452c0 RCX: 00007f0357a6a25b 
loka 10 10:34:28 ArkkiVille kernel: RDX: 00007f0357f452c0 RSI: 00000000c0186444 RDI: 0000000000000014 
loka 10 10:34:28 ArkkiVille kernel: RBP: 00000000c0186444 R08: 00007f0357f45370 R09: 0000000000000010 
loka 10 10:34:28 ArkkiVille kernel: R10: 00007f0357f45370 R11: 0000000000000246 R12: 00000000029504e0 
loka 10 10:34:28 ArkkiVille kernel: R13: 0000000000000014 R14: 00007f0340185890 R15: 000000000410f080 
loka 10 10:34:28 ArkkiVille kernel: ---[ end trace 2fceae28fe8e6841 ]---

But just to make sure, as I'm a bit confused: have I understood correctly that this PKGBUILD currently produces a 5.3.0-series kernel, but with newer amdgpu code (and that the pkgver reflects the target version of that code)?

$ uname -r
5.3.0-rc3-amd-staging-drm-next-git-00852-g2d9d0e0dc643
$ pacman -Qs linux-amd-staging-drm-next-git
local/linux-amd-staging-drm-next-git 5.4.858456.2d9d0e0dc643-1
    The Linux-amd-staging-drm-next-git kernel and modules
local/linux-amd-staging-drm-next-git-headers 5.4.858456.2d9d0e0dc643-1
    Header files and scripts for building modules for Linux-amd-staging-drm-next-git kernel

If this is not correct, then I'm not sure how one is supposed to use that PKGBUILD. Should I edit the PKGBUILD somehow, and/or the Makefile?

(I've read the comments by pam-mroku and yurikoles in AUR...)

Last edited by Wild Penguin (2019-10-10 14:56:08)

#6 2019-10-10 18:42:29

loqs

You are understanding the pkgver correctly.

#7 2019-10-14 13:49:37

Wild Penguin

OK, one (or two) more questions: would you recommend that I file a bug report at freedesktop.org?

I was also thinking about doing a git bisect on the Linux kernel. I've done one before (~10 years ago; I don't remember what the bug was, but it was fixed and I helped with the bisect IIRC :-) ). In this case, though, the bug manifests only after a few hours, which makes things more time-consuming, unless the unreferenced objects reported by kmemleak are indicative of the bug and don't appear on a correctly working kernel. I believe those appeared as soon as I started folding (though I haven't confirmed this, nor tested on an older kernel yet). So, would a git bisect help?

If you think a git bisect is a good idea, which would be more helpful: should I use the mainline kernel tree or https://cgit.freedesktop.org/~agd5f/linux/ for the bisect? I believe I could use either one: among the tags at https://cgit.freedesktop.org/~agd5f/linux/refs/tags I presume 5.2 should work, and in the mainline kernel I'd start somewhere <5.3.1 ...
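
If the kmemleak records do turn out to track the bug, each bisect step could be judged within minutes rather than hours. A rough sketch of that idea follows. It assumes a kernel built with CONFIG_DEBUG_KMEMLEAK and debugfs mounted at /sys/kernel/debug; the helper names, the "any record means bad" rule and the example tags are assumptions for illustration, not something verified in this thread.

```shell
#!/bin/sh
# Hypothetical helper: count "unreferenced object" records in a kmemleak
# report. Defaults to the live report, which is readable only by root.
count_unreferenced() {
    grep -c '^unreferenced object' "${1:-/sys/kernel/debug/kmemleak}"
}

# Hypothetical helper: print "bad" if the report holds any leak record,
# "good" otherwise. Treating a single record as bad is an assumption and may
# need tuning if known-good kernels also report a few objects.
bisect_verdict() {
    report=${1:-/sys/kernel/debug/kmemleak}
    [ -r "$report" ] || { echo "cannot read $report" >&2; return 2; }
    if [ "$(count_unreferenced "$report")" -gt 0 ]; then
        echo bad
    else
        echo good
    fi
}

# Outline of a manual session (commented out; the tags are placeholders):
#   git bisect start
#   git bisect bad  <first-known-bad-tag>    # e.g. a 5.3.x tag (assumed)
#   git bisect good <last-known-good-tag>    # e.g. v5.2 (assumed)
#   ...build, install and boot the commit git checked out, start folding...
#   echo scan > /sys/kernel/debug/kmemleak   # as root: trigger a scan
#   bisect_verdict                           # as root: prints good or bad
#   git bisect good   # or: git bisect bad, according to the verdict
#   ...repeat until git names the first bad commit...
```

Because every step needs a reboot into the freshly built kernel, the session stays manual rather than being driven by git bisect run.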

#8 2019-11-10 17:47:42

Wild Penguin

Hi!

For some reason the leaks just don't happen anymore. I was planning on doing a bisect but never found the time, and it seems that whatever it was is now fixed :).

Perhaps there were some problems (caused by some work / refactoring being done in the driver?) that were fixed in a more recent kernel (currently running 5.3.8).

Marking as [SOLVED]! (I will reopen if the leaks resurface and I think further discussion here is appropriate. I will also update / close the kernel bugzilla report soon; I'll just wait a few days to make sure, so that I don't generate unnecessary noise there with reopens.)

Last edited by Wild Penguin (2019-11-10 17:49:01)
