#1 2019-03-26 15:11:11

feng
Member
Registered: 2019-03-26
Posts: 11

System suspend with kernel 5 in heavy GPU load case with Tensorflow

I do machine learning on Arch Linux with two GTX 1080 Ti cards, using the TensorFlow framework.

Since moving to Linux 5, the system freezes under heavy GPU load. Switching back to linux-lts, which is at 4.19, solves the problem.
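
As a stopgap, a training script could warn when it is started on a kernel in the suspect range. The following is only a minimal sketch: the 5.0 threshold is an assumption drawn from the two kernels compared here (5.0.3 freezes, 4.19 does not), and the function names `kernel_version` and `is_suspect` are made up for illustration.

```python
import platform
import re


def kernel_version(release):
    """Parse (major, minor) from a release string such as '5.0.3-arch1-1-ARCH'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_suspect(release):
    """True if the kernel is 5.0 or newer (assumed threshold, see the note above)."""
    return kernel_version(release) >= (5, 0)


# platform.release() returns the running kernel release, like `uname -r`.
if is_suspect(platform.release()):
    print("warning: running on kernel >= 5.0, which froze under heavy GPU load here")
```

The check only prints a warning. Whether later 5.x kernels or newer NVIDIA driver releases are affected is not something this thread establishes.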

Below is the dmesg output for reference:

[  336.146259] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 236
[  336.665488] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
[  336.665708] caller _nv000934rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[  337.209850] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[  337.209934] #PF error: [normal kernel read fault]
[  337.209981] PGD 8000000eff0ee067 P4D 8000000eff0ee067 PUD 0 
[  337.210042] Oops: 0000 [#1] PREEMPT SMP PTI
[  337.210087] CPU: 1 PID: 1059 Comm: python3 Tainted: P           OE     5.0.3-arch1-1-ARCH #1
[  337.210163] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.61 06/18/2018
[  337.210550] RIP: 0010:nv_dma_map_peer+0xd0/0x160 [nvidia]
[  337.210606] Code: ce e8 b4 fd ff ff 48 8b 5c 24 10 65 48 33 1c 25 28 00 00 00 0f 85 8e 00 00 00 48 83 c4 18 5b 5d 41 5c c3 48 8b 05 58 cc b6 c8 <48> 83 78 40 00 75 ca 49 c1 e2 06 49 8b 78 10 48 89 e6 4b 8d 94 10
[  337.210771] RSP: 0018:ffffa5d3c73cb9a0 EFLAGS: 00010246
[  337.210822] RAX: 0000000000000000 RBX: ffffa2ff7f6ab170 RCX: 0000000000000010
[  337.210888] RDX: 0000000000000001 RSI: ffffa3009e5e4800 RDI: ffffa3009e5e0800
[  337.210954] RBP: 0000387fe0000000 R08: ffffa300b6873000 R09: 0000387fefffffff
[  337.211020] R10: 0000000000000001 R11: 0000000000010000 R12: 0000387fe0000000
[  337.211086] R13: ffffa3009e5e4800 R14: 000000009e5e0800 R15: ffffa2ff7f6ab148
[  337.211153] FS:  00007f5225e54600(0000) GS:ffffa300bfa40000(0000) knlGS:0000000000000000
[  337.211227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  337.211281] CR2: 0000000000000040 CR3: 0000001030c12004 CR4: 00000000007606e0
[  337.211348] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  337.211414] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  337.211487] PKRU: 55555554
[  337.211517] Call Trace:
[  337.212078]  _nv029599rm+0x265/0x490 [nvidia]
[  337.212455]  ? _nv025113rm+0x17c/0x3f0 [nvidia]
[  337.213626]  ? _nv026023rm+0xd7/0x270 [nvidia]
[  337.214051]  ? _nv010048rm+0x122/0x1b0 [nvidia]
[  337.214489]  ? _nv010044rm+0xc7/0x300 [nvidia]
[  337.214922]  ? _nv009778rm+0x2a2/0x770 [nvidia]
[  337.215359]  ? _nv009779rm+0x2a1/0x500 [nvidia]
[  337.215786]  ? _nv029787rm+0x367/0x870 [nvidia]
[  337.216559]  ? _nv003495rm+0x4e/0x70 [nvidia]
[  337.217351]  ? _nv004155rm+0xd9/0x180 [nvidia]
[  337.218528]  ? _nv032634rm+0x48/0x90 [nvidia]
[  337.219949]  ? _nv006069rm+0x1ca/0x350 [nvidia]
[  337.221079]  ? _nv033943rm+0x3cc/0x5e0 [nvidia]
[  337.222457]  ? _nv033942rm+0x98/0xf0 [nvidia]
[  337.223532]  ? _nv032735rm+0xfc/0x270 [nvidia]
[  337.224582]  ? _nv032736rm+0x53/0x80 [nvidia]
[  337.225622]  ? _nv007020rm+0x4b/0x80 [nvidia]
[  337.226763]  ? _nv000935rm+0x46e/0x900 [nvidia]
[  337.227518]  ? _raw_spin_unlock_irqrestore+0x20/0x40
[  337.228260]  ? rm_ioctl+0x54/0xb0 [nvidia]
[  337.228267]  ? filemap_map_pages+0x185/0x380
[  337.229327]  ? nvidia_ioctl+0xb0/0x7c0 [nvidia]
[  337.229927]  ? nvidia_ioctl+0x5f0/0x7c0 [nvidia]
[  337.230500]  ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
[  337.230959]  ? do_vfs_ioctl+0xa4/0x630
[  337.231391]  ? handle_mm_fault+0x10a/0x250
[  337.231822]  ? ksys_ioctl+0x60/0x90
[  337.232246]  ? __x64_sys_ioctl+0x16/0x20
[  337.232669]  ? do_syscall_64+0x5b/0x170
[  337.233083]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  337.233509] Modules linked in: nvidia_uvm(OE) cfg80211 8021q garp mrp stp llc nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi intel_rapl skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 nls_cp437 vfat fat kvm irqbypass nvidia(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio drm_kms_helper snd_hda_intel drm snd_hda_codec mei_wdt hp_wmi wmi_bmof sparse_keymap iTCO_wdt rfkill iTCO_vendor_support intel_wmi_thunderbolt snd_hda_core psmouse snd_hwdep ofpart agpgart snd_pcm e1000e intel_uncore i40e ipmi_devintf cmdlinepart tpm_crb snd_timer ipmi_msghandler intel_rapl_perf snd syscopyarea intel_spi_pci pcspkr intel_spi sysfillrect sysimgblt spi_nor fb_sys_fops soundcore mei_me i2c_i801 mtd ucsi_acpi lpc_ich typec_ucsi ioatdma mei tpm_tis typec wmi dca tpm_tis_core tpm rng_core evdev pcc_cpufreq mac_hid ip_tables x_tables raid456
[  337.233555]  libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx ext4 crc32c_generic crc16 mbcache jbd2 fscrypto uas usb_storage raid6_pq md_mod sr_mod cdrom sd_mod ahci libahci serio_raw atkbd libata libps2 xhci_pci scsi_mod crc32c_intel xhci_hcd i8042 serio
[  337.239425] CR2: 0000000000000040
[  337.239874] ---[ end trace c9c5803eae8f4459 ]---
[  337.245772] RIP: 0010:nv_dma_map_peer+0xd0/0x160 [nvidia]
[  337.246275] Code: ce e8 b4 fd ff ff 48 8b 5c 24 10 65 48 33 1c 25 28 00 00 00 0f 85 8e 00 00 00 48 83 c4 18 5b 5d 41 5c c3 48 8b 05 58 cc b6 c8 <48> 83 78 40 00 75 ca 49 c1 e2 06 49 8b 78 10 48 89 e6 4b 8d 94 10
[  337.247797] RSP: 0018:ffffa5d3c73cb9a0 EFLAGS: 00010246
[  337.248268] RAX: 0000000000000000 RBX: ffffa2ff7f6ab170 RCX: 0000000000000010
[  337.248749] RDX: 0000000000000001 RSI: ffffa3009e5e4800 RDI: ffffa3009e5e0800
[  337.249234] RBP: 0000387fe0000000 R08: ffffa300b6873000 R09: 0000387fefffffff
[  337.249725] R10: 0000000000000001 R11: 0000000000010000 R12: 0000387fe0000000
[  337.250217] R13: ffffa3009e5e4800 R14: 000000009e5e0800 R15: ffffa2ff7f6ab148
[  337.250741] FS:  00007f5225e54600(0000) GS:ffffa300bfa40000(0000) knlGS:0000000000000000
[  337.251244] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  337.251750] CR2: 0000000000000040 CR3: 0000001030c12004 CR4: 00000000007606e0
[  337.252264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  337.252784] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  337.253298] PKRU: 55555554
[  429.507755] audit: type=1006 audit(1553247655.831:45): pid=1189 uid=0 old-auid=4294967295 auid=1000 tty=(none) old-ses=4294967295 ses=5 res=1
[  431.818034] BUG: unable to handle kernel paging request at ffffa5d3c73cbdb8
[  431.819994] #PF error: [normal kernel read fault]
[  431.821753] PGD 103f539067 P4D 103f539067 PUD 103f53a067 PMD 1035d92067 PTE 0
[  431.823592] Oops: 0000 [#2] PREEMPT SMP PTI
[  431.825364] CPU: 2 PID: 1216 Comm: nvidia-smi Tainted: P      D    OE     5.0.3-arch1-1-ARCH #1
[  431.827151] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.61 06/18/2018
[  431.829681] RIP: 0010:_nv006971rm+0x2c/0x330 [nvidia]
[  431.831003] Code: 48 85 d2 74 07 48 63 47 08 48 01 d0 48 8b 17 48 85 d2 75 16 e9 9d 02 00 00 0f 1f 44 00 00 48 8b 4a 10 48 85 c9 74 17 48 89 ca <48> 39 32 77 ef 0f 83 29 02 00 00 48 8b 4a 18 48 85 c9 75 e9 48 89
[  431.834433] RSP: 0018:ffffa5d3c8067d30 EFLAGS: 00010086
[  431.835563] RAX: ffffa5d3c8067db8 RBX: ffffa5d3c8067d60 RCX: 0000000000000000
[  431.836689] RDX: ffffa5d3c73cbdb8 RSI: 00000000000004c0 RDI: ffffffffc1e423d8
[  431.837797] RBP: ffffa2ff71ceaff0 R08: 0000000000000048 R09: 0000000000000060
[  431.838884] R10: 0000000000000000 R11: 0000000000000000 R12: 3f9293eb2400afdc
[  431.839951] R13: ffffa3008a25c5a0 R14: 000000000000001f R15: 0000000000000048
[  431.840781] FS:  00007f48680edb80(0000) GS:ffffa300bfa80000(0000) knlGS:0000000000000000
[  431.841508] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  431.842236] CR2: ffffa5d3c73cbdb8 CR3: 000000100d710004 CR4: 00000000007606e0
[  431.842972] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  431.843694] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  431.844398] PKRU: 55555554
[  431.845074] Call Trace:
[  431.846024]  ? _nv035395rm+0xf1/0x1d0 [nvidia]
[  431.846892]  ? rm_perform_version_check+0x35/0x150 [nvidia]
[  431.847570]  ? __kmalloc+0x188/0x210
[  431.848365]  ? nvidia_ioctl+0xd6/0x7c0 [nvidia]
[  431.849163]  ? nvidia_ioctl+0x630/0x7c0 [nvidia]
[  431.849952]  ? nvidia_frontend_unlocked_ioctl+0x3a/0x50 [nvidia]
[  431.850467]  ? do_vfs_ioctl+0xa4/0x630
[  431.850944]  ? syscall_trace_enter+0x1d3/0x2d0
[  431.851419]  ? ksys_ioctl+0x60/0x90
[  431.851884]  ? __x64_sys_ioctl+0x16/0x20
[  431.852349]  ? do_syscall_64+0x5b/0x170
[  431.852815]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  431.853266] Modules linked in: nvidia_uvm(OE) cfg80211 8021q garp mrp stp llc nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi intel_rapl skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 nls_cp437 vfat fat kvm irqbypass nvidia(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio drm_kms_helper snd_hda_intel drm snd_hda_codec mei_wdt hp_wmi wmi_bmof sparse_keymap iTCO_wdt rfkill iTCO_vendor_support intel_wmi_thunderbolt snd_hda_core psmouse snd_hwdep ofpart agpgart snd_pcm e1000e intel_uncore i40e ipmi_devintf cmdlinepart tpm_crb snd_timer ipmi_msghandler intel_rapl_perf snd syscopyarea intel_spi_pci pcspkr intel_spi sysfillrect sysimgblt spi_nor fb_sys_fops soundcore mei_me i2c_i801 mtd ucsi_acpi lpc_ich typec_ucsi ioatdma mei tpm_tis typec wmi dca tpm_tis_core tpm rng_core evdev pcc_cpufreq mac_hid ip_tables x_tables raid456
[  431.853300]  libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx ext4 crc32c_generic crc16 mbcache jbd2 fscrypto uas usb_storage raid6_pq md_mod sr_mod cdrom sd_mod ahci libahci serio_raw atkbd libata libps2 xhci_pci scsi_mod crc32c_intel xhci_hcd i8042 serio
[  431.859440] CR2: ffffa5d3c73cbdb8
[  431.859919] ---[ end trace c9c5803eae8f445a ]---
[  431.866198] RIP: 0010:nv_dma_map_peer+0xd0/0x160 [nvidia]
[  431.866772] Code: ce e8 b4 fd ff ff 48 8b 5c 24 10 65 48 33 1c 25 28 00 00 00 0f 85 8e 00 00 00 48 83 c4 18 5b 5d 41 5c c3 48 8b 05 58 cc b6 c8 <48> 83 78 40 00 75 ca 49 c1 e2 06 49 8b 78 10 48 89 e6 4b 8d 94 10
[  431.868336] RSP: 0018:ffffa5d3c73cb9a0 EFLAGS: 00010246
[  431.868840] RAX: 0000000000000000 RBX: ffffa2ff7f6ab170 RCX: 0000000000000010
[  431.869348] RDX: 0000000000000001 RSI: ffffa3009e5e4800 RDI: ffffa3009e5e0800
[  431.869860] RBP: 0000387fe0000000 R08: ffffa300b6873000 R09: 0000387fefffffff
[  431.870379] R10: 0000000000000001 R11: 0000000000010000 R12: 0000387fe0000000
[  431.870886] R13: ffffa3009e5e4800 R14: 000000009e5e0800 R15: ffffa2ff7f6ab148
[  431.871396] FS:  00007f48680edb80(0000) GS:ffffa300bfa80000(0000) knlGS:0000000000000000
[  431.871913] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  431.872445] CR2: ffffa5d3c73cbdb8 CR3: 000000100d710004 CR4: 00000000007606e0
[  431.872970] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  431.873501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  431.874030] PKRU: 55555554
[  431.874567] note: nvidia-smi[1216] exited with preempt_count 1
[  491.820627] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  491.821400] rcu:     2-...0: (1 ticks this GP) idle=2f6/1/0x4000000000000000 softirq=807/807 fqs=5983 last_accelerate: 9990/dffc, nonlazy_posted: 6, L.
[  491.823289] rcu:     (detected by 12, t=18002 jiffies, g=16101, q=50310)
[  491.824214] Sending NMI from CPU 12 to CPUs 2:
[  491.827127] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[  491.827127] Modules linked in: nvidia_uvm(OE) cfg80211 8021q garp mrp stp llc nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_hdmi intel_rapl skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 nls_cp437 vfat fat kvm irqbypass nvidia(POE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio drm_kms_helper snd_hda_intel drm snd_hda_codec mei_wdt hp_wmi wmi_bmof sparse_keymap iTCO_wdt rfkill iTCO_vendor_support intel_wmi_thunderbolt snd_hda_core psmouse snd_hwdep ofpart agpgart snd_pcm e1000e intel_uncore i40e ipmi_devintf cmdlinepart tpm_crb snd_timer ipmi_msghandler intel_rapl_perf snd syscopyarea intel_spi_pci pcspkr intel_spi sysfillrect sysimgblt spi_nor fb_sys_fops soundcore mei_me i2c_i801 mtd ucsi_acpi lpc_ich typec_ucsi ioatdma mei tpm_tis typec wmi dca tpm_tis_core tpm rng_core evdev pcc_cpufreq mac_hid ip_tables x_tables raid456
[  491.827149]  libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx ext4 crc32c_generic crc16 mbcache jbd2 fscrypto uas usb_storage raid6_pq md_mod sr_mod cdrom sd_mod ahci libahci serio_raw atkbd libata libps2 xhci_pci scsi_mod crc32c_intel xhci_hcd i8042 serio
[  491.827157] CPU: 2 PID: 1216 Comm: nvidia-smi Tainted: P      D    OE     5.0.3-arch1-1-ARCH #1
[  491.827158] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.61 06/18/2018
[  491.827158] RIP: 0010:native_queued_spin_lock_slowpath+0x66/0x1d0
[  491.827159] Code: 71 f0 0f ba 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 81
[  491.827160] RSP: 0018:ffffa5d3c8067ce0 EFLAGS: 00000002
[  491.827160] RAX: 0000000000000101 RBX: ffffa300b446f3e8 RCX: 0000000000000000
[  491.827161] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa300b446f3e8
[  491.827161] RBP: 0000000000000287 R08: 0000000000000000 R09: 0000000000000000
[  491.827162] R10: ffffce01ff3a3288 R11: 0000000000000010 R12: 0000000000000020
[  491.827162] R13: ffffffffc1e2f8c0 R14: ffffa3008a2ee400 R15: ffffa300affb5710
[  491.827163] FS:  0000000000000000(0000) GS:ffffa300bfa80000(0000) knlGS:0000000000000000
[  491.827163] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  491.827164] CR2: ffffa5d3c73cbdb8 CR3: 0000000ff580e005 CR4: 00000000007606e0
[  491.827164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  491.827165] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  491.827165] PKRU: 55555554
[  491.827165] Call Trace:
[  491.827166]  _raw_spin_lock_irqsave+0x42/0x50
[  491.827166]  os_acquire_spinlock+0xe/0x20 [nvidia]
[  491.827167]  _nv031902rm+0xc/0x20 [nvidia]
[  491.827167]  ? _nv035395rm+0xdb/0x1d0 [nvidia]
[  491.827167]  ? rm_free_unused_clients+0x2d/0xa0 [nvidia]
[  491.827168]  ? preempt_count_add+0x79/0xb0
[  491.827168]  ? _raw_spin_lock_irqsave+0x20/0x50
[  491.827169]  ? nvidia_close+0x200/0x3a0 [nvidia]
[  491.827169]  ? nvidia_frontend_close+0x2a/0x40 [nvidia]
[  491.827169]  ? __fput+0xa2/0x1d0
[  491.827170]  ? task_work_run+0x8f/0xb0
[  491.827170]  ? do_exit+0x392/0xb70
[  491.827170]  ? ksys_ioctl+0x60/0x90
[  491.827171]  ? rewind_stack_do_exit+0x17/0x20
[  491.827171] NMI backtrace for cpu 2
[  491.827172] CPU: 2 PID: 1216 Comm: nvidia-smi Tainted: P      D    OE     5.0.3-arch1-1-ARCH #1
[  491.827172] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.61 06/18/2018
[  491.827173] RIP: 0010:native_queued_spin_lock_slowpath+0x66/0x1d0
[  491.827174] Code: 71 f0 0f ba 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 81
[  491.827174] RSP: 0018:ffffa5d3c8067ce0 EFLAGS: 00000002
[  491.827175] RAX: 0000000000000101 RBX: ffffa300b446f3e8 RCX: 0000000000000000
[  491.827175] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa300b446f3e8
[  491.827176] RBP: 0000000000000287 R08: 0000000000000000 R09: 0000000000000000
[  491.827176] R10: ffffce01ff3a3288 R11: 0000000000000010 R12: 0000000000000020
[  491.827177] R13: ffffffffc1e2f8c0 R14: ffffa3008a2ee400 R15: ffffa300affb5710
[  491.827177] FS:  0000000000000000(0000) GS:ffffa300bfa80000(0000) knlGS:0000000000000000
[  491.827178] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  491.827178] CR2: ffffa5d3c73cbdb8 CR3: 0000000ff580e005 CR4: 00000000007606e0
[  491.827179] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  491.827179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  491.827179] PKRU: 55555554
[  491.827180] Call Trace:
[  491.827180]  _raw_spin_lock_irqsave+0x42/0x50
[  491.827180]  os_acquire_spinlock+0xe/0x20 [nvidia]
[  491.827181]  _nv031902rm+0xc/0x20 [nvidia]
[  491.827181]  ? _nv035395rm+0xdb/0x1d0 [nvidia]
[  491.827182]  ? rm_free_unused_clients+0x2d/0xa0 [nvidia]
[  491.827182]  ? preempt_count_add+0x79/0xb0
[  491.827182]  ? _raw_spin_lock_irqsave+0x20/0x50
[  491.827183]  ? nvidia_close+0x200/0x3a0 [nvidia]
[  491.827183]  ? nvidia_frontend_close+0x2a/0x40 [nvidia]
[  491.827184]  ? __fput+0xa2/0x1d0
[  491.827184]  ? task_work_run+0x8f/0xb0
[  491.827184]  ? do_exit+0x392/0xb70
[  491.827185]  ? ksys_ioctl+0x60/0x90
[  491.827185]  ? rewind_stack_do_exit+0x17/0x20
[  671.840970] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  671.842287] rcu:     2-...0: (1 ticks this GP) idle=2f6/1/0x4000000000000000 softirq=807/807 fqs=23958 last_accelerate: 9990/b2f2, nonlazy_posted: 6, L.
[  671.844973] rcu:     (detected by 10, t=72008 jiffies, g=16101, q=210236)
[  671.846341] Sending NMI from CPU 10 to CPUs 2:
[  671.849844] NMI backtrace for cpu 2
[  671.849846] CPU: 2 PID: 1216 Comm: nvidia-smi Tainted: P      D    OE     5.0.3-arch1-1-ARCH #1
[  671.849848] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.61 06/18/2018
[  671.849850] RIP: 0010:native_queued_spin_lock_slowpath+0x66/0x1d0
[  671.849853] Code: 71 f0 0f ba 2b 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 03 30 e4 09 d0 a9 00 01 ff ff 75 4b 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 5b 5d 41 5c c3 8b 37 81
[  671.849855] RSP: 0018:ffffa5d3c8067ce0 EFLAGS: 00000002
[  671.849858] RAX: 0000000000000101 RBX: ffffa300b446f3e8 RCX: 0000000000000000
[  671.849860] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa300b446f3e8
[  671.849861] RBP: 0000000000000287 R08: 0000000000000000 R09: 0000000000000000
[  671.849863] R10: ffffce01ff3a3288 R11: 0000000000000010 R12: 0000000000000020
[  671.849865] R13: ffffffffc1e2f8c0 R14: ffffa3008a2ee400 R15: ffffa300affb5710
[  671.849867] FS:  0000000000000000(0000) GS:ffffa300bfa80000(0000) knlGS:0000000000000000
[  671.849869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  671.849870] CR2: ffffa5d3c73cbdb8 CR3: 0000000ff580e005 CR4: 00000000007606e0
[  671.849872] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  671.849874] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  671.849875] PKRU: 55555554
[  671.849876] Call Trace:
[  671.849878]  _raw_spin_lock_irqsave+0x42/0x50
[  671.849879]  os_acquire_spinlock+0xe/0x20 [nvidia]
[  671.849881]  _nv031902rm+0xc/0x20 [nvidia]
[  671.849882]  ? _nv035395rm+0xdb/0x1d0 [nvidia]
[  671.849884]  ? rm_free_unused_clients+0x2d/0xa0 [nvidia]
[  671.849885]  ? preempt_count_add+0x79/0xb0
[  671.849887]  ? _raw_spin_lock_irqsave+0x20/0x50
[  671.849888]  ? nvidia_close+0x200/0x3a0 [nvidia]
[  671.849890]  ? nvidia_frontend_close+0x2a/0x40 [nvidia]
[  671.849891]  ? __fput+0xa2/0x1d0
[  671.849892]  ? task_work_run+0x8f/0xb0
[  671.849894]  ? do_exit+0x392/0xb70
[  671.849895]  ? ksys_ioctl+0x60/0x90
[  671.849896]  ? rewind_stack_do_exit+0x17/0x20
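
Since the log above is long and repeats its register dumps, here is a minimal Python sketch that reduces dmesg text to its headline events (BUG, Oops, faulting RIP, watchdog and RCU stall lines). The function name `summarize_dmesg` and the list of prefixes are illustrative choices, not part of any kernel or NVIDIA tooling, and the prefix list is not exhaustive.

```python
import re

# Matches a dmesg line such as "[  337.209850] BUG: unable to handle ..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s(.*)$")

# Message prefixes treated as headline events (illustrative, not exhaustive).
HEADLINE_PREFIXES = ("BUG:", "Oops:", "RIP:", "NMI watchdog:", "rcu: INFO:")


def summarize_dmesg(text):
    """Return (timestamp, message) pairs for headline events.

    A message that repeats verbatim is reported only at its first timestamp.
    """
    events = []
    seen = set()
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if match is None:
            continue  # not a timestamped dmesg line
        stamp = float(match.group(1))
        message = match.group(2).strip()
        if message.startswith(HEADLINE_PREFIXES) and message not in seen:
            seen.add(message)
            events.append((stamp, message))
    return events
```

Calling `summarize_dmesg` on the saved output of `dmesg` and printing each pair gives a short timeline. For this log it would start with the NULL pointer dereference at 337.209850 and the faulting `nv_dma_map_peer` RIP. Because repeats are dropped, the second RCU stall report at 671.840970 would not be listed separately.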

#2 2019-03-26 16:21:04

loqs
Member
Registered: 2014-03-06
Posts: 17,378

Re: System suspend with kernel 5 in heavy GPU load case with Tensorflow
