#1 2024-01-22 16:16:01

scalercio
Member
Registered: 2024-01-22
Posts: 2

Kernel Bug while running a python deep learning training script

My Python script keeps crashing at random points during network training.
Below are some details about my environment, the output of a few commands, and the last two kernel log messages. The system and the BIOS are both up to date, but the problem keeps happening. The kernel is tainted.

[arthur@archlinux ~]$ cat /proc/sys/kernel/tainted
12416
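
For anyone reading the value above: 12416 is a bitmask (8192 + 4096 + 128). Here is a minimal sketch of how it could be decoded, assuming the bit-to-letter table from the kernel's Documentation/admin-guide/tainted-kernels.rst. The names `TAINT_FLAGS` and `decode_taint` are made up for this example.

```python
# Taint bits and letters, assumed to match the table in the kernel's
# Documentation/admin-guide/tainted-kernels.rst.
TAINT_FLAGS = [
    (0, "P", "proprietary module was loaded"),
    (1, "F", "module was force loaded"),
    (2, "S", "kernel running on an out-of-specification system"),
    (3, "R", "module was force unloaded"),
    (4, "M", "processor reported a machine check exception"),
    (5, "B", "bad page referenced or unexpected page flags"),
    (6, "U", "taint requested by a userspace application"),
    (7, "D", "kernel died recently (OOPS or BUG)"),
    (8, "A", "ACPI table overridden by user"),
    (9, "W", "kernel issued a warning"),
    (10, "C", "staging driver was loaded"),
    (11, "I", "workaround for a platform firmware bug applied"),
    (12, "O", "externally built (out-of-tree) module was loaded"),
    (13, "E", "unsigned module was loaded"),
    (14, "L", "soft lockup occurred"),
    (15, "K", "kernel has been live patched"),
    (16, "X", "auxiliary taint, defined and used by distributions"),
    (17, "T", "kernel built with the struct randomization plugin"),
    (18, "N", "an in-kernel test has been run"),
]


def decode_taint(value):
    """Return (bit, letter, description) for every taint bit set in value."""
    return [flag for flag in TAINT_FLAGS if value & (1 << flag[0])]


if __name__ == "__main__":
    # Value read from /proc/sys/kernel/tainted in this post.
    for bit, letter, description in decode_taint(12416):
        print(f"bit {bit:2d} ({letter}): {description}")
```

For 12416 this reports bits 7 (D), 12 (O) and 13 (E). The O and E flags line up with the "OE" in the Tainted: field of the log messages below, and D is set once the kernel has hit an oops.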

Kernel version

[arthur@archlinux ~]$ uname -r
6.7.0-arch3-1

Log Message 1

jan 21 22:53:31 archlinux kernel: BUG: kernel NULL pointer dereference, address: 0000000000000018
jan 21 22:53:31 archlinux kernel: #PF: supervisor write access in kernel mode
jan 21 22:53:31 archlinux kernel: #PF: error_code(0x0002) - not-present page
jan 21 22:53:31 archlinux kernel: PGD 1f003f067 P4D 1f003f067 PUD 0 
jan 21 22:53:31 archlinux kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
jan 21 22:53:31 archlinux kernel: CPU: 8 PID: 1244 Comm: python Tainted: P           OE      6.7.0-arch3-1 #1 29ada86f174bb9983ea57568622d66509982ed7e
jan 21 22:53:31 archlinux kernel: Hardware name: Gigabyte Technology Co., Ltd. Z790 AERO G/Z790 AERO G, BIOS F10 12/14/2023
jan 21 22:53:31 archlinux kernel: RIP: 0010:entry_SYSCALL_64_after_hwframe+0x74/0x76
jan 21 22:53:31 archlinux kernel: Code: 8b 14 25 90 0b 02 00 89 d0 48 c1 ea 20 0f 30 eb 0e cc cc cc cc cc cc cc cc cc cc cc cc cc cc e8 16 33 fb ff 84 c0 0f 84 1e 17 <00> 00 eb 19 b9 48 00 00 00 65 48 8b 14 25 90 0b 02 00 83 e2 fe 89
jan 21 22:53:31 archlinux kernel: RSP: 0018:ffff99c70de77f58 EFLAGS: 00010087
jan 21 22:53:31 archlinux kernel: RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000000
jan 21 22:53:31 archlinux kernel: RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffff99c70de77f58
jan 21 22:53:31 archlinux kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
jan 21 22:53:31 archlinux kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
jan 21 22:53:31 archlinux kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
jan 21 22:53:31 archlinux kernel: FS:  00007e53abe006c0(0000) GS:ffff8b90bf600000(0000) knlGS:0000000000000000
jan 21 22:53:31 archlinux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 21 22:53:31 archlinux kernel: CR2: 0000000000000018 CR3: 0000000105d48000 CR4: 0000000000f50ef0
jan 21 22:53:31 archlinux kernel: PKRU: 55555554
jan 21 22:53:31 archlinux kernel: Call Trace:
jan 21 22:53:31 archlinux kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device joydev mousedev btusb btrtl btintel btbcm btmtk bluetooth uas usbhid usb_storage ecdh_generic nvidia_drm(POE) nvidia_modeset(POE) inte>
jan 21 22:53:31 archlinux kernel:  i915 libarc4 cryptd snd_intel_sdw_acpi snd_hda_codec snd_hda_core drm_buddy rapl i2c_algo_bit snd_hwdep iwlwifi ttm snd_pcm iTCO_wdt mei_hdcp mei_pxp intel_pmc_bxt drm_display_helper snd_timer vfat int>
jan 21 22:53:31 archlinux kernel: CR2: 0000000000000018
jan 21 22:53:31 archlinux kernel: ---[ end trace 0000000000000000 ]---
jan 21 22:53:31 archlinux kernel: RIP: 0010:entry_SYSCALL_64_after_hwframe+0x74/0x76
jan 21 22:53:31 archlinux kernel: Code: 8b 14 25 90 0b 02 00 89 d0 48 c1 ea 20 0f 30 eb 0e cc cc cc cc cc cc cc cc cc cc cc cc cc cc e8 16 33 fb ff 84 c0 0f 84 1e 17 <00> 00 eb 19 b9 48 00 00 00 65 48 8b 14 25 90 0b 02 00 83 e2 fe 89
jan 21 22:53:31 archlinux kernel: RSP: 0018:ffff99c70de77f58 EFLAGS: 00010087
jan 21 22:53:31 archlinux kernel: RAX: 0000000000000018 RBX: 0000000000000000 RCX: 0000000000000000
jan 21 22:53:31 archlinux kernel: RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffff99c70de77f58
jan 21 22:53:31 archlinux kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
jan 21 22:53:31 archlinux kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
jan 21 22:53:31 archlinux kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
jan 21 22:53:31 archlinux kernel: FS:  00007e53abe006c0(0000) GS:ffff8b90bf600000(0000) knlGS:0000000000000000
jan 21 22:53:31 archlinux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 21 22:53:31 archlinux kernel: CR2: 0000000000000018 CR3: 0000000105d48000 CR4: 0000000000f50ef0
jan 21 22:53:31 archlinux kernel: PKRU: 55555554
jan 21 22:53:31 archlinux kernel: note: python[1244] exited with irqs disabled 

Log Message 2

jan 22 01:46:18 archlinux kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
jan 22 01:46:18 archlinux kernel: #PF: supervisor instruction fetch in kernel mode
jan 22 01:46:18 archlinux kernel: #PF: error_code(0x0010) - not-present page
jan 22 01:46:18 archlinux kernel: PGD 17ba40067 P4D 17ba40067 PUD 0 
jan 22 01:46:18 archlinux kernel: Oops: 0010 [#1] PREEMPT SMP NOPTI
jan 22 01:46:18 archlinux kernel: CPU: 8 PID: 1244 Comm: python Tainted: G           OE      6.7.0-arch3-1 #1 29ada86f174bb9983ea57568622d66509982ed7e
jan 22 01:46:18 archlinux kernel: Hardware name: Gigabyte Technology Co., Ltd. Z790 AERO G/Z790 AERO G, BIOS F10 12/14/2023
jan 22 01:46:18 archlinux kernel: RIP: 0010:0x0
jan 22 01:46:18 archlinux kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
jan 22 01:46:18 archlinux kernel: RSP: 0018:ffffaaf0c7817f60 EFLAGS: 00010046
jan 22 01:46:18 archlinux kernel: RAX: 00007aa681f67401 RBX: 0000000000000000 RCX: 00007ffffffff000
jan 22 01:46:18 archlinux kernel: RDX: 0000000000000246 RSI: ffff93a73f634d30 RDI: ffffaaf0c7817f58
jan 22 01:46:18 archlinux kernel: RBP: 0000000000000000 R08: ffff93a73f634d30 R09: 0000000000000002
jan 22 01:46:18 archlinux kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
jan 22 01:46:18 archlinux kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
jan 22 01:46:18 archlinux kernel: FS:  00007aa4f0c006c0(0000) GS:ffff93a73f600000(0000) knlGS:0000000000000000
jan 22 01:46:18 archlinux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 22 01:46:18 archlinux kernel: CR2: ffffffffffffffd6 CR3: 000000011bb44000 CR4: 0000000000f50ef0
jan 22 01:46:18 archlinux kernel: PKRU: 55555554
jan 22 01:46:18 archlinux kernel: Call Trace:
jan 22 01:46:18 archlinux kernel:  <TASK>
jan 22 01:46:18 archlinux kernel:  ? __die+0x23/0x70
jan 22 01:46:18 archlinux kernel:  ? page_fault_oops+0x171/0x4e0
jan 22 01:46:18 archlinux kernel:  ? exc_page_fault+0x7f/0x180
jan 22 01:46:18 archlinux kernel:  ? asm_exc_page_fault+0x26/0x30
jan 22 01:46:18 archlinux kernel:  </TASK>
jan 22 01:46:18 archlinux kernel: Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_sof_pci_intel_tgl snd_sof_intel_hda_common s>
jan 22 01:46:18 archlinux kernel:  iTCO_wdt iwlwifi snd_pcm drm_display_helper usbhid vfat ecdh_generic mei_pxp mei_hdcp intel_cstate intel_pmc_bxt nvidia_modeset(OE) iTCO_vendor_support wmi_bmof gigabyte_wmi cfg80211 spi_nor cec snd_ti>
jan 22 01:46:18 archlinux kernel: CR2: 0000000000000000
jan 22 01:46:18 archlinux kernel: ---[ end trace 0000000000000000 ]---
jan 22 01:46:18 archlinux kernel: RIP: 0010:0x0
jan 22 01:46:18 archlinux kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
jan 22 01:46:18 archlinux kernel: RSP: 0018:ffffaaf0c7817f60 EFLAGS: 00010046
jan 22 01:46:18 archlinux kernel: RAX: 00007aa681f67401 RBX: 0000000000000000 RCX: 00007ffffffff000
jan 22 01:46:18 archlinux kernel: RDX: 0000000000000246 RSI: ffff93a73f634d30 RDI: ffffaaf0c7817f58
jan 22 01:46:18 archlinux kernel: RBP: 0000000000000000 R08: ffff93a73f634d30 R09: 0000000000000002
jan 22 01:46:18 archlinux kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
jan 22 01:46:18 archlinux kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
jan 22 01:46:18 archlinux kernel: FS:  00007aa4f0c006c0(0000) GS:ffff93a73f600000(0000) knlGS:0000000000000000
jan 22 01:46:18 archlinux kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jan 22 01:46:18 archlinux kernel: CR2: ffffffffffffffd6 CR3: 000000011bb44000 CR4: 0000000000f50ef0
jan 22 01:46:18 archlinux kernel: PKRU: 55555554
jan 22 01:46:18 archlinux kernel: note: python[1244] exited with irqs disabled
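
The two `#PF: error_code(...)` values in the logs above (0x0002 and 0x0010) can be read bit by bit. This is a rough sketch that assumes the standard x86 page-fault error code layout (bit 0 present/protection, bit 1 write, bit 2 user, bit 3 reserved bit, bit 4 instruction fetch, bit 5 protection key). The function name `decode_pf_error` is only illustrative.

```python
def decode_pf_error(code):
    """Decode an x86 page-fault error code into its main fields.

    Assumes the standard x86 bit layout; higher bits are ignored here.
    """
    if code & 0x10:
        access = "instruction fetch"
    elif code & 0x02:
        access = "write access"
    else:
        access = "read access"
    return {
        "cause": "protection violation" if code & 0x01 else "not-present page",
        "access": access,
        "mode": "user" if code & 0x04 else "supervisor",
        "reserved_bit": bool(code & 0x08),
        "protection_key": bool(code & 0x20),
    }


if __name__ == "__main__":
    # The two error codes that appear in the logs of this post.
    for code in (0x0002, 0x0010):
        print(f"{code:#06x}: {decode_pf_error(code)}")
```

Under that layout, 0x0002 is a supervisor write to a not-present page and 0x0010 is a supervisor instruction fetch from a not-present page, which agrees with the text the kernel printed next to each code.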

#2 2024-01-23 14:40:11

scalercio
Member
Registered: 2024-01-22
Posts: 2

Re: Kernel Bug while running a python deep learning training script

I created a new environment with the latest Python 3.10 release (3.10.13) and reinstalled all the packages. The crashes have not come back so far!
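
For reference, one way to rebuild an environment from scratch is sketched below. It assumes the standard-library `venv` module was used, which the post does not actually say (it could just as well have been conda or another tool). The helper name `rebuild_env` and the path are hypothetical.

```python
import venv
from pathlib import Path


def rebuild_env(env_dir, with_pip=True):
    """Create a fresh virtual environment at env_dir.

    clear=True deletes the contents of an existing environment directory
    first, so no previously installed packages survive.
    """
    builder = venv.EnvBuilder(clear=True, with_pip=with_pip)
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"


if __name__ == "__main__":
    # Hypothetical location; run this with the interpreter you want
    # the environment to be based on (e.g. python3.10).
    cfg = rebuild_env(Path.home() / "envs" / "training")
    print(cfg.read_text())
```

After that, the packages would be reinstalled with the new environment's own pip, so that nothing is carried over from the old one.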