Running blender keeps crashing my session

VoodooCode · 2024-04-23 15:36:39

Hi,

I have a Blender script that renders a number of images. However, at seemingly random points I get locked out of my session and find an Xorg core dump. This happens whether I render on the CPU or the GPU. Adding sleep commands to the code improves stability somewhat, so my guess is that Blender momentarily overloads or blocks the CPU (regardless of render mode), which Xorg may interpret as a short freeze and then kill the session. Ideally I'd like to get rid of the issue, but I'm new to Arch and have little experience investigating the cause of a session crash.

This is the corresponding core dump:

Apr 23 11:21:14 voodoopc systemd-coredump[1249727]: [?] Process 972 (Xorg) of user 0 dumped core.
                                                    
                                                    Stack trace of thread 972:
                                                    #0  0x000071fb7f2f232c n/a (libc.so.6 + 0x8d32c)
                                                    #1  0x000071fb7f2a16c8 raise (libc.so.6 + 0x3c6c8)
                                                    #2  0x000071fb7f2894b8 abort (libc.so.6 + 0x244b8)
                                                    #3  0x0000557c068f3b00 OsAbort (Xorg + 0x14ab00)
                                                    #4  0x0000557c068f3e3b FatalError (Xorg + 0x14ae3b)
                                                    #5  0x0000557c068ebd46 n/a (Xorg + 0x142d46)
                                                    #6  0x000071fb7f2a1770 n/a (libc.so.6 + 0x3c770)
                                                    #7  0x000071fb7dee0fb9 n/a (nvidia_drv.so + 0xe0fb9)
                                                    #8  0x000071fb7dee191d n/a (nvidia_drv.so + 0xe191d)
                                                    #9  0x000071fb7dee52d9 n/a (nvidia_drv.so + 0xe52d9)
                                                    #10 0x000071fb7dee5459 n/a (nvidia_drv.so + 0xe5459)
                                                    #11 0x000071fb7dedcd8b n/a (nvidia_drv.so + 0xdcd8b)
                                                    #12 0x000071fb7dedfa98 n/a (nvidia_drv.so + 0xdfa98)
                                                    #13 0x000071fb7deeaeca n/a (nvidia_drv.so + 0xeaeca)
                                                    #14 0x000071fb7def0742 n/a (nvidia_drv.so + 0xf0742)
                                                    #15 0x000071fb7def189c n/a (nvidia_drv.so + 0xf189c)
                                                    #16 0x000071fb7def4562 n/a (nvidia_drv.so + 0xf4562)
                                                    #17 0x000071fb7dedaab9 n/a (nvidia_drv.so + 0xdaab9)
                                                    #18 0x000071fb7e2b0735 n/a (nvidia_drv.so + 0x4b0735)
                                                    ELF object binary architecture: AMD x86-64

These messages might be relevant as well:

Apr 23 11:20:51 voodoopc rtkit-daemon[1122]: Supervising 1 threads of 1 processes of 1 users.
Apr 23 11:21:12 voodoopc kernel: NVRM: GPU at PCI:0000:0a:00: GPU-25654bf8-0702-6d02-8a48-16659eac90f6
Apr 23 11:21:12 voodoopc kernel: NVRM: Xid (PCI:0000:0a:00): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 0009, Class 0000902d, Offs>
Apr 23 11:21:12 voodoopc rtkit-daemon[1122]: Supervising 1 threads of 1 processes of 1 users.
Apr 23 11:21:12 voodoopc systemd[1]: Started Process Core Dump (PID 1249726/UID 0).
Apr 23 11:21:12 voodoopc rtkit-daemon[1122]: Successfully made thread 1249728 of process 1110 owned by '1000' RT at priority 5.
Apr 23 11:21:12 voodoopc rtkit-daemon[1122]: Supervising 2 threads of 2 processes of 1 users.

Also, when I terminate the script running inside Blender myself, I get the following error message, followed by a Blender crash. Since the Blender interface is in Python, I don't think this is a coding mistake on my end ...

Freeing memory after the leak detector has run. This can happen when using static variables in C++ that are defined outside of functions. To fix this error, use the 'construct on first use' idiom.

Last edited by VoodooCode (2024-04-23 20:04:40)

seth · 2024-04-23 19:43:44

Please use [code][/code] tags. Edit your post in this regard.

Xid 69 is "Graphics Engine class error", which points to a hardware or driver issue:
https://docs.nvidia.com/deploy/xid-errors/index.html
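
To pull every Xid out of a longer journal, a minimal Python sketch like the following could help. It assumes the kernel lines look like the `NVRM: Xid (...)` line posted above; the `XID_NAMES` table is deliberately tiny and only an excerpt of the NVIDIA list, and `find_xids` is just an illustrative helper name.

```python
import re

# Small excerpt of NVIDIA's Xid table; anything else is reported as unknown.
XID_NAMES = {
    13: "Graphics Engine Exception",
    31: "GPU memory page fault",
    69: "Graphics Engine class error",
    79: "GPU has fallen off the bus",
}

# Matches e.g. "NVRM: Xid (PCI:0000:0a:00): 69, pid=..."
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+),")


def find_xids(log_text):
    """Return (pci_address, xid, description) for every Xid line in log_text."""
    hits = []
    for match in XID_RE.finditer(log_text):
        xid = int(match.group("xid"))
        name = XID_NAMES.get(xid, "unknown, see the Xid table")
        hits.append((match.group("pci"), xid, name))
    return hits
```

Feeding it the text of `journalctl -k -b -1` (kernel messages of the previous boot) would list each Xid that occurred before the crash.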

X11 aborts out of the nvidia driver. Is there an error message in the Xorg log (https://wiki.archlinux.org/title/Xorg#General)?

VoodooCode wrote:

I've added sleep commands into the code which increase stability somewhat

Temperature issue?

Semi-off topic:

VoodooCode wrote:

since the blender interface is in Python, I don't think this is a coding mistake on my end ...

The error is probably unrelated to the X11 crash, but the message indicates some late clean-up. Depending on what your script looks like, this can of course be an error between your keyboard and chair.

VoodooCode · 2024-04-23 20:04:08

Thanks for taking a look!

I've checked GPU and CPU temperatures: while rendering via CPU, the GPU stays unaffected below 50°C and the CPU stays below 65°C. On another note, if it was indeed a temperature issue, wouldn't the PC shut down completely to avoid heat-related damages?

I'm going to run the Python code in Blender again, this time hoping to provoke a crash, and report back with the Xorg log.

loqs · 2024-04-23 20:21:41

VoodooCode wrote:

On another note, if it was indeed a temperature issue, wouldn't the PC shut down completely to avoid heat-related damages?

Thermal related instability can occur below both the shutdown and down clock temperatures of the CPU/GPU.

Last edited by loqs (2024-04-23 20:23:09)

VoodooCode · 2024-04-23 20:50:21

loqs wrote:

VoodooCode wrote:

On another note, if it was indeed a temperature issue, wouldn't the PC shut down completely to avoid heat-related damages?

Thermal related instability can occur below both the shutdown and down clock temperatures of the CPU/GPU.

That's good to know, wasn't aware of that!

Anyhow, I just produced another crash, with a couple of interesting observations:

1) The requested Xorg output:

[  9427.104] (EE) 
[  9427.104] (EE) Backtrace:
[  9427.105] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.105] (EE) 0: /usr/lib/Xorg (?+0x0) [0x5dd5eb6f3ced]
[  9427.106] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.106] (EE) 1: /usr/lib/libc.so.6 (?+0x0) [0x73ad4d27d770]
[  9427.107] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.107] (EE) 2: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bee0fb9]
[  9427.107] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.107] (EE) 3: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bee191d]
[  9427.107] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.107] (EE) 4: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bee52d9]
[  9427.107] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.107] (EE) 5: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bee5459]
[  9427.107] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.107] (EE) 6: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bedcd8b]
[  9427.108] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.108] (EE) 7: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bedfa98]
[  9427.108] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.108] (EE) 8: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4beeaeca]
[  9427.108] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.108] (EE) 9: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bef0742]
[  9427.108] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.108] (EE) 10: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bef1c4f]
[  9427.108] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.108] (EE) 11: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bef4562]
[  9427.109] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.109] (EE) 12: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4bedaab9]
[  9427.109] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.109] (EE) 13: /usr/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x73ad4c2b0735]
[  9427.109] (EE) 
[  9427.109] (EE) Segmentation fault at address 0x34
[  9427.109] (EE) 
Fatal server error:
[  9427.109] (EE) Caught signal 11 (Segmentation fault). Server aborting
[  9427.109] (EE) 
[  9427.109] (EE) 
Please consult the The X.Org Foundation support 
	 at http://wiki.x.org
 for help. 
[  9427.109] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[  9427.109] (EE) 
[  9427.310] (EE) Server terminated with error (1). Closing log file.

2) To help *push* the crash, I had some computation running in the background. When the session reset, the IDE died (as expected). However, the computation stayed alive and finished. Would have expected for all processes, especially those with 100% CPU utilization to get killed but ... apparently they didn't ...

Regarding temperature, CPU went up to 70°C with split second peaks at 75°C, GPU never went above 50°C.
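
For reference, the GPU readings could also be logged once per second so they can be lined up with the crash timestamp afterwards. This is only a rough sketch: it assumes `nvidia-smi` is on the PATH, and the function names and the log path passed in are made up for illustration.

```python
import subprocess
import time

# Prints one integer (degrees Celsius) per GPU, one per line.
QUERY = ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"]


def parse_temperatures(output):
    """Turn the nvidia-smi output into a list of ints, one per GPU."""
    return [int(line) for line in output.splitlines() if line.strip()]


def log_gpu_temperature(path, interval_s=1.0):
    """Append 'unix_time temp[,temp...]' lines to path until interrupted."""
    with open(path, "a") as log:
        while True:
            result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
            temps = parse_temperatures(result.stdout)
            log.write(f"{time.time():.0f} {','.join(map(str, temps))}\n")
            log.flush()  # do not leave the last sample in a buffer if the process is killed
            time.sleep(interval_s)
```

Calling `log_gpu_temperature("gpu_temp.log")` from a terminal that is not part of the X session would keep the log going across the crash.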

seth · 2024-04-23 21:06:05

That's just the backtrace (likewise without symbols, but you're not getting them from the binary nvidia driver anyway; at best they'd be obfuscated).
I had hoped that, because of the abort, the server had issued some warning beforehand?

VoodooCode wrote:

Would have expected for all processes, especially those with 100% CPU utilization to get killed but ... apparently they didn't ...

Nothing gets killed when the X11 server crashes, but since X11 clients more or less critically depend on it, they'll typically terminate, and if you run e.g. a process inside a GUI terminal, that process will die with its terminal.
You can however run a process in tmux (which is not attached to the X11 server or session management), zap the X11 server, restart it, start a terminal emulator and re-attach to the running tmux; tmux and the process inside it will be fine. Likewise, processes on other VTs are completely unaffected.
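
As a rough sketch of the tmux approach, a render could be started in a detached session from Python like this. The session name and the file names in the usage note are hypothetical placeholders, and `tmux_command` / `start_detached` are illustrative helpers, not part of any library.

```python
import shlex
import subprocess


def tmux_command(session, argv):
    """Build the tmux call that starts argv in a new, detached session."""
    # tmux takes the command to run as a single shell string.
    return ["tmux", "new-session", "-d", "-s", session, shlex.join(argv)]


def start_detached(session, argv):
    """Start argv inside tmux so it does not live inside a GUI terminal."""
    subprocess.run(tmux_command(session, argv), check=True)
```

For example, `start_detached("render", ["blender", "-b", "scene.blend", "-P", "render_all.py"])` would run Blender in background mode with a script, and `tmux attach -t render` re-attaches to it after the X11 server has been restarted.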

VoodooCode · 2024-04-23 21:09:25

There are some messages before it, but they look like initialization steps and carry a considerably earlier timestamp.

[    26.991] (--) NVIDIA(GPU-0): DFP-7: disconnected
[    26.991] (--) NVIDIA(GPU-0): DFP-7: Internal TMDS
[    26.991] (--) NVIDIA(GPU-0): DFP-7: 165.0 MHz maximum pixel clock
[    26.991] (--) NVIDIA(GPU-0): 
[  9427.104] (EE) 
[  9427.104] (EE) Backtrace:
[  9427.105] (EE) unw_get_proc_name failed: no unwind info found [-10]
[  9427.105] (EE) 0: /usr/lib/Xorg (?+0x0) [0x5dd5eb6f3ced]

The funny part is that this crash only happens with Blender. If I run other programs that keep the CPU at 100% for several hours, nothing happens, but with Blender things may crash in under 30 minutes.

Last edited by VoodooCode (2024-04-23 21:10:40)

seth · 2024-04-23 21:24:14

VoodooCode wrote:

If I run other programs 100% the CPU for several hours, nothing happens

https://wiki.archlinux.org/title/Benchmarking#Graphics
https://wiki.archlinux.org/title/Benchm … ine_Engine
https://wiki.archlinux.org/title/Benchmarking#vkmark ?

The backtrace comes out of the nvidia X11 driver in response to an Xid from the nvidia kernel module, so Blender will do *something* with the GPU no matter which render mode you select.
Potentially https://wiki.archlinux.org/title/GPGPU#CUDA ?
Can you
1. run your script
2. w/o crashes
when blacklisting nvidia_uvm?
https://wiki.archlinux.org/title/Kernel … and_line_2
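
To confirm that the blacklist really took effect (something may still load the module explicitly), the contents of /proc/modules can be checked before and after running the script. A minimal sketch, with illustrative helper names:

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in the contents of /proc/modules."""
    # Each line starts with the module name, followed by size, use count, etc.
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_loaded(module, proc_modules_text):
    """True if module is listed; dashes are normalised to underscores."""
    return module.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of currently loaded modules."""
    with open(path) as handle:
        return handle.read()
```

`is_loaded("nvidia_uvm", read_proc_modules())` should stay `False` for the whole test run if the module is really kept out.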

VoodooCode · 2024-04-23 22:07:19

Ran the benchmarks:
Unigine Engine - valley worked flawlessly
vkmark also worked flawlessly
blender-benchmark failed with the following error:

BLENDER: Error: Illegal address in CUDA queue copy_from_device (integrator_init_from_bake integrator_shade_surface integrator_sorted_paths_array)

An error I've seen quite often in Blender proper when rendering large scenes on the GPU.

Will try the other test later in the day.

Edit1:
Also ran unigine-superposition at max settings with 20FPS - flawlessly and @ 80°C (GPU)

Added a file to /etc/modprobe.d/, blacklisting nvidia_uvm & rebooted - test currently running.

Edit2: Blender (& the xserver) crashed, after blacklisting nvidia_uvm

Last edited by VoodooCode (2024-04-23 23:25:31)

seth · 2024-04-24 07:59:53

Since https://archlinux.org/packages/extra/x86_64/cuda/ is an optional dependency of blender, have you tried what happens if you remove that?

Nikolai5 · 2024-04-24 10:34:47

May not relate to your issues, but I was facing system instability issues a bit back, it seemed to affect me most when I was encoding video using ffmpeg.

In the end I removed my CPU overclock and also disabled C-states in the BIOS. After that I've not had a single crash, and it was the C-states that were causing my PC to crash during ffmpeg encoding. I'm not saying that's the same issue for you, but if you're at your wit's end it might be worth looking into.
