
#1 2015-10-25 18:40:26

Dennis
Member
Registered: 2014-11-04
Posts: 56
Website

[it's dead jim] 1x hangup and 2x seemingly random visual glitches

Starting today...

1) When I booted this morning, instead of my usual lxdm login screen, all I saw was a black screen with a few tiny white rectangles sprinkled across it. Input devices seemed inactive (e.g. toggling the keyboard LEDs was no longer possible).

After a hard reset, I could use the machine the whole day without any incidents and shut it down normally in the late afternoon.

Later booted it again...

2) I could log in normally, but shortly after login, visual glitches similar to the ones in the morning appeared, along with some blinking rectangles and what looked like random memory bits being streamed through some screen areas.
This time, however, input devices still worked, and after I switched to a text-mode terminal via CTRL+ALT+F2, lxdm seemed to reset itself automatically and I could log in normally again.

I still rebooted afterwards and have been using the machine normally and without incidents since then, up to the time of writing (it is still running normally).

I upgraded linux-ck-core2 and the corresponding nvidia module today (as part of a full system upgrade):
linux-ck-core2 4.1.11-1
nvidia-ck-core2 355.11-4

I had also done a full system upgrade a day or two earlier.

The journalctl messages from around both incidents are below (newest entry first).

Do I need to worry that my hardware might be failing soon, or is this more likely a software problem? (I suspect software, since the problem did not occur before the upgrade, but how can I be sure?)

info from journalctl for 1):
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff810963a0>] ? kthread_worker_fn+0x170/0x170
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81573362>] ? ret_from_fork+0x42/0x70
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff810963a0>] ? kthread_worker_fn+0x170/0x170
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81096478>] ? kthread+0xd8/0xf0
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81090950>] ? process_one_work+0x470/0x470
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81090950>] ? process_one_work+0x470/0x470
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81090998>] ? worker_thread+0x48/0x4c0
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff8109062b>] ? process_one_work+0x14b/0x470
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa0088f36>] ? os_execute_work_item+0x46/0x70 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff811a4600>] ? kmem_cache_alloc+0x130/0x230
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa0540679>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa053d6f9>] ? _nv000677rm+0x229/0xb10 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa039c54d>] ? _nv013256rm+0x14d/0x260 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa04eb0dd>] ? _nv014286rm+0x3d/0x120 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa0539b28>] _nv012718rm+0x18/0x30 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffffa00889de>] os_acquire_mutex+0x3e/0x50 [nvidia]
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff810a6244>] down+0x44/0x50
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff810cee00>] ? current_kernel_time+0x50/0x50
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81570df8>] __down+0x78/0xc0
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff81571d2c>] schedule_timeout+0x1bc/0x250
Okt 25 03:46:12 DB2arch kernel:  [<ffffffff8156f46c>] schedule+0x1c/0x30
Okt 25 03:46:12 DB2arch kernel: Call Trace:
Okt 25 03:46:12 DB2arch kernel:  0000000797b093ca ffff8800bb3b68d0 ffff8800bb3b1830 ffffffffffffffff
Okt 25 03:46:12 DB2arch kernel:  01ffffffffffffff 0000000000016200 ffffffff818124c0 ffff88023fc16200
Okt 25 03:46:12 DB2arch kernel:  ffff88023533fb38 ffffffffffffffff ffff8800bb3b6cb8 0000000000000000
Okt 25 03:46:12 DB2arch kernel: Workqueue: events os_execute_work_item [nvidia]
Okt 25 03:46:12 DB2arch kernel: kworker/0:9     D ffff88023533fb38     0   930      2 0x00000000
Okt 25 03:46:12 DB2arch kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Okt 25 03:46:12 DB2arch kernel:       Tainted: P           O    4.1.10-3-ck #1
Okt 25 03:46:12 DB2arch kernel: INFO: task kworker/0:9:930 blocked for more than 120 seconds.

info from journalctl for 2):
Okt 25 18:55:34 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000000, engmask 00000101, intr 10000000
Okt 25 18:55:30 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000001, engmask 00000101, intr 10000000
Okt 25 18:55:30 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0001, Class 00009097, Offset 00001614, Data 00000000
Okt 25 18:55:30 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50de48=0x2e0009 0x50de50=0x4 0x50de44=0x1beff2 0x50de4c=0xf
Okt 25 18:55:30 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 3): Physical Multiple Warp Errors
Okt 25 18:55:30 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3): Illegal Instruction Encoding
Okt 25 18:55:29 DB2arch kernel: NVRM: GPU at PCI:0000:01:00: GPU-f4330fde-4957-cab1-65f6-9ce826ffadbd

Last edited by Dennis (2015-10-27 07:29:07)


#2 2015-10-26 08:18:25

Dennis
Member
Registered: 2014-11-04
Posts: 56
Website

Re: [it's dead jim] 1x hangup and 2x seemingly random visual glitches

Just experienced the glitch described under 2) in the first post again. It happened after ~3h of uptime. lxdm reset itself again, but the system froze briefly before that.

from journalctl for this incident:

Okt 26 09:11:35 DB2arch systemd-coredump[13599]: Process 408 (Xorg) of user 0 dumped core.
Okt 26 09:11:34 DB2arch rtkit-daemon[3710]: Supervising 1 threads of 1 processes of 1 users.
Okt 26 09:11:34 DB2arch systemd[3197]: Started Sound Service.
Okt 26 09:11:34 DB2arch rtkit-daemon[3710]: Successfully made thread 13616 of process 13616 (/usr/bin/pulseaudio) owned by '1000' high priority at nice level -11.
Okt 26 09:11:34 DB2arch systemd[3197]: Stopped Sound Service.
Okt 26 09:11:34 DB2arch acpid[343]: 1 client rule loaded
Okt 26 09:11:34 DB2arch systemd[3197]: pulseaudio.service: Service hold-off time over, scheduling restart.
Okt 26 09:11:34 DB2arch acpid[343]: client connected from 13603[0:0]
Okt 26 09:11:34 DB2arch systemd[3197]: pulseaudio.service: Failed with result 'exit-code'.
Okt 26 09:11:34 DB2arch acpid[343]: client 408[0:0] has disconnected
Okt 26 09:11:34 DB2arch systemd[3197]: pulseaudio.service: Unit entered failed state.
Okt 26 09:11:34 DB2arch acpid[343]: client 408[0:0] has disconnected
Okt 26 09:11:34 DB2arch systemd[3197]: pulseaudio.service: Main process exited, code=exited, status=1/FAILURE
Okt 26 09:11:35 DB2arch org.a11y.atspi.Registry[3778]: after 14265 requests (14265 known processed) with 0 events remaining.
Okt 26 09:11:34 DB2arch org.a11y.atspi.Registry[3778]: XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
Okt 26 09:11:34 DB2arch lxdm-session[3173]: pam_unix(lxdm:session): session closed for user dennis
Okt 26 09:11:15 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000000, engmask 00000101, intr 10000000
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00008000
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0001, Class 00009097, Offset 0000238c, Data 0000ab00
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50de48=0x2e0001 0x50de50=0x0 0x50de44=0x1beff2 0x50de4c=0xf
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3): Stack Error
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50ce48=0x2e0001 0x50ce50=0x0 0x50ce44=0x1beff2 0x50ce4c=0xf
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 1): Stack Error
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50c648=0x2e0001 0x50c650=0x0 0x50c644=0x1beff2 0x50c64c=0xf
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Stack Error
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x505648=0x2e0001 0x505650=0x0 0x505644=0x1beff2 0x50564c=0xf
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 2): Stack Error
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504e48=0x2e0001 0x504e50=0x0 0x504e44=0x1beff2 0x504e4c=0xf
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1): Stack Error
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504648=0x2e0001 0x504650=0x0 0x504644=0x1beff2 0x50464c=0xf
Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Stack Error
Okt 26 09:11:11 DB2arch kernel: NVRM: GPU at PCI:0000:01:00: GPU-f4330fde-4957-cab1-65f6-9ce826ffadbd
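
To get an overview of which Xid codes and which GPC/TPC units show up in output like the above, the lines can be tallied with a few lines of Python. This is only a sketch: the function name summarize_xid and the regular expressions are assumptions based on the message format seen in this thread, not anything provided by the NVIDIA driver.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log lines in this thread, e.g.
#   NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3): Stack Error
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),")
UNIT_RE = re.compile(r"\(GPC (?P<gpc>\d+), TPC (?P<tpc>\d+)\)")


def summarize_xid(lines):
    """Count Xid codes and the (GPC, TPC) units named in the given log lines."""
    codes = Counter()
    units = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if not match:
            continue  # not an Xid message
        codes[int(match.group("code"))] += 1
        unit = UNIT_RE.search(line)
        if unit:
            units[(int(unit.group("gpc")), int(unit.group("tpc")))] += 1
    return codes, units


# A few lines copied from the journal output above.
SAMPLE = [
    "Okt 26 09:11:15 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000000, engmask 00000101, intr 10000000",
    "Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel ID 00000001 intr 00008000",
    "Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x50de48=0x2e0001 0x50de50=0x0 0x50de44=0x1beff2 0x50de4c=0xf",
    "Okt 26 09:11:11 DB2arch kernel: NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 3): Stack Error",
]

codes, units = summarize_xid(SAMPLE)
for code, count in sorted(codes.items()):
    print(f"Xid {code}: {count} message(s)")
for (gpc, tpc), count in sorted(units.items()):
    print(f"GPC {gpc}, TPC {tpc}: {count} exception(s)")
```

The same function could be fed the kernel messages of a whole boot (for example the output of journalctl -k) instead of the hand-picked sample lines.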

Last edited by Dennis (2015-10-26 14:10:27)


#3 2015-10-26 08:27:28

WorMzy
Forum Moderator
From: Scotland
Registered: 2010-06-16
Posts: 11,868
Website

Re: [it's dead jim] 1x hangup and 2x seemingly random visual glitches

Please use code tags when pasting logs, output or config files.

https://wiki.archlinux.org/index.php/Fo … s_and_code

Thanks.


Sakura:-
Mobo: MSI MAG X570S TORPEDO MAX // Processor: AMD Ryzen 9 5950X @4.9GHz // GFX: AMD Radeon RX 5700 XT // RAM: 32GB (4x 8GB) Corsair DDR4 (@ 3000MHz) // Storage: 1x 3TB HDD, 6x 1TB SSD, 2x 120GB SSD, 1x 275GB M2 SSD

Making lemonade from lemons since 2015.


#4 2015-10-26 14:26:20

Dennis
Member
Registered: 2014-11-04
Posts: 56
Website

Re: [it's dead jim] 1x hangup and 2x seemingly random visual glitches

Ok (I usually do... misjudged those snippets as not too long to paste plain).

I'm writing this from another machine. The one which this thread is about sadly seems to have a failing video card.

The random garbage now appears more and more frequently whenever the video card switches away from plain-text mode, and the visual garbage comes with freezes. I think I can safely rule out a driver problem, as I'm getting the same behavior when I boot Windows 7 on that machine (unless the Linux driver somehow left the card in a state from which it can't easily recover, but that seems unlikely after a full power-down).

I opened the case, inspected all pieces, disconnected everything, removed all visible dust (as I suspected a heat problem), could not see any obviously blown parts, put everything back together, ran a full memtest86+ (no errors)... the glitches and freezes still remained.

So, I've shut it down now. I hope the video card is the only broken part I need to replace. The fact that memtest ran without error and that everything else (in text-mode) seems to work as usual somehow suggests it is just the video card. I hope.

Unfortunately, I can't test the card in a different mainboard to be sure, nor do I have a suitable spare card to try in this mainboard to see whether the mainboard is at fault. I'm out of ideas on how to determine the root cause of the problem.

Thoughts?

Last edited by Dennis (2015-10-26 14:27:19)


#5 2015-10-26 21:37:59

Dennis
Member
Registered: 2014-11-04
Posts: 56
Website

Re: [it's dead jim] 1x hangup and 2x seemingly random visual glitches

Another very thorough cleaning later I still felt an urge for further investigation. It was just odd that the card would work without problems in text-mode and only start throwing glitches in graphics mode.

Also, when booting from a GParted live stick, the minimal graphics mode it uses did not seem to be affected either, so I had a hunch the problem might be caused by hardware-accelerated modes.

So, I adjusted my xorg.conf to select the lowest possible performance level for PowerMizer like this:

    Option         "RegistryDwords" "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerLevel=0x3; PowerMizerDefault=0x3; PowerMizerDefaultAC=0x3"
   
And I could start lxdm with that... free of glitches.
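
To double-check what that option string actually sets, it can be split into its key/value pairs. The following is a minimal sketch; parse_registry_dwords is a hypothetical helper name, and the code only decodes the hex numbers without claiming anything about what the driver does with them.

```python
def parse_registry_dwords(value):
    """Split a "RegistryDwords" option string into a {key: int} dict."""
    settings = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, raw = pair.partition("=")
        # Base 0 lets int() accept both "0x3" and plain decimal values.
        settings[key.strip()] = int(raw.strip(), 0)
    return settings


# The exact string from the xorg.conf line above.
OPTION = (
    "PowerMizerEnable=0x1; PerfLevelSrc=0x2222; PowerMizerLevel=0x3; "
    "PowerMizerDefault=0x3; PowerMizerDefaultAC=0x3"
)

settings = parse_registry_dwords(OPTION)
for key, val in settings.items():
    print(f"{key:20s} = {val:#x}")
```

This makes it easy to spot a typo in one of the keys or a value that differs between the three PowerMizer level entries.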

Looked at nvidia-settings now:
[screenshot: 20151026_oddclockrate0jz3k.png]
       
At first I thought the memory rate for the highest performance level looked way too high (twice what the specs say), but I guess it displays the DDR value(?), so it seems to comply with the specs (http://www.gainward.com/main/product/vg … .pdf?s=235).

So... previously it was, for whatever reason, always running at the highest level (in Windows too, since the glitches appeared there as well), so it can't be a Linux driver issue.

The question now is: how can I force the card to stay at the lowest performance level and not switch up automatically, as apparently the highest level (maybe) causes overheating issues? (That of course won't fix it for Windows, but I rarely use that anyway.)

Or should that memory transfer rate be 2100MHz? Also, where does it read those clock settings from? The video BIOS? (It must come from the hardware, as it survives a reboot and the values are the same under different systems.) And why would the video BIOS suddenly have wrong values? Virus? Bit rot?

Where would I get a fixed BIOS and how would I flash it from Linux? Or what else should I do? Get a replacement card?

Questions... :D


#6 2015-10-27 07:28:03

Dennis
Member
Registered: 2014-11-04
Posts: 56
Website

Re: [it's dead jim] 1x hangup and 2x seemingly random visual glitches

Writing this from the other machine again. The video card started acting up even in the low performance level just a few minutes after booting today. So... I had to order a replacement. *sigh*

So... derp. Card and thread abandoned.

