
#1 2024-05-04 22:53:07

alex.gt
Member
Registered: 2022-12-08
Posts: 8

[solved] crash or unresponsive, nvme uefi culprits?

Greetings, folks.
First of all, I'd like to say that I have no logs, because the system didn't record any entries after the weird events. Thank you very much for your attention.
After 10 months of running smoothly on a daily basis (turned off at the end of each day), my new machine showed some very weird behaviour. That was on about April 10th, 2024.
On May 2nd it happened again, but not in exactly the same manner.

When it happened the first time I was coding and nothing really heavy was running, AFAIK: a music player, two IDEs, LibreOffice spreadsheets, a note-taking app, a VPN, some terminals, Firefox and Chrome, all of which I use every single work day.
All of a sudden things stopped responding as expected. One of the IDEs closed after I asked for a search. I tried a simple ping and got nothing. Then I started to get some "/usr/bin/bash not found" messages, if I remember correctly.

The second time it happened the machine was unattended. When I came back and saw the screen, it wasn't the way I had left it. It was showing the usual verbose output of an X session ending. The cursor was blinking but nothing responded, and I couldn't switch to a virtual terminal with Ctrl+Alt+Fn.

Both times, after a hard restart, I learned that the UEFI boot entry had gone missing, because I was greeted by the UEFI asking for a password (secure boot enabled).

The second time I just pushed the reset button. After some UEFI tweaking to boot from a USB drive loaded with Arch Linux, as soon as the command prompt was available I could not see the nvme device I was used to. It appeared to have a single 2 GB partition and an identifier I was not used to either, something like MN-5520, I think. Unfortunately I didn't take pictures. To my relief, after a shutdown and a fresh USB boot the nvme device showed up with two partitions, as expected.

I just arch-chrooted to install grub and get it working again (I had to read some Arch wiki entries before knowing what to do).
As I stated earlier no useful logs were recorded. Checked it with journalctl -b -3:

[... journal tail follows]
abr 10 12:26:11 relic rtkit-daemon[768]: Supervising 8 threads of 2 processes of 1 users.
abr 10 12:26:11 relic rtkit-daemon[768]: Supervising 8 threads of 2 processes of 1 users.
abr 10 12:26:52 relic rtkit-daemon[768]: Supervising 8 threads of 2 processes of 1 users.
abr 10 12:26:52 relic rtkit-daemon[768]: Supervising 8 threads of 2 processes of 1 users.
-- Boot 9871555b9ddc4e60952a16f63a139e5f --
abr 10 18:27:08 relic kernel: Linux version 6.8.4-arch1-1 (linux@archlinux) (gcc (GCC) 13.2.1 20230801, GNU ld (GNU Binutils) 2.42.0) #1 SMP PREEMPT_DYNAMIC Fri, 05 Apr 2024 00:14:23 +0000
abr 10 18:27:08 relic kernel: Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=e0b17959-1e30-41cb-98a3-58a75b2e793e rw loglevel=3 quiet
abr 10 18:27:08 relic kernel: BIOS-provided physical RAM map

[...]
mai 02 10:54:28 relic rtkit-daemon[804]: Supervising 8 threads of 2 processes of 1 users.
mai 02 10:54:28 relic rtkit-daemon[804]: Supervising 8 threads of 2 processes of 1 users.
mai 02 10:55:38 relic rtkit-daemon[804]: Supervising 8 threads of 2 processes of 1 users.
mai 02 10:55:38 relic rtkit-daemon[804]: Supervising 8 threads of 2 processes of 1 users.
-- Boot b2fa520a20214f3aa166ea1bef5da0ff --
mai 02 13:38:14 relic kernel: Linux version 6.8.8-arch1-1 (linux@archlinux) (gcc (GCC) 13.2.1 20240417, GNU ld (GNU Binutils) 2.42.0) #1 SMP PREEMPT_DYNAMIC Sun, 28 Apr 2024 15:59:47 +0000
mai 02 13:38:14 relic kernel: Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=e0b17959-1e30-41cb-98a3-58a75b2e793e rw loglevel=3 quiet
mai 02 13:38:14 relic kernel: BIOS-provided physical RAM map:
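
For anyone repeating this check, here is a minimal Python sketch that pulls out the last line logged before each "-- Boot <id> --" separator, i.e. the last thing the journal managed to record before a crash. The function name is hypothetical, and it assumes the journal has been exported as plain text with the separators shown above.

```python
import re

# Separator journalctl prints between boots in the excerpt above, e.g.
# "-- Boot 9871555b9ddc4e60952a16f63a139e5f --"
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]{32}) --$")


def last_lines_before_boots(journal_text):
    """Return (boot_id, last_line_before_marker) for every boot marker."""
    results = []
    previous = None  # last non-empty, non-marker line seen so far
    for line in journal_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = BOOT_MARKER.match(line)
        if match:
            results.append((match.group(1), previous))
            previous = None
        else:
            previous = line
    return results


if __name__ == "__main__":
    # Sample taken from the excerpt above (hypothetical variable name).
    sample = (
        "abr 10 12:26:52 relic rtkit-daemon[768]: Supervising 8 threads of 2 processes of 1 users.\n"
        "-- Boot 9871555b9ddc4e60952a16f63a139e5f --\n"
        "abr 10 18:27:08 relic kernel: BIOS-provided physical RAM map\n"
    )
    for boot_id, line in last_lines_before_boots(sample):
        print(boot_id, "<-", line)
```

A large gap between the timestamp of that last line and the first line of the next boot suggests the journal stopped being written well before the reset.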

From pacman.log:

[... the kernel running at the first event]
[2024-04-09T09:13:00-0300] [ALPM] upgraded linux (6.8.2.arch2-1 -> 6.8.4.arch1-1)
[... the kernel running at the second event]
[2024-05-01T17:20:27-0300] [ALPM] upgraded linux-lts (6.6.25-1 -> 6.6.29-1)
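
To line up kernel upgrades with the incident dates, the relevant pacman.log entries can be filtered with a small Python sketch like the one below. It assumes the "[timestamp] [ALPM] upgraded pkg (old -> new)" line format shown above; the function name and the default package list are illustrative choices, not anything pacman provides.

```python
import re

# One "[timestamp] [ALPM] upgraded <pkg> (<old> -> <new>)" line per upgrade.
UPGRADE = re.compile(
    r"^\[(?P<when>[^\]]+)\] \[ALPM\] upgraded (?P<pkg>\S+) "
    r"\((?P<old>\S+) -> (?P<new>\S+)\)$"
)


def kernel_upgrades(log_text, packages=("linux", "linux-lts")):
    """Return (when, pkg, old, new) tuples for upgrades of the given packages."""
    found = []
    for line in log_text.splitlines():
        match = UPGRADE.match(line.strip())
        if match and match.group("pkg") in packages:
            found.append(
                (match.group("when"), match.group("pkg"),
                 match.group("old"), match.group("new"))
            )
    return found


if __name__ == "__main__":
    # The two lines quoted above.
    sample = (
        "[2024-04-09T09:13:00-0300] [ALPM] upgraded linux (6.8.2.arch2-1 -> 6.8.4.arch1-1)\n"
        "[2024-05-01T17:20:27-0300] [ALPM] upgraded linux-lts (6.6.25-1 -> 6.6.29-1)\n"
    )
    for entry in kernel_upgrades(sample):
        print(entry)
```

On a real system the text would come from /var/log/pacman.log.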

So could you please advise? Does this look like a hardware problem to you?

ASUS B550M-Plus DDR4 AM4
nvme Disk model: PCH-ADNA1-1TB (don't know which OEM made it)
https://www.techpowerup.com/ssd-specs/p … 1-tb.d1525

Thank you.

Last edited by alex.gt (2024-08-30 22:01:12)


#2 2024-05-05 07:25:39

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,177


As I stated earlier no useful logs were recorded. Checked it with journalctl -b -3:

On the second time I just pushed the reset button.

https://wiki.archlinux.org/title/Keyboa … el_(SysRq) - nb. that you've to explicitly enable that *before* you might need it!
But if the root device drops out, you won't be syncing to disk anyway.

As generic advice, nvme's suffer from broken APST implementations, https://wiki.archlinux.org/title/Solid_ … leshooting
"nvme_core.default_ps_max_latency_us=0 iommu=soft", maybe add "pcie_aspm=off" to the list.

If it's the nvme, a live distro run from a USB key would not be affected.
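
After editing the bootloader configuration, it is easy to forget whether the parameters suggested above actually reached the running kernel. The following Python sketch compares a kernel command line against that list. The function and list names are hypothetical, and the match is exact, so a parameter set to a different value (for example a non-zero latency) is reported as missing.

```python
# Parameters suggested in this post; each is matched as a whole word.
SUGGESTED = [
    "nvme_core.default_ps_max_latency_us=0",
    "iommu=soft",
    "pcie_aspm=off",
]


def missing_params(cmdline, wanted):
    """Return the entries of `wanted` that do not appear in `cmdline`."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


if __name__ == "__main__":
    # Command line from the journal excerpt in post #1.
    cmdline = (
        "BOOT_IMAGE=/vmlinuz-linux "
        "root=UUID=e0b17959-1e30-41cb-98a3-58a75b2e793e rw loglevel=3 quiet"
    )
    for param in missing_params(cmdline, SUGGESTED):
        print("not set:", param)
```

On a running system the string to check would be the contents of /proc/cmdline.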


#3 2024-05-05 23:56:05

alex.gt
Member
Registered: 2022-12-08
Posts: 8


Thanks @seth. I really admire your attention!

I think I don't have a SysRq key. I'll have to check the Windows XBows customizer driver to see if I can map it somewhere. Anyway, it was really good advice!

I followed the thread in issue 1678184. It's rather old but I think you were involved :roll:

$ sudo smartctl -c /dev/nvme0
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     2.80W       -        -    2  2  2  2       50     200
 3 -   0.1500W       -        -    3  3  3  3      500    5000
 4 -   0.0200W       -        -    4  4  4  4     2000   60000

I'll try a setting of 5000 us before giving up on APST.
For now I'll hold off on the IOMMU setting and on turning off ASPM.
I will check back if things evolve, especially to close the issue.
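
To see what a 5000 us limit means for the table above, here is a rough Python sketch that parses the smartctl power state table and lists the non-operational states that fit within a latency budget. Whether the kernel compares the limit against the exit latency alone or against entry plus exit latency is an assumption I have not verified, so the sketch can compute both; all names are hypothetical.

```python
# Table copied from the smartctl -c output above; latencies are in microseconds.
SMARTCTL_TABLE = """\
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        5       5
 1 +     3.30W       -        -    1  1  1  1       50     100
 2 +     2.80W       -        -    2  2  2  2       50     200
 3 -   0.1500W       -        -    3  3  3  3      500    5000
 4 -   0.0200W       -        -    4  4  4  4     2000   60000
"""


def parse_power_states(table_text):
    """Parse the rows of the table into dicts; the header row is skipped."""
    states = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        states.append({
            "state": int(fields[0]),
            "operational": fields[1] == "+",  # "-" marks a non-operational state
            "entry_us": int(fields[9]),
            "exit_us": int(fields[10]),
        })
    return states


def idle_states_within(states, max_latency_us, use_total=False):
    """List non-operational states whose latency fits the budget.

    use_total=False compares the exit latency only; use_total=True compares
    entry + exit. Which rule the kernel applies is an assumption here.
    """
    allowed = []
    for s in states:
        if s["operational"]:
            continue
        latency = s["exit_us"] + (s["entry_us"] if use_total else 0)
        if latency <= max_latency_us:
            allowed.append(s["state"])
    return allowed


if __name__ == "__main__":
    states = parse_power_states(SMARTCTL_TABLE)
    print("exit latency only:", idle_states_within(states, 5000))
    print("entry + exit:     ", idle_states_within(states, 5000, use_total=True))
```

Under either rule, state 4 (60000 us exit latency) falls outside a 5000 us budget.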

What really puzzles me is that it took ten months for this to happen for the first time.


#4 2024-05-06 08:09:01

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,177


I think I don't have a sysreq key.

It's the print key.

It's rather old but I think you were involved

Hey…! ;)

What really puzzles me is that it took ten months for this to happen for the first time.

Kernel regression?
Does it happen w/ the LTS or latest 6.7.x kernels?
(posted journal segments only show 6.8)
https://wiki.archlinux.org/title/Arch_Linux_Archive


#5 2024-05-06 15:13:39

alex.gt
Member
Registered: 2022-12-08
Posts: 8


It's the print key.

LOL, and that's with computers having been part of my life for more than 35 years.
I tested the SysRq sync function and the journal acknowledged it.
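
For reference, whether the sync function is permitted depends on the kernel.sysrq value. A minimal Python sketch of that check, assuming the bitmask documented in the kernel's SysRq guide (0 disables everything, 1 enables everything, 16 is the bit for the sync command); the function name is hypothetical:

```python
SYSRQ_SYNC = 16  # bitmask bit documented as "enable sync command"


def sysrq_allows_sync(value):
    """Return True if a kernel.sysrq value permits the SysRq sync function."""
    if value == 1:  # 1 means all SysRq functions are enabled
        return True
    return bool(value & SYSRQ_SYNC)


if __name__ == "__main__":
    for value in (0, 1, 16, 176):
        print(value, sysrq_allows_sync(value))
```

On a running system the value can be read from /proc/sys/kernel/sysrq.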

About the kernel versions:

  • incident 1, stable 6.8.4.arch1-1

  • incident 2, LTS 6.6.29-1

  • current, 6.8.9-arch1-1

Update: as expected, I no longer see transitions to power state 4 (checked with sudo nvme get-feature -f 0x0c -H /dev/nvme0)!


#6 2024-08-30 22:00:49

alex.gt
Member
Registered: 2022-12-08
Posts: 8


It happened a third time. So I bought and installed a new one.
Almost two months now and everything is fine.
I consider the question solved.
