
#1 2026-01-12 23:12:55

cryptearth
Member
Registered: 2024-02-03
Posts: 1,962

nvme ssd about to die or just hiccup?

as the hardware forum is for

Problems and questions concerning kernel and hardware support.

I don't see my topic fitting there - and since there's no other section it would fit better, I guess this is the correct one - although I guess this means it won't get the attention I would like

anyway

https://lim.cryptearth.de/~cryptearth/journal.txt (@seth: yes, I know, it contains my public ipv6 - but that's scattered across the forum anyway - but feel free to point it out anyway)

today my system ran into an issue - or rather: my nvme ssd did (in the log above the fun starts at 19:35:05 - roughly 2/3 to 3/4 down - but as I got it I included the full journal)
booted the system - all fine
started steam - still fine
started arma3 launcher - still fine
clicked "play" in the launcher to start the game proper - all hell broke loose - or rather: it froze over
nothing - no mouse cursor movement, no num-lock state change, no tty change, no ssh in from phone - dead
my machine is a bit more than an arm's length away from where I sit - and just as I figured "well, crashed" it finally switched to tty and showed this on top:

I/O error, dev nvme0n1, sector 170866816 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
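For reference, the sector number in that kernel message can be turned into a byte offset on the device. A minimal sketch, assuming the kernel reports sectors in 512-byte units (which also matches the 512-byte LBA format shown in the smartctl output further down); the function name is made up for illustration:

```python
import re

# Assumption: block-layer sectors in kernel messages are 512 bytes each.
KERNEL_SECTOR_BYTES = 512

def parse_io_error(line):
    """Extract device, sector, operation and byte offset from an 'I/O error' line."""
    m = re.search(
        r"I/O error, dev ([^,\s]+), sector (\d+) op 0x[0-9a-fA-F]+:\((\w+)\)",
        line,
    )
    if m is None:
        raise ValueError("not a kernel I/O error line")
    dev, sector, op_name = m.group(1), int(m.group(2)), m.group(3)
    return {
        "dev": dev,
        "sector": sector,
        "op": op_name,
        "byte_offset": sector * KERNEL_SECTOR_BYTES,
    }

line = ("I/O error, dev nvme0n1, sector 170866816 op 0x0:(READ) "
        "flags 0x80700 phys_seg 4 prio class 2")
info = parse_io_error(line)
print(info["dev"], info["op"], info["byte_offset"])
```

With those assumptions the failed read sits roughly 87.5 GB into the device.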

ok - nvme just died - tried to login anyway - which worked - and rebooted instantly
as the system rebooted the uefi POST took quite some time - and then suddenly pxe kicked in and booted from network
still all took quite some time - managed to boot archinstall - the nvme still showed up in lspci but was inaccessible (didn't show up in lsblk)
shut down - and thought: "well, that's it"
booted again - again into pxe - back into archinstall - and wow - nvme showed up again - which allowed me to get the journal

to give you the full picture: every once in a while I do get a warning at boot about the nvme exceeding its temp rating - until now I thought it's just some hotspot ... yes, when I installed it I made sure to remove the cover from the heat pad and screw down tightly - so it should work as intended

I ran smartctl - but nothing stands out - about 10% "used" - and as I only have the OS on it (home lives on a zfs pool) there's nothing I would lose (well, except the ability to boot into an OS, as I haven't set up my pool to be bootable) but the drive isn't that old (from Oct 2022) and not used that much (only for the OS and some caches)
with its temp issue: is it just a bad batch that was doomed from manufacturing? or was it maybe just a hiccup, maybe due to overload?

it's not about "throw it out and replace it" - 512gb is about 60 bucks - and I don't even use half of it - 256gb would go for 30 - it's just: "is it toast already?"
a Samsung Electronics Co Ltd NVMe SSD Controller 980 (DRAM-less) should get me further than just about 3 years of OS-only use - I don't even use it as cache for my zfs pool (that was recommended against, as such a use case would have shredded it within months - for that I use a regular sata ssd - and already the 2nd one - so yes, using flash as zfs cache does shred it fast)

is there maybe any test I can run that could give more insight?

thx anyway


#2 2026-01-12 23:46:08

RedGreen925
Member
Registered: 2025-11-03
Posts: 2


cryptearth wrote:

as the hardware forum is for
is there maybe any test I can run that could give more insight?

thx anyway


Give qdiskinfo a go and see what it thinks of the smartctl results.

https://aur.archlinux.org/packages/qdiskinfo


#3 2026-01-13 11:15:37

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 14,785


Please add the output of smartctl.




Moderator Note

as the hardware forum is for

    Problems and questions concerning kernel and hardware support.

I don't see my topic fitting there - and since there's no other section it would fit better, I guess this is the correct one - although I guess this means it won't get the attention I would like

Your problem could be caused by a kernel change; in my opinion it fits fine in the kernel & hardware board.
Moving to "Kernel & Hardware".




#4 2026-01-13 15:17:17

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,217


screw down tightly

"too"? Make sure the board/card isn't under tension.

That aside and ceterum censo: https://wiki.archlinux.org/title/Solid_ … leshooting

@seth: yes, I know, it contains my public ipv6

Since you're collecting our IPs that's just fair :P


#5 2026-01-13 16:35:11

cryptearth
Member
Registered: 2024-02-03
Posts: 1,962


RedGreen925 wrote:

Give qdiskinfo a go and see what it thinks of the smartctl results.

qdiskinfo reports all green - overall rating is "good 94%"

Lone_Wolf wrote:

Please add the output of smartctl.

here you go

smartctl -x /dev/nvme0
smartctl 7.5 2025-04-30 r5714 [x86_64-linux-6.18.5-arch1-1] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 500GB
Serial Number:                      S64DNF0T311968E
Firmware Version:                   2B4QFXO7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500.107.862.016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      5
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500.107.862.016 [500 GB]
Namespace 1 Utilization:            240.077.103.104 [240 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 d321b96266
Local Time is:                      Tue Jan 13 17:14:17 2026 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x10):        NP_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.24W       -        -    0  0  0  0        0       0
 1 +     4.49W       -        -    1  1  1  1        0       0
 2 +     2.19W       -        -    2  2  2  2        0     500
 3 -   0.0500W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    6%
Data Units Read:                    108.368.519 [55,4 TB]
Data Units Written:                 50.134.344 [25,6 TB]
Host Read Commands:                 1.476.762.504
Host Write Commands:                1.661.538.581
Controller Busy Time:               2.217
Power Cycles:                       942
Power On Hours:                     806
Unsafe Shutdowns:                   41
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    1334
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               38 Celsius
Thermal Temp. 2 Transition Count:   79284
Thermal Temp. 2 Total Time:         71912

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged

as shown: there's quite a high count of over-temp warnings
I also had a logger running with quite a high polling rate - and it seems to be only very short spikes - often at the start of some access - but sometimes even during long-running I/O - under constant load (copying a few GB) it does sometimes reach the high 50s/low 60s - but stays there - hence I'm not sure if this spike can even be trusted or if it's maybe just a faulty thermal diode p/n-junction
also: when the spike occurs it's always exactly 84°C
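To put the 84°C spikes next to the thresholds from the smartctl output above (warning at 82, critical at 85), here is a minimal sketch that sorts logged readings into bands and flags sudden jumps. The sample readings are made up for illustration and are not taken from the actual logger; the function names and the 15°C jump size are hypothetical, and treating a reading equal to a threshold as already over it is an assumption:

```python
WARNING_C = 82   # "Warning  Comp. Temp. Threshold" from the smartctl output
CRITICAL_C = 85  # "Critical Comp. Temp. Threshold" from the smartctl output

def classify(temp_c):
    """Sort a composite temperature reading into ok / warning / critical."""
    if temp_c >= CRITICAL_C:
        return "critical"
    if temp_c >= WARNING_C:
        return "warning"
    return "ok"

def find_spikes(samples, jump_c=15):
    """Return indices where a reading rises by at least jump_c over the previous one."""
    return [i for i in range(1, len(samples))
            if samples[i] - samples[i - 1] >= jump_c]

# Hypothetical log: idle, one short spike, then a sustained copy load.
samples = [36, 38, 84, 41, 50, 58, 61]

for i in find_spikes(samples):
    print(f"sample {i}: {samples[i]} C -> {classify(samples[i])}")
```

A reading of 84 lands in the warning band but stays just under the critical threshold, which would fit a non-zero warning time next to a critical time of 0.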

Lone_Wolf wrote:

Moderator Note
Your problem could be caused by a kernel change; in my opinion it fits fine in the kernel & hardware board.
Moving to "Kernel & Hardware".

appreciated

seth wrote:

screw down tightly

"too"? Make sure the board/card isn't under tension.

board is a msi b550-a pro: https://de.msi.com/Motherboard/B550-A-PRO/support
the nvme is installed in the top-most slot between cpu and gpu
the installation is explained on pages 31/32 of the English manual, which I followed very carefully, paying close attention as this was my first time installing an nvme
it's installed correctly with the little standoff at the correct length location - the "stick" is parallel to the board and there are no additional standoffs under it - as it should be
I also made sure to place the cover correctly and made sure to have the plastic peel cover removed
I only used a regular fitting screwdriver and only torqued "hand tight" - as in "firm but not overstressing" - I didn't use a powered screwdriver or any means of additional leverage
so, as this was my very first nvme, I guess I did it all ok, closely following the manual (yes, it's not rocket science - but it was my first - and as it was newly bought I paid extra attention)

seth wrote:

I'll give that a shot

seth wrote:

@seth: yes, I know, it contains my public ipv6

Since you're collecting our IPs that's just fair :P

btw - off-topic: even if someone knows my current ipv6 - an incoming request shouldn't be able to "just bypass" the nat-"firewall" (yes, I know: NAT is NOT a firewall) of my router without me explicitly opening a port forwarding - or am I wrong on this?


#6 2026-01-13 19:08:37

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 73,217


smart data looks unsuspicious, see whether it's just APST.
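For context on APST: the two non-operational power states in the smartctl table above (states 3 and 4) are the ones the drive can drop into on its own when idle. A rough sketch of which states stay eligible under a given latency tolerance, assuming a non-operational state is allowed when its entry plus exit latency does not exceed the tolerance set via the `nvme_core.default_ps_max_latency_us` kernel parameter. The exact kernel selection logic is an assumption here, and the helper name is hypothetical:

```python
# (state, operational, entry latency in us, exit latency in us), copied from
# the "Supported Power States" table in the smartctl output above.
POWER_STATES = [
    (0, True, 0, 0),
    (1, True, 0, 0),
    (2, True, 0, 500),
    (3, False, 210, 1200),
    (4, False, 1000, 9000),
]

def eligible_idle_states(max_latency_us):
    """List non-operational states whose entry + exit latency fits the tolerance.

    Assumption: this mirrors how the latency tolerance filters APST states;
    it is not taken from the kernel source.
    """
    return [state for state, operational, entry_us, exit_us in POWER_STATES
            if not operational and entry_us + exit_us <= max_latency_us]

print(eligible_idle_states(100000))
print(eligible_idle_states(5000))
print(eligible_idle_states(0))
```

Under that assumption a tolerance of 0 leaves no state eligible, which is the usual way to rule APST out while testing.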

There's no NAT w/ IPv6, your router might very much still block inbound traffic, though.
You can throw eg. https://pentest-tools.com/network-vulne … nline-nmap at it and see what's open.


#7 2026-01-13 23:30:48

cryptearth
Member
Registered: 2024-02-03
Posts: 1,962


seth wrote:

smart data looks unsuspicious, see whether it's just APST.

I'll report back if anything changes

off-top

seth wrote:

There's no NAT w/ IPv6, your router might very much still block inbound traffic, though.
You can throw eg. https://pentest-tools.com/network-vulne … nline-nmap at it and see what's open.

ok, didn't know that
anyway - the linked site doesn't allow IPv6 (claims "invalid ip") - but I did a quick test from my root server: according to lsof my pc only has SSH (tcp/22) and LLMNR (tcp/5355) open (both ipv4 and ipv6) to public (*:22, *:5355) - all other listen ports are bound to localhost anyway
when I try to connect from my external root server I just get a permission denied - unless I set up a port forwarding in my router's interface - then I'm able to connect
so it seems that my router does have a proper ipv6 firewall which blocks incoming traffic unless I set a specific rule - it can even redirect incoming ports (say ssh from public 2022 to internal 22) - but I guess that's down to my specific router rather than standard (fyi: avm 6660 cable, FW 8.21)
I do have the option to set my "modem-router-wifi" combo into bridge mode and I have an sbc with multiple nics - maybe gonna do some experiments when I find time for it
// off-top
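The test from an external server described above boils down to a plain TCP connect attempt. A minimal sketch of such a probe, assuming Python is available on the external machine; the function name is hypothetical and the address in the usage comment is a documentation placeholder, not a real host:

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    socket.create_connection resolves the host first, so IPv4 and IPv6
    literals both work. Refused, filtered and timed-out attempts all
    count as "not open" here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage from the external machine (placeholder address):
#   tcp_port_open("2001:db8::1", 22)
```

Run once with and once without the rule in the router, it shows whether the router filters inbound IPv6 for that port.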

