
#1 2025-10-31 16:38:35

Prouk
Member
From: Paris
Registered: 2025-10-31
Posts: 2

NVME controller down, hangs on probably heavy load ("AAA" game)

Hello,
I'm not really used to asking for help anywhere, so tell me if I haven't included enough information about the issue I'm facing.

For some time now, whenever I try some heavy gaming (Borderlands 4, Arc Raiders), both of my NVMe drives disconnect almost instantly (it happens every time).
First the NVMe where the game is installed goes down, then milliseconds later the system one too.
To get a good dmesg output I have to ssh in, as the whole computer just hangs and a forced shutdown corrupts the journald file.

After some searching on "controller down" issues, I tried a power management kernel setting (pcie_aspm=off).

So here I throw my bottle into the sea, along with some info about my setup.

Kernel boot line :

options root=PARTUUID=1427d43a-9760-4711-89c8-ea350ddd1d05 zswap.enabled=0 rw rootfstype=ext4 init_on_alloc=1 vsyscall=emulate clearcpuid=514 pcie_aspm=off quiet

( init_on_alloc is there for the latest nvidia driver, and the problem occurs even without it; clearcpuid is there for an error in some wine programs )
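
Side note for anyone comparing boot lines: the kernel message in the dmesg excerpt further down suggests three parameters together (nvme_core.default_ps_max_latency_us=0, pcie_aspm=off, pcie_port_pm=off), while the boot line above only carries pcie_aspm=off. Below is a minimal Python sketch that lists which of those suggested parameters are absent from a given command line. The names `SUGGESTED` and `missing_params` are made up for illustration, and whether adding the missing parameters actually helps here is untested.

```python
# Parameters taken from the kernel's own "controller is down" hint in dmesg.
SUGGESTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
}


def missing_params(cmdline: str) -> list[str]:
    """Return the suggested key=value pairs that are absent from cmdline."""
    present = {}
    for token in cmdline.split():
        # Split on the first "=" only, so root=PARTUUID=... keeps its value.
        key, sep, value = token.partition("=")
        if sep:
            present[key] = value
    return [f"{k}={v}" for k, v in SUGGESTED.items() if present.get(k) != v]


# Boot line copied from the post above (the leading "options" word is ignored).
boot_line = (
    "options root=PARTUUID=1427d43a-9760-4711-89c8-ea350ddd1d05 "
    "zswap.enabled=0 rw rootfstype=ext4 init_on_alloc=1 vsyscall=emulate "
    "clearcpuid=514 pcie_aspm=off quiet"
)
print(missing_params(boot_line))
```

On a running system the same function could be fed the contents of /proc/cmdline instead of the pasted string.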

MB : ASUS Rog Strix X570-E GAMING (BIOS v 5031)

CPU : Ryzen 9 5900x

NVMEs :
  - system : MP600 1 TB
  - games : MP600 2 TB

GPU : NVIDIA 3080ti

Linux kernel v: 6.17.5

dmesg : Pastebin

It's not shown in this dmesg because I cut the ssh session too soon, but the system NVMe gets a "controller is down" at almost the same time.

small part of it :

[Oct31 16:13] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[  +0.000004] nvme nvme1: Does your device have a faulty power saving mode enabled?
[  +0.000002] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
[  +0.021197] nvme1n1: Read(0x2) @ LBA 30659072, 32 blocks, Host Aborted Command (sct 0x3 / sc 0x71)
[  +0.000004] I/O error, dev nvme1n1, sector 30659072 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 2
[  +0.011797] nvme 0000:04:00.0: enabling device (0000 -> 0002)
[  +0.000071] nvme nvme1: Disabling device after reset failure: -19
[  +0.015929] Buffer I/O error on dev nvme1n1p1, logical block 73, lost async page write
[  +0.000006] Aborting journal on device nvme1n1p1-8.
[  +0.000003] EXT4-fs error (device nvme1n1p1) in ext4_reserve_inode_write:6313: Journal has aborted
[  +0.000001] Buffer I/O error on dev nvme1n1p1, logical block 9463, lost async page write
[  +0.000001] Buffer I/O error on dev nvme1n1p1, logical block 243826688, lost sync page write
[  +0.000003] JBD2: I/O error when updating journal superblock for nvme1n1p1-8.
[  +0.000000] Buffer I/O error on dev nvme1n1p1, logical block 305659916, lost async page write
[  +0.000002] EXT4-fs error (device nvme1n1p1): ext4_dirty_inode:6517: inode #65544863: comm kworker/u97:7: mark_inode_dirty error
[  +0.000002] Buffer I/O error on dev nvme1n1p1, logical block 306184201, lost async page write
[  +0.000002] EXT4-fs error (device nvme1n1p1) in ext4_dirty_inode:6518: Journal has aborted
[  +0.000009] EXT4-fs error (device nvme1n1p1) in ext4_reserve_inode_write:6313: Journal has aborted
[  +0.000001] EXT4-fs error (device nvme1n1p1): mpage_map_and_submit_extent:2536: inode #65544863: comm kworker/u97:7: mark_inode_dirty error
[  +0.000002] EXT4-fs error (device nvme1n1p1): mpage_map_and_submit_extent:2538: comm kworker/u97:7: Failed to mark inode 65544863 dirty
[  +0.000005] EXT4-fs error (device nvme1n1p1) in ext4_do_writepages:2944: Journal has aborted
[  +0.000002] EXT4-fs warning (device nvme1n1p1): ext4_end_bio:368: I/O error 10 writing to inode 65544863 starting block 306317056)
[  +0.000003] EXT4-fs (nvme1n1p1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 65544863, error -5)
[  +0.000004] Buffer I/O error on device nvme1n1p1, logical block 306316800
[  +0.000003] Buffer I/O error on device nvme1n1p1, logical block 306316801
[  +0.000002] Buffer I/O error on device nvme1n1p1, logical block 306316802
[  +0.000010] Buffer I/O error on dev nvme1n1p1, logical block 0, lost sync page write
[  +0.000002] EXT4-fs (nvme1n1p1): I/O error while writing superblock
[  +0.000661] EXT4-fs error (device nvme1n1p1): ext4_journal_check_start:87: comm WebSocketClient: Detected aborted journal
[  +0.000008] Buffer I/O error on dev nvme1n1p1, logical block 0, lost sync page write
[  +0.000003] EXT4-fs (nvme1n1p1): I/O error while writing superblock
[  +0.000002] EXT4-fs (nvme1n1p1): Remounting filesystem read-only
[  +0.010903] EXT4-fs (nvme1n1p1): shut down requested (2)
[  +0.091924] EXT4-fs warning (device nvme1n1p1): htree_dirblock_to_tree:1051: inode #65544780: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.000060] EXT4-fs warning (device nvme1n1p1): htree_dirblock_to_tree:1051: inode #65544780: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.006535] EXT4-fs warning (device nvme1n1p1): htree_dirblock_to_tree:1051: inode #65536148: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.002553] EXT4-fs warning (device nvme1n1p1): htree_dirblock_to_tree:1051: inode #65536148: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +1.126885] EXT4-fs warning (device nvme1n1p1): htree_dirblock_to_tree:1051: inode #65536148: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.002362] EXT4-fs warning (device nvme1n1p1): dx_probe:791: inode #65536265: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.000014] EXT4-fs warning (device nvme1n1p1): dx_probe:791: inode #65536265: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.000015] EXT4-fs warning (device nvme1n1p1): dx_probe:791: inode #65536265: lblock 0: comm PioneerGame.exe: error -5 reading directory block
[  +0.000008] EXT4-fs warning (device nvme1n1p1): dx_probe:791: inode #65536265: lblock 0: comm PioneerGame.exe: error -5 reading directory block
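
For sifting through a longer dmesg capture, here is a small Python sketch that pulls out the "controller is down" lines and flags an all-ones CSTS value. An all-ones register read usually suggests the device stopped answering on the PCIe bus, rather than reporting a real status, but treat that reading as an assumption and not a diagnosis. `PATTERN` and `controller_down_events` are illustrative names, and the regex only targets the message format shown in the excerpt above.

```python
import re

# Matches the nvme driver message format seen in the excerpt above.
PATTERN = re.compile(
    r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
    r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
)


def controller_down_events(log_text: str) -> list[dict]:
    """Extract controller name and register values from dmesg text."""
    events = []
    for match in PATTERN.finditer(log_text):
        csts = int(match.group("csts"), 16)
        events.append({
            "controller": match.group("ctrl"),
            "csts": csts,
            "pci_status": int(match.group("pci"), 16),
            # All bits set: the register read likely never reached the device.
            "all_ones": csts == 0xFFFFFFFF,
        })
    return events


sample = (
    "[Oct31 16:13] nvme nvme1: controller is down; will reset: "
    "CSTS=0xffffffff, PCI_STATUS=0x10"
)
for event in controller_down_events(sample):
    print(event)
```

Running it over a full capture (for example one saved over ssh before the hang) would show whether both controllers report the same values.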

smartctl result :

  - System NVME

=== START OF INFORMATION SECTION ===
Model Number:                       Force MP600
Serial Number:                      2014823000012856314E
Firmware Version:                   EGFM11.3
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 33525718aa
Local Time is:                      Fri Oct 31 17:21:52 2025 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.78W       -        -    0  0  0  0        0       0
 1 +     6.75W       -        -    1  1  1  1        0       0
 2 +     5.23W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    158,231,679 [81.0 TB]
Data Units Written:                 136,781,795 [70.0 TB]
Host Read Commands:                 676,623,709
Host Write Commands:                332,336,534
Controller Busy Time:               2,043
Power Cycles:                       3,006
Power On Hours:                     17,252
Unsafe Shutdowns:                   320
Media and Data Integrity Errors:    0
Error Information Log Entries:      7,323
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       7323     0  0x0014  0x4004  0x028            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error               17249            -     -   -   -    -
 1   Extended          Completed without error               17248            -     -   -   -    -
 2   Short             Completed without error               17248            -     -   -   -    -

  - Game NVME

=== START OF INFORMATION SECTION ===
Model Number:                       Force MP600
Serial Number:                      20178229000128555772
Firmware Version:                   EGFM11.3
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 34c241d9dc
Local Time is:                      Fri Oct 31 17:21:38 2025 CET
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.78W       -        -    0  0  0  0        0       0
 1 +     6.75W       -        -    1  1  1  1        0       0
 2 +     5.23W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    5%
Data Units Read:                    84,246,199 [43.1 TB]
Data Units Written:                 92,575,868 [47.3 TB]
Host Read Commands:                 1,089,717,566
Host Write Commands:                1,169,340,761
Controller Busy Time:               2,571
Power Cycles:                       3,193
Power On Hours:                     17,265
Unsafe Shutdowns:                   506
Media and Data Integrity Errors:    0
Error Information Log Entries:      8,792
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       8792     0  0x000c  0x4004  0x028            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error               17261            -     -   -   -    -
 1   Extended          Completed without error               17261            -     -   -   -    -
 2   Short             Completed without error               17261            -     -   -   -    -

It might be a hardware problem, but I ran some controller tests with the BIOS utility and smartctl, which all pass successfully.

I don't know what I'm missing.

To whoever reads this, thank you XD and have a good night

Last edited by Prouk (2025-10-31 17:03:28)


I'll try my best, probably


#2 2025-10-31 17:42:48

cryptearth
Member
Registered: 2024-02-03
Posts: 1,828

Re: NVME controller down, hangs on probably heavy load ("AAA" game)

does the issue also happen with LTS kernel?


#3 2025-11-01 08:05:48

Prouk
Member
From: Paris
Registered: 2025-10-31
Posts: 2

Re: NVME controller down, hangs on probably heavy load ("AAA" game)

cryptearth wrote:

does the issue also happen with LTS kernel?

I'll try this today or tomorrow, sadly I won't have much time to test today


I'll try my best, probably

