NVME Issues [SOLVED]

Wodger · 2023-12-17 03:11:03

Good day all,

I'm trying to troubleshoot some strange behavior on my Samsung 980 Pro, which sits in the primary NVMe slot (PCIe Gen 4) of an Asus ROG Strix B550-E board. Arch Linux is installed on this drive.

I have a 970 Evo Plus in the 2nd NVMe slot (PCIe Gen 3).

Both NVMe drives run the latest firmware from Samsung and have been checked for errors with Samsung Magician under Windows; both passed without issues.

While gaming through the Heroic launcher with the game and prefix located on the 980 Pro, the NVMe drive fails after a while and the system goes unresponsive. After temporarily moving the journal to the 970 Evo (otherwise there are no logs to record), I got this:

Dec 17 00:01:45 WodgerArch kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Dec 17 00:01:45 WodgerArch kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
Dec 17 00:01:45 WodgerArch kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Dec 17 00:01:45 WodgerArch kernel: nvme 0000:01:00.0: enabling device (0000 -> 0002)
Dec 17 00:01:45 WodgerArch kernel: nvme nvme0: Disabling device after reset failure: -19
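
For anyone who wants to pull these events out of a longer journal, here is a minimal Python sketch that scans journal text for the "controller is down" message shown above. The function name and the dict layout are made up for illustration, and the regular expression is only based on the one log line quoted here, so other kernel versions may word the message differently. A CSTS value of all ones usually suggests that the device stopped answering on the bus, rather than reporting a real status.

```python
import re

# Matches the kernel's "controller is down" message as quoted in the post above.
RESET_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
    r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
)


def find_controller_resets(journal_text):
    """Return one dict per 'controller is down' line found in journal_text."""
    events = []
    for line in journal_text.splitlines():
        match = RESET_RE.search(line)
        if match is None:
            continue
        csts = int(match.group("csts"), 16)
        events.append({
            "controller": match.group("ctrl"),
            "csts": csts,
            "pci_status": int(match.group("pci"), 16),
            # CSTS is a 32-bit register; reading all ones usually means the
            # device no longer responds at all (hedged interpretation).
            "all_ones": csts == 0xFFFFFFFF,
        })
    return events


# Two of the lines from the post, used as sample input.
SAMPLE = (
    "Dec 17 00:01:45 WodgerArch kernel: nvme nvme0: controller is down; "
    "will reset: CSTS=0xffffffff, PCI_STATUS=0x10\n"
    "Dec 17 00:01:45 WodgerArch kernel: nvme nvme0: Disabling device after "
    "reset failure: -19\n"
)

if __name__ == "__main__":
    for event in find_controller_resets(SAMPLE):
        print(event)
```

The same function could be fed the saved output of journalctl for an affected boot to see how often, and at what times, the controller dropped.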

So I disabled the power saving as the journal suggested, and on reboot the journal confirmed that the states were disabled, but the drive still drops off after some time gaming (~15-20 minutes).

I monitored the drive with psensor: its temperature peaked at about 55 °C, nowhere near the ~78 °C slowdown threshold, and the SMART listing had not recorded it ever being in that range. What I did see in there regarding the power states was:
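
One way to double-check that the parameters really reached the running kernel is to look at /proc/cmdline. The following is a minimal sketch, not a definitive tool: the helper names are hypothetical, the wanted values are simply the two parameters from the kernel message above, and the naive whitespace split does not handle quoted values that contain spaces.

```python
from pathlib import Path

# The two parameters suggested by the kernel message quoted in this post.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline_text, wanted=WANTED):
    """Return the wanted parameters that are absent or set to another value."""
    params = parse_cmdline(cmdline_text)
    return {k: v for k, v in wanted.items() if params.get(k) != v}


def running_cmdline():
    """Read the command line of the running kernel (Linux only)."""
    return Path("/proc/cmdline").read_text()
```

Calling missing_params(running_cmdline()) on the affected machine should return an empty dict when both parameters are active. An empty result only shows that the parameters were passed, not that the drive honours them.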

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Although the power states are disabled, I thought that power states were not meant to overlap. For comparison, here is the 970 Evo:

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000
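
To compare the two tables more systematically, here is a small Python sketch that parses rows in the layout shown above and lists which non-operational states would fit a given latency budget. It rests on assumptions: the column layout is taken only from the two tables in this post, the helper names are hypothetical, and the rule "entry plus exit latency must not exceed nvme_core.default_ps_max_latency_us" is my understanding of how the kernel picks states for autonomous power state transitions, so treat it as an approximation.

```python
# The 980 Pro table from this post, used as sample input.
SAMPLE_980_PRO = """\
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500
"""


def parse_power_states(table_text):
    """Parse the power state rows; header and title lines are skipped."""
    states = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have 11 columns and start with the numeric state index.
        if len(fields) != 11 or not fields[0].isdigit():
            continue
        states.append({
            "state": int(fields[0]),
            "operational": fields[1] == "+",
            "max_watts": float(fields[2].rstrip("W")),
            "entry_us": int(fields[9]),   # Ent_Lat, in microseconds
            "exit_us": int(fields[10]),   # Ex_Lat, in microseconds
        })
    return states


def apst_candidates(states, max_latency_us):
    """Non-operational states whose entry + exit latency fits the budget."""
    return [
        s["state"]
        for s in states
        if not s["operational"]
        and s["entry_us"] + s["exit_us"] <= max_latency_us
    ]


if __name__ == "__main__":
    parsed = parse_power_states(SAMPLE_980_PRO)
    for budget in (0, 5000, 100000):
        print(budget, apst_candidates(parsed, budget))
```

With a budget of 0, as set by nvme_core.default_ps_max_latency_us=0, no non-operational state qualifies, which matches the journal reporting the states as disabled.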

If I move the game off the drive and leave only the prefix there, it still fails. With both the game and the prefix on the 970 Evo I have no issue.
I have run 3 games so far (1 with the game itself on the drive, all with their prefixes on it). 1 game has no issues (prefix only on the 980 Pro); the other 2 fail, and interestingly both are UE5 games.
I will test a few more games once I am back at the PC. I will also try swapping the drives between the slots (the 980 Pro will lose performance, but it might help to figure out whether the slot or the drive itself is at fault).

Any thoughts on this would be helpful.

Thanks,

Pete

Last edited by Wodger (2023-12-20 03:35:36)

seth · 2023-12-17 09:49:29

https://wiki.archlinux.org/title/Solid_ … nd_support
Try "iommu=soft" and please post your complete system journal for an affected boot:

sudo journalctl -b -1 | curl -F 'file=@-' 0x0.st

for the previous one.

Wodger · 2023-12-17 12:25:53

Hello again mate,

So as requested, here are 2 affected boots, with kernel parameters used and time of failure:

Dec 17 20:04:02 - nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Dec 17 20:15:12 - nvme_core.default_ps_max_latency_us=0 pcie_aspm=off iommu=soft

Both failures happened while running "Satisfactory" (U8 EA branch) and loading into a modded save: the failure occurs ~15-20 seconds after loading.

Regards,

Pete

seth · 2023-12-17 15:38:31

Dec 17 20:12:15 archlinux kernel: perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
Dec 17 20:12:15 archlinux kernel: AMD-Vi: AMD IOMMUv2 loaded and initialized

Try "amd_iommu=off" or "amd_iommu=fullflush", also

Dec 17 20:12:16 archlinux systemd-fsck[355]: /dev/nvme0n1p2: clean, 628062/61022208 files, 30718240/244059136 blocks
Dec 17 20:12:16 archlinux kernel: EXT4-fs (nvme0n1p2): mounted filesystem bc47a9b9-2548-4368-9c8a-1b07df3f2e14 r/w with ordered data mode. Quota mode: none.
Dec 17 20:12:16 WodgerArch kernel: EXT4-fs (nvme0n1p2): re-mounted bc47a9b9-2548-4368-9c8a-1b07df3f2e14 r/w. Quota mode: none.
Dec 17 20:12:17 WodgerArch kernel: FAT-fs (nvme0n1p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
Dec 17 20:12:17 WodgerArch kernel: EXT4-fs (nvme1n1p1): mounted filesystem 3363d5e6-f368-44c3-a9a2-588b2dd62a63 r/w with ordered data mode. Quota mode: none.

and

Dec 17 20:12:17 WodgerArch systemd[1]: Mounted /home/wodger/Drives/H.

Is there maybe a parallel Windows installation around?

Wodger · 2023-12-17 16:29:38

Hi,

I will try those commands shortly.

Regarding Windows: nah, it's been gone for ages (it only exists in a VM now). Actually, there is a Windows install sitting in a USB enclosure (860 EVO, 2.5"), but it is not mounted or in fstab and is only used when the Windows VM is booted (it is from the Mrs's laptop, which died recently, and there is no money to replace it).

That drive (H) is the 970 EVO Plus in the 2nd slot. It is my high-speed game drive, and it is also where the journal is going temporarily so that I have some idea about the crashes.

The crash also doesn't shut the PC down: it hangs, and I end up having to hit the restart button because the whole system is unresponsive.

Pete

Last edited by Wodger (2023-12-17 16:33:53)

seth · 2023-12-17 16:38:52

Try https://wiki.archlinux.org/title/Keyboa … el_(SysRq)
If windows is only used in a VM it should™ be irrelevant.

Wodger · 2023-12-17 17:18:09

seth wrote:

Try https://wiki.archlinux.org/title/Keyboa … el_(SysRq)

> Fsck issue sorted.

> Added SysRq into system - will see if that helps if system goes unresponsive.

seth wrote:

...If windows is only used in a VM it should™ be irrelevant.

It's really late for me now, so I will pick this up tomorrow afternoon.

Cheers again for the help.

Pete

Wodger · 2023-12-20 03:35:20

Hi all,

Well, what a journey. After trying all the suggestions from @seth the issue was still there, so I swapped the NVMe drives between the slots and didn't suffer a crash. I suspect it most likely comes down to something on the mainboard or to power.

I reset the BIOS in case that was the issue - nope. So I started thinking about when this issue first appeared: it all began when the Mrs's Windows laptop died and I transferred her SATA SSD to this machine.

Long story short: the SSD had been in an external USB adapter because an NVMe drive in the 2nd M.2 slot disables 2 SATA ports. By moving the 970 Evo back into a PCIe adapter in slot 2 under the GPU and returning the SSD to a SATA port (now that the M.2 slot is empty), the problem has vanished.

Very weird, but I'm happy the issue is now gone.

Pete
