Hi everyone,
Long-time Arch user, first-time poster. Until now, I've never had an issue that I couldn't find the solution to on the wiki or somewhere else. I have a Dell Latitude 5580 with an Intel Core i3-7100U, no discrete GPU, and a 512 GB Intel 660p SSD. Up to kernel 6.1, literally every hardware feature of this computer worked perfectly, including my LUKS full-disk encryption. With kernel 6.1 and every version since (we're up to 6.1.8 and the problem persists), however, waking up from sleep breaks the OS entirely. This manifests as a broken screen locker and, if I can get into a second terminal, "Input/output error" as the output of every command. The only way out is to force the computer off with the power button. I have tried adding `iommu=soft` to the kernel command line, a commonly recommended fix for this kind of issue, and it makes no difference. I can tell this is an NVMe issue because the journal---which doesn't survive the crash, unfortunately---contains errors from the NVMe driver, which then cascade into ext4 errors. The main NVMe error I've seen is (or is very close to) "nvme: unable to change power state from d3cold to d0".
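(For anyone else trying to capture the errors: the journal can be made persistent so it survives reboots, though with the disk throwing I/O errors it may still fail to flush. The standard setup:
sudo mkdir -p /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald
)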
What could the issue be, and how could I fix it? So far, as a stopgap, I've been running upgrades and then forcing the kernel back to version 6.0.9, which does not have this problem. A friend of mine with a Dell XPS 15 from a few years ago doesn't have this issue---could it be a hardware problem in my machine?
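For anyone wanting to do the same, this is roughly how I pin it (the exact package filename is an assumption; use whatever 6.0.9 package is actually in your pacman cache):
sudo pacman -U /var/cache/pacman/pkg/linux-6.0.9.arch1-1-x86_64.pkg.tar.zst
# then hold it back by adding to /etc/pacman.conf:
# IgnorePkg = linux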
Offline
There seems to be one similar report, with some attempts at narrowing down the problem, on the kernel mailing lists:
https://lore.kernel.org/all/20230104150 … @bhelgaas/
https://bugzilla.kernel.org/show_bug.cgi?id=216877
Last edited by progandy (2023-01-28 22:02:10)
Offline
What about linux-lts? Presumably that works?
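If you haven't tried it yet, it installs alongside the regular kernel; you just need a boot entry pointing at the LTS images:
sudo pacman -S linux-lts linux-lts-headers
# boot entry should use vmlinuz-linux-lts and initramfs-linux-lts.img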
Offline
Yes, linux-lts works. The issue is definitely traceable to 6.1, and that mailing list patch looks like the exact issue I'm having. I did some further research, and the error I was seeing comes from the PCIe subsystem, not the NVMe driver itself---it's an error message that gets passed upward to the relevant device driver.
Offline
This is just for reference and to help anybody else with this problem, but I captured the relevant kernel error messages that appear upon waking up while running the non-working kernel:
...
ACPI: PM: Waking up from system sleep state S3
ACPI: EC: interrupt unblocked
iwlwifi 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible
nvme 0000:03:00.0: Unable to change power state from D3hot to D0, device inaccessible
ACPI: EC: event unblocked
...
nvme 0000:03:00.0: Unable to change power state from D3cold to D0, device inaccessible
nvme nvme0: Removing after probe failure status: -19
nvme0n1: detected capacity change from 1000215216 to 0
...
Offline
As of today, it looks like the change in question is going to be reverted.
Offline
Looks like it's been fixed (or something's been changed, because the issue is gone) in either 6.1.11 or 6.1.12, I'm not sure which.
Offline
Hi
I'm not using Arch Linux for this issue, but I'm posting here because it's the only forum where people can easily find technical information without doing git archaeology in the kernel tree or reviewing every mailing list.
I encountered the issue, starting with Linux 6.1, on a production server that must never go into the D3 state.
The issue is still present from Linux 6.1 through 6.12-rc6 (mainline at the time of writing).
From my research, it's a combination of a faulty Intel PCIe chipset and faulty Dell or ASUS (or more vendors?) ACPI firmware that sets 'StorageD3Enable=1'.
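If you want to check whether your firmware sets this property, you can disassemble the DSDT and search for it (needs the acpica tools; a sketch, run as root):
acpidump -b                       # dump ACPI tables as .dat files into the current directory
iasl -d dsdt.dat                  # disassemble the DSDT into dsdt.dsl
grep -i StorageD3Enable dsdt.dsl  # see whether and where the property is declared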
First, note that PCIe ASPM is completely disabled in the UEFI BIOS of the affected server, but since Linux 6.1 the downstream PCIe switch seems not to care and applies the D3cold state to downstream NVMe devices anyway (after 3 years without any issue with the same PCIe switch, a PLX PEX 8748, the exact same NVMe devices, and ASPM disabled in the BIOS).
For the moment, on the Debian production server with a PEX 8748 card that has had the issue since Linux 6.1, I wrote a 'dirty workaround', since everything else failed, notably adding 'pcie_aspm=off pcie_port_pm=off nvme_core.default_ps_max_latency_us=0' to the kernel command line (it seemed to have no effect even after a day of uptime).
After a resume, I would still find in the logs that the D3cold state had been applied to the NVMe drives, resulting in the total loss of the storage:
kernel: nvme 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme 0000:0a:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
This breaks the system completely: as you can see, the md device has lost all 4 NVMe drives ("[____]") of its RAID5 array:
grep -A1 md127 /proc/mdstat
md127 : active raid5 nvme1n1p1[1](F) nvme3n1p1[4](F) nvme2n1p1[2](F) nvme0n1p1[0](F)
12830758912 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/0] [____]
Whereas I should have:
md127 : active raid5 nvme1n1p1[1] nvme3n1p1[4] nvme2n1p1[2] nvme0n1p1[0]
12830758912 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
While waiting for a kernel fix, I wrote a dirty workaround that reverts the configuration made by ACPI which enables D3cold for NVMe devices (a bit like Microsoft Windows does: it keeps NVMe in the D0 state during ASPM).
Here is my workaround:
1) Thoroughly identify the PCIe bus and the PCIe downstream port (if a PCIe switch like a PLX card is connected to the PCIe root port), and identify the PCIe bus address of the bound NVMe drive (lspci -vvt | grep -v Xeon and the tree command in sysfs are your friends here; see the sketch after this list)
2) Write a script that disables the D3cold capability for the PCIe downstream port (the PLX card's child port, if any)
3) Extend this script to disable the D3cold capability on the NVMe drives attached to the PCIe child port
4) Execute the script at reboot (via crontab or a systemd service; as it's a quick-and-dirty workaround, I'm using crontab at the moment)
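For step 1, this is the kind of thing I mean (the addresses are from this server; yours will differ):
lspci -vvt | grep -v Xeon                               # tree view: root port -> PLX switch -> endpoints
ls /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/   # child ports of the PLX upstream port
readlink -f /sys/class/nvme/nvme0/device                # PCI address behind a given NVMe controller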
Here is my script:
##############################################
#!/bin/bash
#
#
# 20241104: written by nbanba
# NBA: fixing the new ACPI power management issue:
# NVMe drives go to D3cold and never come back to D0
#
# Since Linux 6.1, a buggy ACPI default puts NVMe drives in the D3COLD state!
# The NVMe drives never come back to the D0 state (they never wake up again)
# error:
# nvme nvme3: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
# nvme nvme3: Does your device have a faulty power saving mode enabled?
# nvme nvme3: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
# nvme 0000:0b:00.0: Unable to change power state from D3cold to D0, device inaccessible
# nvme nvme3: Disabling device after reset failure: -19
# After this failure, it becomes impossible to wake up the NVMe drives without rebooting
# Example (NOT FROM THIS SERVER - from a laptop with PCIe ASPM enabled):
# echo "0000:08:00.0" > /sys/bus/pci/drivers/nvme/unbind
# echo "0000:02:00.0" > /sys/bus/pci/drivers/pcieport/unbind
# setpci -s 08:00.0 CAP_PM+4.b=03 # D3hot
# acpidbg -b "execute \_SB.PC00.PEG0.PXP._OFF"
# acpidbg -b "execute \_SB.PC00.PEG0.PXP._ON"
# #debian cmd# acpiexec -b "execute \_SB.PC00.PEG0.PXP._OFF"
# #debian cmd# acpiexec -b "execute \_SB.PC00.PEG0.PXP._ON"
# setpci -s 08:00.0 CAP_PM+4.b=0 # D0
# echo "0000:02:00.0" > /sys/bus/pci/drivers/pcieport/bind
# echo "0000:08:00.0" > /sys/bus/pci/drivers/nvme/bind
#
# The NVMe drive never wakes up here:
# nvme 0000:08:00.0: not ready 65535ms after bus reset; giving up
# nvme 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
# nvme 0000:08:00.0: timed out waiting for pending transaction; performing function level reset anyway
# nvme 0000:08:00.0: not ready 1023ms after FLR; waiting
# nvme 0000:08:00.0: not ready 2047ms after FLR; waiting
# ...
# nvme 0000:08:00.0: not ready 65535ms after FLR; giving up
# ...
# nvme 0000:08:00.0: not ready 32767ms after bus reset; waiting
# nvme 0000:08:00.0: not ready 65535ms after bus reset; giving up
#
#
sleep 60
echo "------------------------------------------------------"
echo "DISABLE D3COLD ON NVME U2"
echo "------------------------------------------------------"
echo " "
echo "Disable PEX 8748 d3cold for NVME drives"
echo "PEX 8748 has 8 U2 PCIE Gen3x4 port"
echo "1st port is 5 Port SATA controler"
echo "2nd port is 5 Port SATA controler"
echo "3rd port is empty - availiable for an U2 NVME drive"
echo "4th port is empty - availiable for an U2 NVME drive"
echo "5th port is a 16T NVME U2 drive"
echo "6th port is a 16T NVME U2 drive"
echo "7th port is a 16T NVME U2 drive"
echo "8th port is a 16T NVME U2 drive"
echo "ONLY DISABLE D3COLD ON THE 4 LAST DRIVES"
echo " "
echo "------------------------------------------------------"
echo "TODO ON THE 2 U2 FREE PORT AFTER PLUGGING NEW NVME DRIVE"
echo "------------------------------------------------------"
echo " "
_PLX_BUS='/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0'
_SATA0_PORT='0000:03:08.0'
_SATA1_PORT='0000:03:09.0'
_NVME4_PORT='0000:03:0a.0' # UNUSED - FREE PORT
_NVME5_PORT='0000:03:0b.0' # UNUSED - FREE PORT
_NVME0_PORT='0000:03:10.0'
_NVME1_PORT='0000:03:11.0'
_NVME2_PORT='0000:03:12.0'
_NVME3_PORT='0000:03:13.0'
_SATA0_DEV='0000:04:00.0'
_SATA1_DEV='0000:05:00.0'
_NVME4_DEV='0000:06:00.0' # NOT BOUND
_NVME5_DEV='0000:07:00.0' # NOT BOUND
_NVME0_DEV='0000:08:00.0'
_NVME1_DEV='0000:09:00.0'
_NVME2_DEV='0000:0a:00.0'
_NVME3_DEV='0000:0b:00.0'
# First, disable D3cold on the PCIe ports with NVMe drives plugged in
# - the FREE ports will need to be added once populated with NVMe drives
# For the moment, we only consider the 4 populated NVMe ports
# num_port => 4
echo "DISABLE D3COLD ON PLX CHILD PCIE PORT:"
echo 0 >"$_PLX_BUS/$_NVME0_PORT/d3cold_allowed"
echo 0 >"$_PLX_BUS/$_NVME1_PORT/d3cold_allowed"
echo 0 >"$_PLX_BUS/$_NVME2_PORT/d3cold_allowed"
echo 0 >"$_PLX_BUS/$_NVME3_PORT/d3cold_allowed"
# Control
echo "$_PLX_BUS/$_NVME0_PORT/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME0_PORT/d3cold_allowed"
echo "$_PLX_BUS/$_NVME1_PORT/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME1_PORT/d3cold_allowed"
echo "$_PLX_BUS/$_NVME2_PORT/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME2_PORT/d3cold_allowed"
echo "$_PLX_BUS/$_NVME3_PORT/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME3_PORT/d3cold_allowed"
# Second, disable D3cold on the connected NVMe drives
# - the unbound NVMe devices will need to be added once the ports are populated
# For the moment, we only consider the 4 populated NVMe ports
# num_dev => 4
echo "DISABLE D3COLD ON NVME DRIVE ATTACHED TO PLX CHILD PCIE PORT:"
echo 0 >"$_PLX_BUS/$_NVME0_PORT/$_NVME0_DEV/d3cold_allowed"
echo 0 >"$_PLX_BUS/$_NVME1_PORT/$_NVME1_DEV/d3cold_allowed"
echo 0 >"$_PLX_BUS/$_NVME2_PORT/$_NVME2_DEV/d3cold_allowed"
echo 0 >"$_PLX_BUS/$_NVME3_PORT/$_NVME3_DEV/d3cold_allowed"
# Control
echo "$_PLX_BUS/$_NVME0_PORT/$_NVME0_DEV/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME0_PORT/$_NVME0_DEV/d3cold_allowed"
echo "$_PLX_BUS/$_NVME1_PORT/$_NVME1_DEV/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME1_PORT/$_NVME1_DEV/d3cold_allowed"
echo "$_PLX_BUS/$_NVME2_PORT/$_NVME2_DEV/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME2_PORT/$_NVME2_DEV/d3cold_allowed"
echo "$_PLX_BUS/$_NVME3_PORT/$_NVME3_DEV/d3cold_allowed:"
cat "$_PLX_BUS/$_NVME3_PORT/$_NVME3_DEV/d3cold_allowed"
#
echo -e "\n---------------------------------------------"
echo "SUCESSFULLY DISABLED D3COLD ON NVME U2"
echo "---------------------------------------------"
##############################################
Now here is the cron line (in /etc/cron.d/nod3coldnvme):
@reboot root /etc/nvme_disable_d3cold.sh | logger -p daemon.info -t NVME_PM
And the output in syslog (or the journal):
CRON[4340]: (root) CMD (/etc/nvme_disable_d3cold.sh |logger -p daemon.info -t NVME_PM)
NVME_PM: ------------------------------------------------------
NVME_PM: DISABLE D3COLD ON NVME U2
NVME_PM: ------------------------------------------------------
NVME_PM:
NVME_PM: Disabling PEX 8748 D3cold for the NVMe drives
NVME_PM: The PEX 8748 has 8 U.2 PCIe Gen3 x4 ports
NVME_PM: 1st port is a 5-port SATA controller
NVME_PM: 2nd port is a 5-port SATA controller
NVME_PM: 3rd port is empty - available for a U.2 NVMe drive
NVME_PM: 4th port is empty - available for a U.2 NVMe drive
NVME_PM: 5th port is a 16T NVMe U.2 drive
NVME_PM: 6th port is a 16T NVMe U.2 drive
NVME_PM: 7th port is a 16T NVMe U.2 drive
NVME_PM: 8th port is a 16T NVMe U.2 drive
NVME_PM: ONLY DISABLE D3COLD ON THE LAST 4 DRIVES
NVME_PM:
NVME_PM: ------------------------------------------------------
NVME_PM: TODO ON THE 2 FREE U.2 PORTS AFTER PLUGGING IN NEW NVME DRIVES
NVME_PM: ------------------------------------------------------
NVME_PM:
NVME_PM: DISABLE D3COLD ON PLX CHILD PCIE PORT:
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:10.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:11.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:12.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:13.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: DISABLE D3COLD ON NVME DRIVE ATTACHED TO PLX CHILD PCIE PORT:
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:10.0/0000:08:00.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:11.0/0000:09:00.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:12.0/0000:0a:00.0/d3cold_allowed:
NVME_PM: 0
NVME_PM: /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:13.0/0000:0b:00.0/d3cold_allowed:
NVME_PM: 0
NVME_PM:
NVME_PM: ---------------------------------------------
NVME_PM: SUCCESSFULLY DISABLED D3COLD ON NVME U2
NVME_PM: ---------------------------------------------
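If you prefer a systemd service over cron, an equivalent oneshot unit could look like this (untested sketch, same script path as above):
# /etc/systemd/system/nvme-disable-d3cold.service
[Unit]
Description=Disable D3cold on NVMe ports behind the PLX switch

[Service]
Type=oneshot
ExecStart=/etc/nvme_disable_d3cold.sh
StandardOutput=journal

[Install]
WantedBy=multi-user.target
Enable it with 'systemctl enable nvme-disable-d3cold.service'.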
It seems to work: this morning I didn't have to reboot this (big) server, and I didn't find any log lines of the following type in the journal:
kernel: nvme 0000:08:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
Hope this helps people like me who run into serious trouble when their production servers lose their NVMe storage!
Please also note that, from all my days of testing, I have the feeling that a reboot IS NOT ENOUGH to change the ACPI behavior of PCIe power management.
In another thread on this forum (also about D3cold), someone said that, for obscure reasons, you should reboot at least twice; that may explain this behavior.
I recommend not rebooting, but instead using the out-of-band controller to power the server off completely, waiting 5 to 10 minutes (if possible, electrically unplug the server), and then starting it again.
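For example, with an IPMI-capable BMC (host and credentials are placeholders):
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power off
# wait 5 to 10 minutes (electrically unplugged if possible), then:
ipmitool -I lanplus -H <bmc-host> -U <user> -P <password> chassis power on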
Kind regards
nbanba
Offline
Hi
It's now been 2 days and the NVMe disks haven't gone into the D3cold state.
The dirty workaround I posted here yesterday seems to be working.
A look at sysfs this morning shows that nothing on the machine has changed the d3cold_allowed values back to 1. Perfect!
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:10.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:11.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:12.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:13.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:10.0/0000:08:00.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:11.0/0000:09:00.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:12.0/0000:0a:00.0/d3cold_allowed
0
$ cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:13.0/0000:0b:00.0/d3cold_allowed
0
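A one-liner to re-check all eight values at once (same paths as above, relying on bash globbing):
grep -H . /sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0/0000:03:1[0-3].0/{,0000:0[89ab]:00.0/}d3cold_allowed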
So OK, it's dirty, but it seems to work well enough while waiting for a kernel patch.
Kind regards
nbanba
Offline
I'll split this off, since the chance that you really still have the same issue almost 2 years later is quite slim.
Split from https://bbs.archlinux.org/viewtopic.php?id=283139
On rereading: since you're not actually using Arch Linux, I'll close this, as the chance that the information here still helps Arch users is quite slim. While you did some good research, it's unlikely to still be relevant to Arch users. If anyone can still reproduce this on a current Arch system, feel free to open a new thread. The OP mentioned it was fixed: https://bbs.archlinux.org/viewtopic.php … 1#p2085501
Last edited by V1del (2024-11-06 19:47:03)
Offline