#1 2023-08-11 06:13:00

maxxie
Member
Registered: 2023-08-11
Posts: 3

[SOLVED] Enabling dGPU causes NVMe drive failure after suspend & resume

Update: The latest BIOS update fixed this issue.

Hi everyone.
I have an HP ZBook Power laptop with a Ryzen 7 7840HS CPU and an RTX 4050 Max-Q GPU. It has been running Arch fine with the dGPU disabled in the BIOS.
Yesterday I did a full system upgrade, then enabled the dGPU and installed the NVIDIA proprietary driver. The dGPU worked fine, but I then discovered that a suspend & resume cycle causes the NVMe drive to be remounted read-only. The system is unusable afterwards, and the system journal is not written to disk.

I was able to save the output of "journalctl -b" using a USB drive: https://0x0.st/H_wy.txt

Excerpt from around line 4553:

...
Aug 11 10:12:22 zbook.local kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Aug 11 10:12:22 zbook.local kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
Aug 11 10:12:22 zbook.local kernel: nvme nvme0: Disabling device after reset failure: -19
Aug 11 10:12:22 zbook.local kernel: nvme0n1: detected capacity change from 2000409264 to 0
Aug 11 10:12:22 zbook.local kernel: EXT4-fs warning (device nvme0n1p2): ext4_end_bio:343: I/O error 10 writing to inode 1835329 starting block 4375776)
Aug 11 10:12:22 zbook.local kernel: Buffer I/O error on device nvme0n1p2, logical block 3326944
...
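To find these messages in a saved journal without scrolling to the right line, a small filter can help. This is only a sketch in Python: the regular expressions are modeled on the messages in the excerpt above and are an assumption, not an exhaustive list of NVMe failure signatures, and `find_nvme_failures` is a hypothetical helper name.

```python
import re

# Patterns modeled on the kernel messages in the excerpt above.
# They are illustrative only and will not catch every NVMe failure.
FAILURE_PATTERNS = [
    re.compile(r"nvme \S+: Unable to change power state from D3cold to D0"),
    re.compile(r"nvme nvme\d+: Disabling device after reset failure"),
    re.compile(r"nvme\d+n\d+: detected capacity change from \d+ to 0\b"),
]


def find_nvme_failures(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file works as well as a list of strings, for example `find_nvme_failures(open("journal.txt"))` for a saved `journalctl -b` dump (the file name is hypothetical).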

Things I've tried (in this order):
1. Downgrade kernel version to the previous one (6.4.9-arch1-1 -> 6.4.6.arch1-1)
2. Remove the NVIDIA proprietary driver
3. Add kernel parameter `amd_iommu=fullflush`[1]
4. Disable dGPU in BIOS, the issue is gone
5. Enable dGPU in BIOS again, the issue is back again
6. Switch to linux-lts
7. Switch to linux-mainline
8. Disable dGPU in BIOS and switch back to stable kernel, the issue is gone again

Info of the NVMe drive:

# smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.4.9-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVL41T0HBLB-00BH1
Serial Number:                      S6B7NJ0W271680
Firmware Version:                   HPS3NHAV
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Controller ID:                      7
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization:            539,495,309,312 [539 GB]
Namespace 1 Formatted LBA Size:     4096
Namespace 1 IEEE EUI-64:            002538 e23142a09a
Local Time is:                      Fri Aug 11 13:49:28 2023 CST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.94W       -        -    0  0  0  0        0       0
 1 +     5.29W       -        -    1  1  1  1        0       0
 2 +     2.86W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3      200    2800
 4 -   0.0050W       -        -    4  4  4  4     4000   19000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 +    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    1,383,365 [708 GB]
Data Units Written:                 2,426,192 [1.24 TB]
Host Read Commands:                 13,970,420
Host Write Commands:                29,602,523
Controller Busy Time:               9
Power Cycles:                       548
Power On Hours:                     340
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               33 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                 336            -     -   -   -    -
 1   Short             Completed without error                   3            -     -   -   -    -
 2   Short             Completed without error                   0            -     -   -   -    -
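The report above shows no drive-side errors. To check the same counters after each test without reading the whole report, something like the following Python sketch could be used. The choice of counters in `ERROR_COUNTERS` is an assumption made for illustration, and both function names are hypothetical; the parser only handles the plain `Key: value` lines of the text output shown above.

```python
# Counters that, when non-zero, suggest the drive itself logged a problem.
# This selection is an assumption made for the sketch.
ERROR_COUNTERS = (
    "Critical Warning",
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def parse_smart_fields(report):
    """Collect 'Key: value' lines from `smartctl -a` text output into a dict."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def nonzero_error_counters(fields):
    """Return the names of the tracked counters whose value is not zero."""
    flagged = []
    for name in ERROR_COUNTERS:
        raw = fields.get(name)
        if raw is None:
            continue  # counter missing from this report
        # Base 0 accepts both "0x00" and plain decimals such as "9".
        if int(raw.replace(",", ""), 0) != 0:
            flagged.append(name)
    return flagged
```

An empty list from `nonzero_error_counters` means none of the tracked counters moved, which is what the report above shows.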

I keep the dGPU disabled for now and it's working fine, but is there anything else that I can try? Any help would be appreciated.

--
[1]: Recommended by this Arch wiki article: https://wiki.archlinux.org/title/Solid_ … nd_support

Last edited by maxxie (2023-10-02 19:09:11)

#2 2023-08-11 06:44:44

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,005

Sounds like a firmware/mainboard bug. Try getting a UEFI update?

#3 2023-08-11 07:28:26

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 72,582

Aug 11 10:01:39 zbook.local kernel: DMI: HP HP ZBook Power 15.6 inch G10 A Mobile Workstation PC/8B95, BIOS V85 Ver. 01.01.06 05/18/2023

If there's no firmware update, or it doesn't help (try that first!):

Aug 11 10:11:16 zbook.local kernel: PM: suspend entry (s2idle)

No S3?
Did you try "iommu=soft" and "nvme_core.default_ps_max_latency_us=0"?
Is this on battery or AC? Does it matter?

#4 2023-08-11 10:21:34

maxxie
Member
Registered: 2023-08-11
Posts: 3

Thank you for the replies!

V1del wrote:

Sounds like a firmware/mainboard bug. Try getting a UEFI update?

I've just updated the BIOS to the newest version.

Aug 11 17:47:04 zbook.local kernel: DMI: HP HP ZBook Power 15.6 inch G10 A Mobile Workstation PC/8B95, BIOS V85 Ver. 01.01.10 07/05/2023

Unfortunately the issue is still there.

seth wrote:

No S3?

It seems so:

$ cat /sys/power/mem_sleep
[s2idle]

seth wrote:

Did you try "iommu=soft" and "nvme_core.default_ps_max_latency_us=0"?

I just tried them, didn't help.
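When testing parameters like these, it is worth confirming that they actually reached the running kernel by looking at `/proc/cmdline`. A minimal Python sketch of that check follows; `WANTED`, `parse_cmdline` and `missing_params` are hypothetical names, and the parser assumes no quoted values containing spaces.

```python
# Parameters suggested earlier in the thread, with the values to look for.
WANTED = {
    "iommu": "soft",
    "nvme_core.default_ps_max_latency_us": "0",
}


def parse_cmdline(text):
    """Split a kernel command line into a dict; bare flags map to None.

    Assumes no quoted values with embedded spaces. If a key appears more
    than once, the last occurrence is kept.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(text, wanted=WANTED):
    """Return the wanted keys that are absent or set to a different value."""
    params = parse_cmdline(text)
    return [key for key, value in wanted.items() if params.get(key) != value]
```

On a running system this could be called as `missing_params(open("/proc/cmdline").read())`; an empty list means both parameters are active.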

seth wrote:

Is this on battery or AC? Does it matter?

It's the same on battery and AC power.

Last edited by maxxie (2023-08-11 10:24:43)

#5 2023-08-11 14:19:21

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 72,582

Still no S3 after the BIOS update?
Can you switch between internal, hybrid and dedicated GPU in the BIOS?
Does the dedicated GPU alone also cause this?

#6 2023-08-11 14:34:00

maxxie
Member
Registered: 2023-08-11
Posts: 3

Still no S3. There's only s2idle even after the BIOS update.
The BIOS only allows me to switch between UMA and hybrid graphics, so I'm not able to verify if the dedicated GPU alone can cause this.
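For reference, the s2idle-only result comes from `/sys/power/mem_sleep`, where the kernel lists the supported suspend variants and puts the active one in square brackets; on ACPI systems S3 shows up there as `deep`. A small Python sketch of reading that file, with hypothetical function names:

```python
from pathlib import Path


def parse_mem_sleep(text):
    """Split the contents of /sys/power/mem_sleep into (active, supported)."""
    active = None
    supported = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active mode
            active = token
        supported.append(token)
    return active, supported


def s3_available(path="/sys/power/mem_sleep"):
    """Return True if the kernel offers 'deep' (S3 on ACPI systems)."""
    _, supported = parse_mem_sleep(Path(path).read_text())
    return "deep" in supported
```

With the `[s2idle]` output from this machine, `parse_mem_sleep` reports `s2idle` as both the active and the only supported mode.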

Last edited by maxxie (2023-08-11 14:38:14)
