Hi,
since upgrading to 6.17 (both 6.17.6 and 6.17.7-arch1) I get sporadic ATA timeouts on **every** SATA device attached to the chipset AHCI controller. NVMe (M.2) is unaffected.
Hardware
--------
CPU: AMD Ryzen 7 9700X
MB: MSI MAG B850M MORTAR WIFI (BIOS A.30)
SATA: AMD 600-series AHCI (00:0d.0 1022:43f6), ports 0-3
Disks: WDC WD20SPZX-22U 2 TB (XFS, /mnt/volume_data_2)
Kingston SKC6002 2 TB (btrfs, /mnt/volume_data_3)
Samsung 850 EVO 500 GB (NTFS, data only)
All disks pass SMART.
Kernel messages (typical burst)
-------------------------------
ata1: hard resetting link
ata1.00: exception Emask 0x0 SAct 0x40120000 SErr 0x50000 action 0x6 frozen
ata1: SError: { PHYRdyChg CommWake }
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/20:88:.../40 tag 17 ncq dma 16384 in
res 40/00:00:00:4f:c2/... Emask 0x4 (timeout)
ata1.00: status: { DRDY }
... (identical for tags 20,30) ...
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
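Side note for anyone reading these dumps: SAct is a 32-bit bitmask of the NCQ tags still outstanding when the port froze, so 0x40120000 decodes to tags 17, 20 and 30, exactly the commands the EH reports as failed. A quick shell sketch of the decode (value taken from the burst above):

```shell
# SAct is a bitmask of outstanding NCQ tags; print which bits are set.
sact=$((0x40120000))
for tag in $(seq 0 31); do
  if (( (sact >> tag) & 1 )); then
    printf 'tag %d outstanding\n' "$tag"
  fi
done
# prints tags 17, 20 and 30
```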
Frequency: 1–3 bursts per hour under light desktop load.
Boot-time timeouts on /dev/disk/by-uuid/... are also seen occasionally.
Conclusion
----------
This looks like a plain regression in the 6.17 AHCI/NCQ code path on AMD 600-series chipsets.
Upstream commit 74f9e3ab2c75 ("ahci: fix ncq timeout on AMD gen5") already addresses this and is included in 6.18-rc1.
Request: please back-port the above patch to 6.17-arch until 6.18 reaches [core].
Hi everyone,
I'm the OP from the original thread back in November. Unfortunately, the issue still persists even after upgrading to kernel 6.18.2-arch2-1, so I'm seeking further assistance.
## Current Status
**Kernel:** 6.18.2-arch2-1 (confirmed the patch "ahci: fix ncq timeout on AMD gen5" 74f9e3ab2c75 is included)
**Issue Frequency:** 1-2 times per week under light desktop load (browsing files in Dolphin, occasional data access). The freezes last 2-3 seconds before the system recovers.
**Disks Affected:** All SATA devices connected to the AMD 600-series AHCI controller (ports 0-3). NVMe devices are completely unaffected.
**SMART Status:** All disks report healthy SMART data - this is NOT a failing drive issue.
## Hardware Configuration
- **CPU:** AMD Ryzen 7 9700X
- **MB:** MSI MAG B850M MORTAR WIFI (BIOS A.30)
- **SATA Controller:** AMD 600-series AHCI (00:0d.0 1022:43f6)
- **Disks:**
- Samsung 850 EVO 500GB (NTFS, data only)
- Kingston SKC6002 2TB (btrfs, /mnt/volume_data_3)
- WDC WD20SPZX-22U 2TB (XFS, /mnt/volume_data_2)
- **PSU:** PCCooler YS1000 1000W 80+ Gold
- **Cables:** Stock SATA cables included with motherboard
## Typical Error Log (from journalctl -k)
```
[46522.333202] ata1.00: exception Emask 0x0 SAct 0x300001 SErr 0x50000 action 0x6 frozen
[46522.333208] ata1: SError: { PHYRdyChg CommWake }
[46522.333209] ata1.00: failed command: READ FPDMA QUEUED
[46522.333210] ata1.00: cmd 60/20:00:a0:30:3a/00:00:af:00:00/40 tag 0 ncq dma 16384 in
[46522.333213] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[46522.333219] ata1.00: status: { DRDY }
[46522.333220] ata1: hard resetting link
[46522.798195] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[46522.800787] ata1.00: configured for UDMA/133
[46522.800914] I/O error, dev sda, sector 85877912 op 0x0:(READ) flags 0x80700 phys_seg 13 prio class 2
```
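As a side note, the failed sector in that last line converts to a byte offset as follows (assuming 512-byte logical sectors, which `blockdev --getss /dev/sda` can confirm):

```shell
# Logical sector from the I/O error line, at 512 bytes per logical sector.
sector=85877912
offset=$(( sector * 512 ))            # byte offset on the disk
gib=$(( offset / (1024*1024*1024) ))  # whole GiB into the device
echo "$offset bytes (~${gib} GiB in)"
# prints: 43969490944 bytes (~40 GiB in)
```

so the read that timed out was roughly 40 GiB into sda, nowhere near the end of the device.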
## Questions for the Community
1. For those who experienced this on AMD 600-series, did the 6.18 kernel actually resolve it, or did you need additional workarounds?
2. Is there a way to disable NCQ only for specific disks rather than globally? I have mixed SSDs and HDDs and would prefer to keep NCQ on the SSDs.
3. Could this be related to the specific SATA cable quality or PSU power delivery issues, despite SMART being clean?
4. Are there any other kernel parameters or udev rules that have proven effective for this specific combination?
The issue is reproducible enough to be annoying, but not frequent enough to make daily debugging easy. Any insights would be greatly appreciated.
It seems that linux-lts is still on the 6.12 series; can you switch to it and check whether it works better?
If it does, that at least confirms it's a kernel regression.
If it doesn't help, but downgrading to 6.16 does, then the cause is one of the relatively few commits that were backported, and it should be easier to track down.
If both 6.12 and 6.16 are broken, then it's some hardware change, and no need to worry about software.
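Regarding your question 2 in the meantime: NCQ can be turned off per device rather than globally. Writing 1 to /sys/block/sdX/device/queue_depth disables NCQ for that disk only, and a udev rule makes it stick across reboots. A sketch, with the filename and the model match as assumptions (check the exact string with `cat /sys/block/sdX/device/model` first):

```
# /etc/udev/rules.d/60-noncq.rules  (hypothetical filename)
# Match the WD spinner by model string and pin its queue depth to 1,
# which disables NCQ for that device only; the SSDs keep theirs.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="*WD20SPZX*", ATTR{device/queue_depth}="1"
```

There is also a per-port kernel parameter form, libata.force=1.00:noncq, which applies only to device 00 on ata1 and leaves the other ports alone.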