I've been having problems since Kernel 6.9.1.
Linux fenrir 6.9.1-arch1-2 #1 SMP PREEMPT_DYNAMIC Wed, 22 May 2024 13:47:07 +0000 x86_64 GNU/Linux
I thought my SSD was failing or needed new firmware, but the latest firmware was already installed. After I swapped the first SSD for a second one, I discovered that the same errors were coming up on the exact same Linux kernel, along with nVidia driver problems. These errors only began with Kernel 6.9.1.
«Logs showing that two different devices throw similar errors — an INTEL SSD and a CRUCIAL SSD — https://pastebin.com/H2n0uHzj»
«lshw output with second tested normally functioning SSD in place — https://pastebin.com/ZspaYvKT»
A CRUCIAL BX500 2.5" SSD 240GB CT240BX500SSD1 FW M6CR054 and an INTEL SSDSC2CT240A4 FW 335u, with mirrored filesystems, produce similar errors and long pauses in video playback that coincide with those errors.
It is very likely that this is due to a timeout on the
Status: DRDY :: Error ata2.00: failed command: WRITE FPDMA QUEUED
when videos are playing. This is an old server that I now use as an entertainment system.
I will also post a long stretch of logs, from two days before the switch from the old SSD to the tested second SSD: https://gendersplice.com/paste/72-hours-log.txt and the
$ dmesg
output: https://gendersplice.com/paste/2024-05-25_dmesg.txt
I've been working in *nix since 1994 and I haven't seen this kind of crossover on divergent devices, although they are both SATA-connected SSD devices.
Any ideas anyone?
Last edited by splurben (2024-05-26 07:46:36)
Which CPU does this system have?
Also, you wrote that these errors began with 6.9.1; did you check whether the problem is also present on linux-lts or linux 6.8.9?
https://bbs.archlinux.org/viewtopic.php?id=296102
There've been several recent incidents of overoptimistic med_power_with_dipm usage…
Which CPU does this system have?
Also, you wrote that these errors began with 6.9.1; did you check whether the problem is also present on linux-lts or linux 6.8.9?
I have not yet tried to downgrade the kernel; I will probably get to that today.
I have also found a spare SATA cable which I will swap in today. I'd be embarrassed if it was the cable! I just moved house, though, and couldn't find a spare until now.
*-cpu
description: CPU
product: Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz
vendor: Intel Corp.
physical id: 4
bus info: cpu@0
version: 6.45.7
slot: SOCKET 0
size: 3799MHz
capacity: 4GHz
width: 64 bits
clock: 100MHz
capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d cpufreq
configuration: cores=4 enabledcores=4 microcode=1818 threads=8
*-cache:0
description: L1 cache
physical id: 5
slot: CPU Internal L1
size: 256KiB
capacity: 256KiB
capabilities: internal write-back
configuration: level=1
*-cache:1
description: L2 cache
physical id: 6
slot: CPU Internal L2
size: 1MiB
capacity: 1MiB
capabilities: internal write-back unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 7
slot: CPU Internal L3
size: 10MiB
capacity: 10MiB
capabilities: internal write-back unified
configuration: level=3
Replaced SATA cable. No change.
Downgraded kernel and headers to 6.8.1 and downgraded nVidia to nvidia-525.147.05 and everything is fixed!
So to summarise:
nVidia drivers stopped working at or near Kernel 6.9.1 and/or nVidia version 550. Downgraded to Kernel 6.8.1 and nVidia 525 and everything works.
SATA SSD devices on the SATA controller (Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05)) had lots of problems on Kernel 6.9.1. The downgrade removed these problems as well.
Last edited by splurben (2024-05-26 07:58:57)
Obviously a temporary solution.
Did you check the ALPM condition?
Did you check the ALPM condition?
There were no pacman updates for either package. I'd let it go for a few weeks as I was very busy. Nothing helped. I'll test new kernels every few weeks to see if it continues, and will post here if a kernel update works better than 6.8.1, but I can't have the whole computer lock up for 60 seconds every 3 minutes because of a problem with the new kernel. I'm not a kernel diagnostic technician.
Kernel 6.9.1 appears to break nVidia and some SATA SSD controllers. If anyone wants me to provide logs or other diagnostic data I'm happy to provide it. I'll be happy to go to the current kernel and provide live diagnostic data. Nonetheless, I need to be able to use my system normally and I'm not leaving my SSD and video rendition in strife otherwise.
Either what package?
V1del in the other thread linked https://wiki.archlinux.org/title/Power_ … Management and, as mentioned, recently a bunch of overly optimistic med_power_with_dipm defaults have shown up.
Edit: stray blank
grep . /sys/class/scsi_host/host*/link_power_management_policy
on a good and bad kernel.
(You can install 6.9 and LTS kernel in parallel)
There're no details on your nvidia related issues ITT but the 550xx drivers have known (and severe) bugs, there'd be https://aur.archlinux.org/packages/nvidia-535xx-dkms which should™ build w/ 6.9
Last edited by seth (2024-05-26 13:45:30)
Either what package?
V1del in the other thread linked https://wiki.archlinux.org/title/Power_ … Management and, as mentioned, recently a bunch of overly optimistic med_power_with_dipm defaults have shown up.
grep . /sys/class/scsi_host/host*/link_power_management_policy
on a good and bad kernel.
(You can install 6.9 and LTS kernel in parallel)
There're no details on your nvidia related issues ITT but the 550xx drivers have known (and severe) bugs, there'd be https://aur.archlinux.org/packages/nvidia-535xx-dkms which should™ build w/ 6.9
By the "SATA Active Link Power Management" (ALPM) reference — I had never experienced this problem until 6.9.1, and the referenced post has not been marked SOLVED.
There is no "link_power_management_policy" file assigned to
lrwxrwxrwx 1 root root 0 May 26 14:18 host6 -> ../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host6/scsi_host/host6
and the default in Arch is supposed to be 'Maximum Performance' not med_power_with_dipm in the absence of this configuration file.
The section in question is also labelled as: «screenshot not preserved»
I'd tried nvidia-535xx-dkms while on kernel 6.9.1 and I still had failed module loading until downgrade to 6.8.1 and nVidia 525xx-dkms.
nVidia Downgrade to 525xx fixed that problem instantly once the kernel was downgraded and not before.
These are not 'isolation diagnostics'. This is not a test system. I'm using it in a non-mission-critical functional situation. nVidia 550xx drivers are definitely completely screwed and Kernel 6.9.1 has some revised or stripped content that I'm not qualified to comment upon except to say: "It doesn't work with some SATA SSD hardware scenarios and definitely doesn't work at all with a lot of nVidia video adapters."
As I said, I'm happy to install latest kernel side-by-side and provide diagnostic data based on the latest kernel being booted and upgrade nVidia drivers for the same purposes, but I can only afford to do that on Sundays or Mondays AWST_UTC+8 Perth time as these are my days off.
Last edited by splurben (2024-05-26 09:09:37)
Random broad claims in the wiki are not authoritative against the data provided by the actual system. As mentioned, several systems have shown up with bogus defaults (NOT configured locally) of med_power_with_dipm.
Tbc, you don't *have* to do anything, but as you correctly figured, downgrading is "Obviously a temporary solution", and if you want an opinion on what you *could* do, it's to actually run that grep, because the error has all the hallmarks of a PM issue (idk. what your screenshot is meant to prove, but see the second entry in the blue note at the bottom of the paragraph)
Fwiw, and if there's *really* (don't believe, check) no ALPM support on your system, there's also been https://bbs.archlinux.org/viewtopic.php … 1#p2173651 which *may* cause this.
From the upstream kernel developer responsible for ATA in response to a similar issue:
We recently (kernel v6.9) enabled LPM for all AHCI controllers if:
-The AHCI controller reports that it supports LPM, and
-The drive reports that it supports LPM (DIPM), and
-CONFIG_SATA_MOBILE_LPM_POLICY=3, and
-The port is not defined as external in the per port PxCMD register, and
-The port is not defined as hotplug capable in the per port PxCMD register.
However, there appears to be some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working.
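On Arch the running kernel's build configuration is exposed at /proc/config.gz, so the CONFIG_SATA_MOBILE_LPM_POLICY condition above can be checked directly (a sketch; assumes CONFIG_IKCONFIG_PROC is enabled, as it is in the stock Arch kernel):

```shell
# Print the compiled-in default LPM policy; a value of 3 corresponds to
# med_power_with_dipm, i.e. the new 6.9 behaviour applies on this machine.
zgrep SATA_MOBILE_LPM_POLICY /proc/config.gz 2>/dev/null \
    || echo "kernel config not exposed on this system"
```

If this prints CONFIG_SATA_MOBILE_LPM_POLICY=3 on the 6.9 kernel, the new default is in effect regardless of what the wiki says the default should be.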
Do you want help contacting upstream or devising a quirk for your system?
Edit:
There is already an entry for a BX100:
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
You could base a quirk for your CT240BX500SSD1 on it such as:
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c449d60d9bb9..73c9c43cfcc0 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4183,6 +4183,9 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
+ /* Crucial BX500 SSD 240GB has broken LPM support */
+ { "CT240BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
+
/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
{ "Crucial_CT512MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
Last edited by loqs (2024-05-28 22:15:04)
Random broad claims in the wiki are not authoritative against the data provided by the actual system. As mentioned, several systems have shown up with bogus defaults (NOT configured locally) of med_power_with_dipm.
Tbc, you don't *have* to do anything, but as you correctly figured, downgrading is "Obviously a temporary solution", and if you want an opinion on what you *could* do, it's to actually run that grep, because the error has all the hallmarks of a PM issue (idk. what your screenshot is meant to prove, but see the second entry in the blue note at the bottom of the paragraph)
Fwiw, and if there's *really* (don't believe, check) no ALPM support on your system, there's also been https://bbs.archlinux.org/viewtopic.php … 1#p2173651 which *may* cause this.
Thank you for your candour!
I will investigate these new options at my next opportunity. I greatly appreciate your attention and interest.
From the upstream kernel developer responsible for ATA in response to a similar issue:
Niklas Cassel wrote: We recently (kernel v6.9) enabled LPM for all AHCI controllers if:
-The AHCI controller reports that it supports LPM, and
-The drive reports that it supports LPM (DIPM), and
-CONFIG_SATA_MOBILE_LPM_POLICY=3, and
-The port is not defined as external in the per port PxCMD register, and
-The port is not defined as hotplug capable in the per port PxCMD register.
However, there appears to be some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working.
Do you want help contacting upstream or devising a quirk for your system?
Edit:
There is already an entry for a BX100:
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
You could base a quirk for your CT240BX500SSD1 on it such as:
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c449d60d9bb9..73c9c43cfcc0 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4183,6 +4183,9 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
+ /* Crucial BX500 SSD 240GB has broken LPM support */
+ { "CT240BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
+
/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
{ "Crucial_CT512MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
Thank you for your candour!
I will investigate these options at my next opportunity. I greatly appreciate your attention and interest.
Would it help if I built a kernel for you with the proposed quirk applied?
I'm curious that this isn't considered top-shelf importance for a kernel revision. An unmodified kernel default is invalid, and two revisions of the kernel still contain this invalidity. I'm sure only 30% of users are on older hardware with SATA SSDs.
So, Crucial and Intel are considered cheap no-name brands of SSD? Interesting.
I will enable hot-plug on my SSD, according to the previous posts that should remove this bug that only affects "Cheap no-name brands like INTEL and CRUCIAL."
My first chance to look at this is my husband's 56th Birthday and our 25th Anniversary. So, not spending hours applying modifications. Not sure why this isn't important to upstream.
By the way, these are my SAS-attached BTRFS RAID drives; the INTEL SATA SSD is /sys/class/scsi_host/host6 and doesn't report med_power_with_dipm, hmmm:
~]$ grep . /sys/class/scsi_host/host*/link_power_management_policy
/sys/class/scsi_host/host0/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host1/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host2/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host3/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host4/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host5/link_power_management_policy:med_power_with_dipm
What else can I post to help with an integrated fix? Or, are we to understand it is simply not acceptable to use INTEL SATA-attached SSD storage in any case?
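One way to see which disk sits behind each hostN (and whether it's on the SATA or SAS path) is lsblk from util-linux; a sketch, with the column list just a suggestion:

```shell
# HCTL is host:channel:target:lun, so its first field maps each disk to a
# scsi_host; TRAN distinguishes the SATA SSD from the SAS-attached HDDs.
lsblk -o NAME,HCTL,TRAN,MODEL 2>/dev/null || echo "lsblk not available"
```

That should make it unambiguous whether host6 really is the INTEL SSD and that the six med_power_with_dipm hosts are the Seagate HDDs.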
Last edited by splurben (2024-06-02 22:24:49)
I'm curious that this isn't considered top-shelf importance for a kernel revision. An unmodified kernel default is invalid. Two revisions of the kernel still contain this invalidity. I'm sure only 30% of users are on older hardware with SATA SSDs
Have you reported your device to the upstream kernel developers so a quirk for the device can be added?
The problem isn't all sata ssd's but
some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working
so this needs quirks for broken hardware and those need to be reported to be added.
So, not spending hours applying modifications.
Add the rule in https://wiki.archlinux.org/title/Power_ … Management but of course use medium_power instead of med_power_with_dipm
Depending on how good you are at exiting vim, this is just gonna take a couple of seconds.
Happy birthday to your husband - and condolences for being married
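For reference, the udev rule on that wiki page takes roughly this shape (a sketch; the filename is illustrative, and per the advice above the policy is medium_power rather than med_power_with_dipm):

```
# /etc/udev/rules.d/hd_power_save.rules (filename illustrative)
ACTION=="add", SUBSYSTEM=="scsi_host", KERNEL=="host*", ATTR{link_power_management_policy}="medium_power"
```

After adding it, reload with udevadm control --reload and re-trigger (or just reboot) for the policy to apply.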
The problem isn't all sata ssd's but
some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working
so this needs quirks for broken hardware and those need to be reported to be added.
So, not spending hours applying modifications.
Add the rule in https://wiki.archlinux.org/title/Power_ … Management but of course use medium_power instead of med_power_with_dipm
Depending on how good you are at exiting vim, this is just gonna take a couple of seconds.
Happy birthday to your husband - and condolences for being married
Happily Married for 25 years. I understand that's rare. It was love at first sight as well which confounds many people. I vividly remember the exact moment I first locked eyes with my husband; I understand that fact is also rare and contentious for many people.
I understand this is only for us old guys (56) that have only been trained on and using UNIX since the early 90s and like to keep functioning usable hardware going; this system is 10 years old. I crashed my dad's INTERDATA 70 in 1972 at the age of 4 by pressing the big red button. Dad used to 'repair' punch-cards by cutting tiny bits of tape and replacing punch-card chads to fix octal and hexadecimal errors in machine code as the progenitor Director of Computer Science at Western Washington University on the ridiculously huge IBM 370 they had there.
However, the INTEL and CRUCIAL SSDs tested are relatively new (2-3 years), are certainly not cheap no-name brands, have their latest firmware, and report they are well within their normal life-span. Also, as you might see above, the only drives on this system reporting "med_power_with_dipm" are the six 4TB conventional Seagate HDDs attached via a PCIe SAS card, delivering 312MB/s on a 10-year-old machine; the SSD at /sys/class/scsi_host/host6 is not reporting DIPM at all. So, unless the MSI MB is buggy (and I'm not averse to that consideration, as MSI don't claim to give a shit about *NIX), the fundamental criticism of this argument is moot.
I can't change the setting for a drive that doesn't report it's employing the DIPM setting. There's something else going on here that started with kernel 6.9 at some stage.
I'm not working today as I now live in Western Australia and today is a Public Holiday. I am absolutely happy to provide further diagnostic data and spend some time trying various options that might lead to the eradication of this obvious kernel bug.
I will check this page every 30 minutes for the next 18 hours.
Last edited by splurben (2024-06-02 23:00:42)
linux-6.9.3.arch1 with the proposed patch from post #13.
https://drive.google.com/file/d/1mjfZUY … sp=sharing linux-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1HwkSJm … sp=sharing linux-headers-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
linux-6.9.3.arch1 with the proposed patch from post #13.
https://drive.google.com/file/d/1mjfZUY … sp=sharing linux-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1HwkSJm … sp=sharing linux-headers-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
I'm giving these a go. I will report the results.
https://git.kernel.org/pub/scm/linux/ke … 0b48c86b30 this commit would be for the 240GB variant of your drive?
Last edited by loqs (2024-06-03 22:28:02)
Same issue with 1TB variant, currently the temp fix is downgrade to 6.8.9
Same issue with 1TB variant, currently the temp fix is downgrade to 6.8.9
Have you reported the issue upstream?
Edit:
Does applying the following diff help?
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e1bf8a19b3c8..b70f42e75c4b 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4138,7 +4138,7 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
/* Crucial devices with broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
- { "CT240BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
+ { "CT*BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
{ "Crucial_CT512MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
Last edited by loqs (2024-06-23 15:08:41)