I've been having problems since Kernel 6.9.1.
Linux fenrir 6.9.1-arch1-2 #1 SMP PREEMPT_DYNAMIC Wed, 22 May 2024 13:47:07 +0000 x86_64 GNU/Linux
I thought my SSD was failing or needed new firmware, but the latest firmware was already installed. After I swapped the first SSD for a second one, I discovered that the same errors were coming up on the exact same Linux kernel, along with nVidia driver problems. These errors only began with Kernel 6.9.1.
«Logs showing that two different devices throw similar errors — an INTEL SSD and a CRUCIAL SSD — https://pastebin.com/H2n0uHzj»
«lshw output with second tested normally functioning SSD in place — https://pastebin.com/ZspaYvKT»
A CRUCIAL BX500 2.5" SSD 240GB CT240BX500SSD1 FW M6CR054 and an INTEL SSDSC2CT240A4 FW 335u, with mirrored filesystems, produce similar errors and long pauses in video playback that coincide with those errors.
It is very likely that this is due to a timeout on the
Status: DRDY :: Error ata2.00: failed command: WRITE FPDMA QUEUED
when videos are playing. This is an old server that I now use as an entertainment system.
I will also post a long stretch of logs, from two days before the switch from the old SSD to the tested second SSD: https://gendersplice.com/paste/72-hours-log.txt and the
$ dmesg
output: https://gendersplice.com/paste/2024-05-25_dmesg.txt
I've been working in *nix since 1994 and I haven't seen this kind of crossover on divergent devices, although they are both SATA-connected SSD devices.
Any ideas anyone?
Last edited by splurben (2024-05-26 07:46:36)
Which CPU does this system have?
Also, you wrote that these errors began with 6.9.1; did you check whether the problem is also present on linux-lts or linux 6.8.9?
https://bbs.archlinux.org/viewtopic.php?id=296102
There've been several recent incidents of overoptimistic med_power_with_dipm usage…
Which CPU does this system have?
Also, you wrote that these errors began with 6.9.1; did you check whether the problem is also present on linux-lts or linux 6.8.9?
I have not yet tried to downgrade the kernel; I will probably get to that today.
I have also found a spare SATA cable which I will swap in today. I'd be embarrassed if it was the cable! I just moved house, though, and couldn't find a spare until now.
*-cpu
description: CPU
product: Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz
vendor: Intel Corp.
physical id: 4
bus info: cpu@0
version: 6.45.7
slot: SOCKET 0
size: 3799MHz
capacity: 4GHz
width: 64 bits
clock: 100MHz
capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d cpufreq
configuration: cores=4 enabledcores=4 microcode=1818 threads=8
*-cache:0
description: L1 cache
physical id: 5
slot: CPU Internal L1
size: 256KiB
capacity: 256KiB
capabilities: internal write-back
configuration: level=1
*-cache:1
description: L2 cache
physical id: 6
slot: CPU Internal L2
size: 1MiB
capacity: 1MiB
capabilities: internal write-back unified
configuration: level=2
*-cache:2
description: L3 cache
physical id: 7
slot: CPU Internal L3
size: 10MiB
capacity: 10MiB
capabilities: internal write-back unified
configuration: level=3
Replaced SATA cable. No change.
Downgraded kernel and headers to 6.8.1 and downgraded nVidia to nvidia-525.147.05 and everything is fixed!
So to summarise:
nVidia drivers stopped working at or near Kernel 6.9.1 and/or nVidia version 550. Downgraded to Kernel 6.8.1 and nVidia 525 and everything works.
SATA SSD devices on the SATA controller (Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05)) had lots of problems on Kernel 6.9.1. The downgrade removed these problems as well.
Last edited by splurben (2024-05-26 07:58:57)
Obviously a temporary solution.
Did you check the ALPM condition?
Did you check the ALPM condition?
There were no pacman updates for either package. I'd let it go for a few weeks as I was very busy. Nothing helped. I'll test new kernels every few weeks to see if it continues, and will post here if a kernel update works better than 6.8.1, but I can't have the whole computer lock up for 60 seconds every 3 minutes because of a problem with the new kernel. I'm not a kernel diagnostic technician.
Kernel 6.9.1 appears to break nVidia and some SATA SSD controllers. If anyone wants me to provide logs or other diagnostic data I'm happy to provide it. I'll be happy to go to the current kernel and provide live diagnostic data. Nonetheless, I need to be able to use my system normally and I'm not leaving my SSD and video rendition in strife otherwise.
Either what package?
V1del in the other thread linked https://wiki.archlinux.org/title/Power_ … Management and, as mentioned, recently a bunch of overly optimistic med_power_with_dipm defaults have shown up.
Edit: stray blank
grep . /sys/class/scsi_host/host*/link_power_management_policy
on a good and bad kernel.
(You can install 6.9 and LTS kernel in parallel)
There're no details on your nvidia related issues ITT but the 550xx drivers have known (and severe) bugs, there'd be https://aur.archlinux.org/packages/nvidia-535xx-dkms which should™ build w/ 6.9
Last edited by seth (2024-05-26 13:45:30)
Either what package?
V1del in the other thread linked https://wiki.archlinux.org/title/Power_ … Management and, as mentioned, recently a bunch of overly optimistic med_power_with_dipm defaults have shown up.
grep . /sys/class/scsi_host/host*/link_power_management_policy
on a good and bad kernel.
(You can install 6.9 and LTS kernel in parallel)
There're no details on your nvidia related issues ITT but the 550xx drivers have known (and severe) bugs, there'd be https://aur.archlinux.org/packages/nvidia-535xx-dkms which should™ build w/ 6.9
By the "SATA Active Link Power Management" (ALPM) reference — I had never experienced this problem until 6.9.1, and the referenced post has not been marked SOLVED.
There is no "link_power_management_policy" file assigned to
lrwxrwxrwx 1 root root 0 May 26 14:18 host6 -> ../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host6/scsi_host/host6
and the default in Arch is supposed to be 'Maximum Performance' not med_power_with_dipm in the absence of this configuration file.
The section in question is also labelled as: «screenshot not preserved»
I'd tried nvidia-535xx-dkms while on kernel 6.9.1 and I still had failed module loading until downgrade to 6.8.1 and nVidia 525xx-dkms.
nVidia Downgrade to 525xx fixed that problem instantly once the kernel was downgraded and not before.
These are not 'isolation diagnostics'. This is not a test system. I'm using it in a non-mission-critical functional situation. nVidia 550xx drivers are definitely completely screwed and Kernel 6.9.1 has some revised or stripped content that I'm not qualified to comment upon except to say: "It doesn't work with some SATA SSD hardware scenarios and definitely doesn't work at all with a lot of nVidia video adapters."
As I said, I'm happy to install latest kernel side-by-side and provide diagnostic data based on the latest kernel being booted and upgrade nVidia drivers for the same purposes, but I can only afford to do that on Sundays or Mondays AWST_UTC+8 Perth time as these are my days off.
Last edited by splurben (2024-05-26 09:09:37)
Random broad claims in the wiki are not authoritative against the data provided by the actual system. As mentioned, several systems have shown up with bogus defaults (NOT configured locally) of med_power_with_dipm.
Tbc, you don't *have* to do anything, but as you correctly figured, downgrading is "Obviously a temporary solution", and if you want an opinion on what you *could* do, it's to actually run that grep, because the error has all the hallmarks of a PM issue (idk. what your screenshot is meant to prove, but see the second entry in the blue note at the bottom of the paragraph)
Fwiw, and if there's *really* (don't believe, check) no ALPM support on your system, there's also been https://bbs.archlinux.org/viewtopic.php … 1#p2173651 which *may* cause this.
From the upstream kernel developer responsible for ATA in response to a similar issue:
We recently (kernel v6.9) enabled LPM for all AHCI controllers if:
-The AHCI controller reports that it supports LPM, and
-The drive reports that it supports LPM (DIPM), and
-CONFIG_SATA_MOBILE_LPM_POLICY=3, and
-The port is not defined as external in the per port PxCMD register, and
-The port is not defined as hotplug capable in the per port PxCMD register.
However, there appears to be some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working.
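On Arch the running kernel's build configuration is exposed at /proc/config.gz, so the CONFIG_SATA_MOBILE_LPM_POLICY condition above can be checked directly (a sketch; assumes CONFIG_IKCONFIG_PROC is enabled, as it is in the stock Arch kernel):

```shell
# Print the compiled-in default LPM policy; a value of 3 corresponds to
# med_power_with_dipm, i.e. the new 6.9 behaviour applies on this machine.
zgrep SATA_MOBILE_LPM_POLICY /proc/config.gz 2>/dev/null \
    || echo "kernel config not exposed on this system"
```

If this prints CONFIG_SATA_MOBILE_LPM_POLICY=3 on the 6.9 kernel, the new default is in effect regardless of what the wiki says the default should be.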
Do you want help contacting upstream or devising a quirk for your system?
Edit:
There is already an entry for a BX100:
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
You could base a quirk for your CT240BX500SSD1 on it such as:
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c449d60d9bb9..73c9c43cfcc0 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4183,6 +4183,9 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
+ /* Crucial BX500 SSD 240GB has broken LPM support */
+ { "CT240BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
+
/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
{ "Crucial_CT512MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
Last edited by loqs (2024-05-28 22:15:04)
Random broad claims in the wiki are not authoritative against the data provided by the actual system. As mentioned, several systems have shown up with bogus defaults (NOT configured locally) of med_power_with_dipm.
Tbc, you don't *have* to do anything, but as you correctly figured, downgrading is "Obviously a temporary solution", and if you want an opinion on what you *could* do, it's to actually run that grep, because the error has all the hallmarks of a PM issue (idk. what your screenshot is meant to prove, but see the second entry in the blue note at the bottom of the paragraph)
Fwiw, and if there's *really* (don't believe, check) no ALPM support on your system, there's also been https://bbs.archlinux.org/viewtopic.php … 1#p2173651 which *may* cause this.
Thank you for your candour!
I will investigate these new options at my next opportunity. I greatly appreciate your attention and interest.
From the upstream kernel developer responsible for ATA in response to a similar issue:
Niklas Cassel wrote: We recently (kernel v6.9) enabled LPM for all AHCI controllers if:
-The AHCI controller reports that it supports LPM, and
-The drive reports that it supports LPM (DIPM), and
-CONFIG_SATA_MOBILE_LPM_POLICY=3, and
-The port is not defined as external in the per port PxCMD register, and
-The port is not defined as hotplug capable in the per port PxCMD register.
However, there appears to be some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working.
Do you want help contacting upstream or devising a quirk for your system?
Edit:
There is already an entry for a BX100:
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
You could base a quirk for your CT240BX500SSD1 on it such as:
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c449d60d9bb9..73c9c43cfcc0 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4183,6 +4183,9 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
/* Crucial BX100 SSD 500GB has broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
+ /* Crucial BX500 SSD 240GB has broken LPM support */
+ { "CT240BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
+
/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
{ "Crucial_CT512MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
Thank you for your candour!
I will investigate these options at my next opportunity. I greatly appreciate your attention and interest.
Would it help if I built a kernel for you with the proposed quirk applied?
I'm curious that this isn't considered top-shelf importance for a kernel revision. An unmodified kernel default is invalid, and two revisions of the kernel still contain this invalidity. I'm sure only 30% of users are on older hardware with SATA SSDs.
So, Crucial and Intel are considered cheap no-name brands of SSD? Interesting.
I will enable hot-plug on my SSD, according to the previous posts that should remove this bug that only affects "Cheap no-name brands like INTEL and CRUCIAL."
My first chance to look at this is my husband's 56th Birthday and our 25th Anniversary. So, not spending hours applying modifications. Not sure why this isn't important to upstream.
By the way, these are my SAS-attached BTRFS RAID drives; the INTEL SATA SSD is /sys/class/scsi_host/host6 and doesn't report med_power_with_dipm, hmmm:
~]$ grep . /sys/class/scsi_host/host*/link_power_management_policy
/sys/class/scsi_host/host0/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host1/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host2/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host3/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host4/link_power_management_policy:med_power_with_dipm
/sys/class/scsi_host/host5/link_power_management_policy:med_power_with_dipm
What else can I post to help with an integrated fix? Or, are we to understand it is simply not acceptable to use INTEL SATA-attached SSD storage in any case?
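One way to see which disk sits behind each hostN (and whether it's on the SATA or SAS path) is lsblk from util-linux; a sketch, with the column list just a suggestion:

```shell
# HCTL is host:channel:target:lun, so its first field maps each disk to a
# scsi_host; TRAN distinguishes the SATA SSD from the SAS-attached HDDs.
lsblk -o NAME,HCTL,TRAN,MODEL 2>/dev/null || echo "lsblk not available"
```

That should make it unambiguous whether host6 really is the INTEL SSD and that the six med_power_with_dipm hosts are the Seagate HDDs.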
Last edited by splurben (2024-06-02 22:24:49)
I'm curious that this isn't considered top-shelf importance for a kernel revision. An unmodified kernel default is invalid. Two revisions of the kernel still contain this invalidity. I'm sure only 30% of users are on older hardware with SATA SSDs
Have you reported your device to the upstream kernel developers so a quirk for the device can be added?
The problem isn't all sata ssd's but
some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working
so this needs quirks for broken hardware and those need to be reported to be added.
So, not spending hours applying modifications.
Add the rule in https://wiki.archlinux.org/title/Power_ … Management but of course use medium_power instead of med_power_with_dipm
Depending on how good you are at exiting vim, this is just gonna take a couple of seconds.
Happy birthday to your husband - and condolences for being married
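For reference, the udev rule on that wiki page takes roughly this shape (a sketch; the filename is illustrative, and per the advice above the policy is medium_power rather than med_power_with_dipm):

```
# /etc/udev/rules.d/hd_power_save.rules (filename illustrative)
ACTION=="add", SUBSYSTEM=="scsi_host", KERNEL=="host*", ATTR{link_power_management_policy}="medium_power"
```

After adding it, reload with udevadm control --reload and re-trigger (or just reboot) for the policy to apply.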
The problem isn't all sata ssd's but
some drives (usually cheap ones that we've never heard about) that reports that they support DIPM, but when actually turning it on, they stop working
so this needs quirks for broken hardware and those need to be reported to be added.
So, not spending hours applying modifications.
Add the rule in https://wiki.archlinux.org/title/Power_ … Management but of course use medium_power instead of med_power_with_dipm
Depending on how good you are at exiting vim, this is just gonna take a couple of seconds.
Happy birthday to your husband - and condolences for being married
Happily Married for 25 years. I understand that's rare. It was love at first sight as well which confounds many people. I vividly remember the exact moment I first locked eyes with my husband; I understand that fact is also rare and contentious for many people.
I understand this is only for us old guys (56) that have only been trained on and using UNIX since the early 90s and like to keep functioning usable hardware going; this system is 10 years old. I crashed my dad's INTERDATA 70 in 1972 at the age of 4 by pressing the big red button. Dad used to 'repair' punch-cards by cutting tiny bits of tape and replacing punch-card chads to fix octal and hexadecimal errors in machine code as the progenitor Director of Computer Science at Western Washington University on the ridiculously huge IBM 370 they had there.
However, the INTEL and CRUCIAL SSDs tested are relatively new (2-3 years), are certainly not cheap no-name brands, have their latest firmware, and report they are well within their normal life-span. Also, as you might see above, the only drives on this system reporting "med_power_with_dipm" are the six 4TB conventional Seagate HDDs attached via a PCIe SAS card, delivering 312MB/s on a 10-year-old machine; the SSD at /sys/class/scsi_host/host6 is not reporting DIPM at all. So, unless the MSI MB is buggy (and I'm not averse to that consideration, as MSI don't claim to give a shit about *NIX), the fundamental criticism of this argument is moot.
I can't change the setting for a drive that doesn't report it's employing the DIPM setting. There's something else going on here that started with kernel 6.9 at some stage.
I'm not working today as I now live in Western Australia and today is a Public Holiday. I am absolutely happy to provide further diagnostic data and spend some time trying various options that might lead to the eradication of this obvious kernel bug.
I will check this page every 30 minutes for the next 18 hours.
Last edited by splurben (2024-06-02 23:00:42)
linux-6.9.3.arch1 with the proposed patch from post #13.
https://drive.google.com/file/d/1mjfZUY … sp=sharing linux-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1HwkSJm … sp=sharing linux-headers-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
linux-6.9.3.arch1 with the proposed patch from post #13.
https://drive.google.com/file/d/1mjfZUY … sp=sharing linux-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
https://drive.google.com/file/d/1HwkSJm … sp=sharing linux-headers-6.9.3.arch1-1.1-x86_64.pkg.tar.zst
I'm giving these a go. I will report the results.
https://git.kernel.org/pub/scm/linux/ke … 0b48c86b30 this commit would be for the 240GB variant of your drive?
Last edited by loqs (2024-06-03 22:28:02)
Same issue with 1TB variant, currently the temp fix is downgrade to 6.8.9
Same issue with 1TB variant, currently the temp fix is downgrade to 6.8.9
Have you reported the issue upstream?
Edit:
Does applying the following diff help?
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index e1bf8a19b3c8..b70f42e75c4b 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4138,7 +4138,7 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
/* Crucial devices with broken LPM support */
{ "CT500BX100SSD1", NULL, ATA_HORKAGE_NOLPM },
- { "CT240BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
+ { "CT*BX500SSD1", NULL, ATA_HORKAGE_NOLPM },
/* 512GB MX100 with MU01 firmware has both queued TRIM and LPM issues */
{ "Crucial_CT512MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
Last edited by loqs (2024-06-23 15:08:41)