Hello,
I am replacing Windows with Arch Linux on an old Surface Pro 2 device.
I am receiving error messages from libata when I try to perform I/O on the SSD. The system seems to freeze, then I get the error messages, and then the disk seems to perform normally for a while before freezing again.
For instance:
# fdisk -l
[ 737.107982] ata1.00: exception Emask 0x0 SAct 0x100 SErr 0x50000 action 0x6 frozen
[ 737.108023] ata1: SError: { PHYRdyChg CommWake }
[ 737.108041] ata1.00: failed command: READ FPDMA QUEUED
[ 737.108057] ata1.00: cmd 60/08:40:00:00:00/00:00:00:00:00/40 tag 8 ncq dma 4096 in
[ 737.108057] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 737.108098] ata1.00: status: { DRDY }
[ 737.434619] I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
// followed by the expected `fdisk` output
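A note on reading the log above: the SErr value is a bit mask, and the names in braces are libata's labels for the bits that are set. Below is a minimal sketch for decoding it by hand. The bit-to-name table is written from memory of the SATA SError register layout and libata's labels, so treat it as an assumption to check against the kernel source; `decode_serr` is a hypothetical helper name.

```shell
#!/bin/sh
# Decode an SError value (as printed by libata) into short bit names.
# The "bit:name" table is an assumption recalled from the SATA SError
# layout and libata's labels; verify it against the kernel source.
decode_serr() {
    value=$(( $1 ))
    out=""
    for pair in 0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto \
                11:HostInt 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B \
                20:Dispar 21:BadCRC 22:Handshk 23:LinkSeq 24:TrStaTrns \
                25:UnrecFIS 26:DevExch; do
        bit=${pair%%:*}    # text before the colon: bit position
        name=${pair#*:}    # text after the colon: label
        if [ $(( value & (1 << bit) )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "$out"
}

decode_serr 0x50000
```

For 0x50000 this prints `PHYRdyChg CommWake`, the same pair as in the kernel line: bits 16 and 18 are set, and both are link-level (PHY) events.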
The disk might be dead; however, I had no disk issues on Windows. I was using the device daily without any noticeable freezes or disk issues just before trying to switch to Linux. Unfortunately, I wiped Windows, so I cannot go back and check now.
I tried to boot with
libata.force=noncq
with no luck.
Here is the smartctl output:
# smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.9-arch2-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG MZMPC064HBDR-000MV
Serial Number: S19ENEAD113868
LU WWN Device Id: 5 002538 043584d30
Firmware Version: CXM14M1Q
User Capacity: 64,023,257,088 bytes [64.0 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
TRIM Command: Available
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-2 T13/2015-D revision 2
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sat Nov 25 21:46:41 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 300) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   5) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2626
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       2750
177 Wear_Leveling_Count     0x0013   093   093   000    Pre-fail  Always       -       225
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   093   093   010    Pre-fail  Always       -       100
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   093   093   010    Pre-fail  Always       -       1532
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   057   040   000    Old_age   Always       -       43
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   253   253   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       63
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9635859630
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
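To put the Total_LBAs_Written raw value from the attribute table in perspective, here is a small worked conversion. It assumes the raw value counts 512-byte logical sectors, which the output does not state explicitly; `lbas_to_gb` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert a count of written LBAs to whole gigabytes (10^9 bytes).
# $1: LBA count; $2: bytes per LBA (assumed 512 when omitted)
lbas_to_gb() {
    echo $(( $1 * ${2:-512} / 1000000000 ))
}

lbas_to_gb 9635859630
```

Under that assumption the result is 4933, meaning about 4.9 TB written, or roughly 77 times the capacity of this 64 GB drive.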
I have also taken a look at possible firmware updates, but the SSD model seems quite uncommon. I am considering booting some kind of Windows trial in order to run Samsung Magician...
I would like to make 100% sure that the SSD is dead, and that this is not an issue with the kernel not handling the Surface hardware correctly. I saw nothing about disks on the linux-surface project wiki, which I'm following. If the disk is indeed dead, I would like to understand why booting Linux suddenly killed it.
Unfortunately, I have very little knowledge of hardware handling/debugging on Linux, especially in the topic of storage drives. I would appreciate your help in exploring this issue.
Thanks in advance,
Last edited by Itms (2024-04-09 08:35:45)
Offline
I am considering booting some kind of Windows trial in order to run Samsung Magician...
I was able to reinstall Windows and, thanks to the "digital license" thing I didn't know about, I am back with a working Windows install despite having an OEM copy. I ran Samsung Magician which proposes no firmware upgrade.
I confirm that the disk works without issue under Windows. The issue comes from the Linux kernel.
Firmware Version: CXM14M1Q
I discovered that this firmware code is "blacklisted" (I do not know what this term entails) in the libata source code: https://elixir.bootlin.com/linux/v6.6.2 … re.c#L4154
Please note that the disk model associated with the firmware in libata-core.c does not match my disk model.
What can I expect about the Linux kernel support? Is there something I can test? Should I contact the libata devs?
Thanks in advance.
Offline
I'd rather look into https://wiki.archlinux.org/title/Solid_ … ted_errors
Did you get such errors when installing linux from the iso?
Online
Hi Seth, thanks for your input.
Did you get such errors when installing linux from the iso?
Sorry I was not clear, I am getting these errors from the live USB. I cannot even install Linux on the SSD. I can go through the partitioning (with a lot of patience, because the I/O regularly hangs), but pacstrap doesn't like the frequent timeouts and fails.
I'd rather look into https://wiki.archlinux.org/title/Solid_ … ted_errors
Thanks for the wiki page. I skimmed through it. It's very frustrating to see the NCQ errors that are so similar to the ones I get. I tested disabling NCQ via sysfs without rebooting, that doesn't fix the issue either.
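For reference, the sysfs route mentioned here usually means writing to the disk's `queue_depth` attribute, where a depth of 1 leaves no room for queued commands. A minimal sketch, assuming the disk is `sda` and that writes are done as root; `ncq_depth` is a hypothetical helper name.

```shell
#!/bin/sh
# Show, and optionally set, the queue depth of a SATA disk through sysfs.
# $1: path to the queue_depth attribute (assumed default: the one for sda)
# $2: optional new depth; writing to the real attribute needs root
ncq_depth() {
    attr=${1:-/sys/block/sda/device/queue_depth}
    if [ -n "${2:-}" ]; then
        echo "$2" > "$attr"
    fi
    cat "$attr"
}
```

Calling `ncq_depth` with no arguments prints the current depth, and `ncq_depth /sys/block/sda/device/queue_depth 1` sets it to 1 and prints the value read back.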
I don't think the ALPM is related to my issue: to my knowledge there is no power saving daemon running in the live environment.
Offline
Try to add "pcie_aspm=off maxcpus=1" to the kernel parameters.
Online
This does not fix the issue. I tried the parameters you suggested, with and without libata.force=noncq, and in both cases the I/O errors are still here.
Offline
Can you post a system journal after such freeze?
sudo journalctl -b | curl -F 'file=@-' 0x0.st
Online
Can you post a system journal after such freeze?
There you go: http://0x0.st/Hxz0.txt
Thanks for your time!
Offline
Nov 28 20:54:51 archiso kernel: ata1.00: exception Emask 0x0 SAct 0x40000 SErr 0x50000 action 0x6 frozen
Nov 28 20:54:51 archiso kernel: ata1: SError: { PHYRdyChg CommWake }
Nov 28 20:54:51 archiso kernel: ata1.00: failed command: READ FPDMA QUEUED
Nov 28 20:54:51 archiso kernel: ata1.00: cmd 60/08:90:00:40:63/00:00:07:00:00/40 tag 18 ncq dma 4096 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 28 20:54:51 archiso kernel: ata1.00: status: { DRDY }
Nov 28 20:54:51 archiso kernel: ata1: hard resetting link
Nov 28 20:54:51 archiso kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 28 20:54:51 archiso kernel: ata1.00: configured for UDMA/133
Nov 28 20:54:51 archiso kernel: ata1.00: device reported invalid CHS sector 0
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 Sense Key : Illegal Request [current]
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 Add. Sense: Unaligned write command
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 CDB: Read(10) 28 00 07 63 40 00 00 00 08 00
Nov 28 20:54:51 archiso kernel: I/O error, dev sda, sector 123944960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Nov 28 20:54:51 archiso kernel: ata1: EH complete
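The CDB line and the reported sector in this excerpt describe the same request. As a worked check, the LBA of a Read(10) command is the big-endian value in CDB bytes 3 to 6, and the transfer length is in bytes 8 and 9 (counting from 1). A short sketch, with `read10_lba` as a hypothetical helper name:

```shell
#!/bin/sh
# Extract the LBA and block count from a Read(10) CDB.
# Arguments: the ten CDB bytes as two-digit hex values.
read10_lba() {
    # bytes 3-6: big-endian logical block address
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    # bytes 8-9: big-endian transfer length in blocks
    len=$(( (0x$8 << 8) | 0x$9 ))
    echo "lba=$lba blocks=$len"
}

read10_lba 28 00 07 63 40 00 00 00 08 00
```

This prints `lba=123944960 blocks=8`, which agrees with the sector in the I/O error line; 8 blocks of 512 bytes is also the 4096 bytes shown in the failed command.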
Does it help to reduce the link speed
libata.force=3.0Gbps
Edit: ftr, "libata.force=noncq" doesn't show up in the journal
Another thing you might try is to change the SATA config in your UEFI (eg. from "ahci" to "compatible", the names might differ)
Last edited by seth (2023-11-28 21:33:42)
Online
Sorry for the delay.
ftr, "libata.force=noncq" doesn't show up in the journal
Indeed, I tested with and without, and I posted the result for the default boot configuration.
Does it help to reduce the link speed
No, it does not.
Here you will find a system journal with all the suggestions you kindly offered "pcie_aspm=off maxcpus=1 libata.force=noncq libata.force=3.0Gbps": http://0x0.st/Hx03.txt
Another thing you might try is to change the SATA config in your UEFI (eg. from "ahci" to "compatible", the names might differ)
The Surface Pro 2 does not have a UEFI interface unfortunately. What I have is barebones, Microsoft calls it standard BIOS. It does not include any SATA config.
However, I have access to the "UEFI Shell" from the Arch ISO. If it is possible to change the SATA config through this shell, please tell me how to do so, as I couldn't find a relevant command in the cmd help.
Thanks again for your help.
Offline
ncq is still enabled in that boot; "libata.force=noncq,3.0Gbps libata.noacpi"
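A possible explanation for NCQ staying enabled: the earlier boot passed `libata.force=` twice, and when an option like this is repeated, one occurrence most likely replaces the other. That is an assumption consistent with the journal, not something verified here, and it is why the values are combined with a comma above. Here is a small sketch for listing what a command line actually contains; `libata_force_values` is a hypothetical helper name.

```shell
#!/bin/sh
# Print every libata.force= value found on a kernel command line.
# $1: command line string; defaults to the running kernel's /proc/cmdline
libata_force_values() (
    set -f    # no glob expansion while splitting the line into words
    cmdline=${1:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            libata.force=*) echo "${word#libata.force=}" ;;
        esac
    done
)

libata_force_values "pcie_aspm=off maxcpus=1 libata.force=noncq libata.force=3.0Gbps"
```

More than one output line means the option was given more than once. With the combined form there should be a single line reading `noncq,3.0Gbps`.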
Online
Still an error, but now the failing ata command is "READ DMA", not queued, which is probably expected.
Full system log: http://0x0.st/HxGX.txt
Offline
You could also disable DMA, but the critical problem is
ata1: SError: { PHYRdyChg CommWake }
- the device doesn't wake up and respond.
Let's try this the other way round:
initcall_blacklist=ahci_init module_blacklist=ahci,libahci
Online
Let's try this the other way round:
initcall_blacklist=ahci_init module_blacklist=ahci,libahci
Still no luck: http://0x0.st/Hx7J.txt
Offline
ahci still loads, try
initcall_blacklist=ahci_init_controller
Online
ahci still loads, try
initcall_blacklist=ahci_init_controller
ahci doesn't load now, but the issue is still here: http://0x0.st/Hxhz.txt
Offline
Yes, it does.
Dec 05 21:31:15 archiso kernel: ahci 0000:00:1f.2: version 3.0
Dec 05 21:31:15 archiso kernel: ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 3 ports 6 Gbps 0x1 impl SATA mode
Dec 05 21:31:15 archiso kernel: ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
Dec 05 21:31:15 archiso kernel: scsi host0: ahci
Dec 05 21:31:15 archiso kernel: scsi host1: ahci
Dec 05 21:31:15 archiso kernel: scsi host2: ahci
wtf can't there just be a normal ahci_init…
initcall_blacklist=ahci_init,ahci_init_one,ahci_pci_init_controller,ahci_init_controller,ahci_init_msi
Online
initcall_blacklist=ahci_init,ahci_init_one,ahci_pci_init_controller,ahci_init_controller,ahci_init_msi
Sorry to report that ahci still loads... http://0x0.st/HxC5.txt
Offline
That's gonna be fun
Try to add ",ahci_port_init,ahci_enable_ahci" to the list
Online
Try to add ",ahci_port_init,ahci_enable_ahci" to the list
Still not enough http://0x0.st/H3TO.txt
Offline
"initcall_blacklist=ahci_pci_driver_init"
Otherwise we'll have to go for "initcall_debug" (that does not belong in the initcall_blacklist list but is a parameter by itself) - it'll probably blow up the journal quite significantly.
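If it comes to `initcall_debug`, the journal can be narrowed down to the relevant init functions with a filter. This is only a sketch: it assumes the kernel logs lines of the form `calling  <function>+... @ <pid>` and `initcall <function>+... returned ...`, which should be checked against a real boot log, and `initcalls_matching` is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only initcall_debug lines whose function name contains a fragment.
# $1: fragment to look for, e.g. "ahci"; the kernel log is read from stdin.
# Assumed line shapes: "calling  NAME+..." and "initcall NAME+... returned".
initcalls_matching() {
    grep -E '(calling|initcall) +[^ ]*'"$1"
}

# Usage (after booting with initcall_debug):
#   journalctl -k -b | initcalls_matching ahci
```

The function names that show up this way are the candidates for `initcall_blacklist`.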
Online
EDITED
"initcall_blacklist=ahci_pci_driver_init"
This removed the errors... because the hard drive is not detected anymore (at least it doesn't show up with "fdisk -l").
Last edited by Itms (2023-12-10 10:57:42)
Offline
I first posted a celebration message when I saw the error messages had disappeared, but it was too soon. I edited my previous post.
Offline
Well, we learned how to block ahci - and that the SFB2 actually uses it
You could also disable DMA
libata.dma=0
Just to make sure we're not chasing a wild goose with a broken disc:
1. do you get the same error w/ other live systems like grml, knoppix or ubuntu?
2. does windows actually still work on that device (can you re-install it or at least boot a windows install medium)?
If you just ignore the errors and install at least a bare minimum system onto the disk and then boot from the disk, does the error remain?
Online
Hi Seth, sorry for the extended absence. I have had no time to dedicate to this investigation.
1. do you get the same error w/ other live systems like grml, knoppix or ubuntu?
Yes, I get the same error with Ubuntu. I have not yet tried other smaller live CD systems such as Knoppix.
2. does windows actually still work on that device (can you re-install it or at least boot a windows install medium)?
Yes, Windows works very well, I can fully re-install without issues. I also discovered that BSD kernels work: I can install a GhostBSD system without any disk issue (alas, the network card is not recognized, so the system is not usable).
If you just ignore the errors and install at least a bare minimum system onto the disk and then boot from the disk, does the error remain?
The errors block the disk for tens of seconds every time, so it is not possible to install the system. pacstrap gives up during the process.
You could also disable DMA
This seems to alleviate the issue but errors remain: http://0x0.st/XiyQ.txt
Nov 01 06:57:01 archiso kernel: ata1: limiting SATA link speed to 1.5 Gbps
Nov 01 06:57:01 archiso kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x50000 action 0x6 frozen
Nov 01 06:57:01 archiso kernel: ata1: SError: { PHYRdyChg CommWake }
Nov 01 06:57:01 archiso kernel: ata1.00: failed command: READ MULTIPLE
Nov 01 06:57:01 archiso kernel: ata1.00: cmd c4/00:00:00:00:00/00:00:00:00:00/e0 tag 18 pio 131072 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 01 06:57:01 archiso kernel: ata1.00: status: { DRDY }
Nov 01 06:57:01 archiso kernel: ata1: hard resetting link
Nov 01 06:57:01 archiso kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 01 06:57:01 archiso kernel: ata1.00: configured for PIO4
Nov 01 06:57:01 archiso kernel: ata1.00: device reported invalid CHS sector 0
Nov 01 06:57:01 archiso kernel: ata1: EH complete
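The SStatus values in these link-up lines encode the negotiated link state as three 4-bit fields: device detection in the low nibble, speed generation in the middle one, and interface power state above that. A minimal sketch for splitting them follows; `decode_sstatus` is a hypothetical helper name, and the argument is the hex value as printed in the log, without a `0x` prefix.

```shell
#!/bin/sh
# Split an SStatus value (hex digits as printed by libata) into its fields.
decode_sstatus() {
    v=$(( 0x$1 ))
    det=$(( v & 0xf ))          # device detection state
    spd=$(( (v >> 4) & 0xf ))   # negotiated speed generation
    ipm=$(( (v >> 8) & 0xf ))   # interface power management state
    case $spd in
        1) speed="1.5 Gbps" ;;
        2) speed="3.0 Gbps" ;;
        3) speed="6.0 Gbps" ;;
        *) speed="unknown" ;;
    esac
    echo "det=$det spd=$spd ($speed) ipm=$ipm"
}

decode_sstatus 133
decode_sstatus 113
```

The two calls print `det=3 spd=3 (6.0 Gbps) ipm=1` and `det=3 spd=1 (1.5 Gbps) ipm=1`. In both cases the device is detected and the interface is active; only the negotiated speed differs, which matches the "limiting SATA link speed to 1.5 Gbps" line.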
Offline