
#1 2023-11-25 22:20:01

Itms
Member
Registered: 2023-11-25
Posts: 17

[SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Hello,
I am replacing Windows with Arch Linux on an old Surface Pro 2 device.

I am receiving error messages from libata when I try to perform I/O on the SSD. The system seems to freeze, then I get the error messages, and then the disk seems to perform normally for a while before freezing again.

For instance:

# fdisk -l
[  737.107982] ata1.00: exception Emask 0x0 SAct 0x100 SErr 0x50000 action 0x6 frozen
[  737.108023] ata1: SError: { PHYRdyChg CommWake }
[  737.108041] ata1.00: failed command: READ FPDMA QUEUED
[  737.108057] ata1.00: cmd 60/08:40:00:00:00/00:00:00:00:00/40 tag 8 ncq dma 4096 in
[  737.108057]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  737.108098] ata1.00: status: { DRDY }
[  737.434619] I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
// followed by the expected `fdisk` output

The disk might be dead; however, I had no disk issues on Windows. I was using the device daily without any noticeable freezes or disk issues just before trying to switch to Linux. Unfortunately I wiped Windows, so I cannot go back and check now.

I tried to boot with

libata.force=noncq

with no luck.
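
A note for readers trying the same thing: before concluding that a parameter has no effect, it can help to confirm that the running kernel actually received it. The sketch below is a minimal helper, not part of the original post; the function name `has_kernel_param` is made up for illustration, and it only checks for an exact, whitespace-separated match in /proc/cmdline.

```shell
# has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in the kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline; passing it
# explicitly is only meant for trying the function out.
has_kernel_param() {
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example use on the live system:
#   has_kernel_param libata.force=noncq && echo "noncq was passed to the kernel"
```

Because the match is exact, a combined value such as libata.force=noncq,3.0Gbps would have to be searched for in that full form.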

Here is the smartctl output:

# smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.9-arch2-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZMPC064HBDR-000MV
Serial Number:    S19ENEAD113868
LU WWN Device Id: 5 002538 043584d30
Firmware Version: CXM14M1Q
User Capacity:    64,023,257,088 bytes [64.0 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-2 T13/2015-D revision 2
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Nov 25 21:46:41 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  300) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   5) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       2626
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       2750
177 Wear_Leveling_Count     0x0013   093   093   000    Pre-fail  Always       -       225
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   093   093   010    Pre-fail  Always       -       100
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   093   093   010    Pre-fail  Always       -       1532
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   057   040   000    Old_age   Always       -       43
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   253   253   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       63
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9635859630

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

I have also taken a look at possible firmware updates, but the SSD model seems quite uncommon. I am considering booting some kind of Windows trial in order to run Samsung Magician...

I would like to make 100% sure that the SSD is dead, and that this is not an issue with the kernel not handling the Surface hardware correctly. I saw nothing about disks on the linux-surface project wiki, which I'm following. If the disk is indeed dead, I would like to understand why booting Linux suddenly killed it.

Unfortunately, I have very little knowledge of hardware handling/debugging on Linux, especially in the topic of storage drives. I would appreciate your help in exploring this issue.
Thanks in advance,

Last edited by Itms (2024-04-09 08:35:45)

Offline

#2 2023-11-26 10:28:49

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Itms wrote:

I am considering booting some kind of Windows trial in order to run Samsung Magician...

I was able to reinstall Windows and, thanks to the "digital license" thing I didn't know about, I am back with a working Windows install despite having an OEM copy. I ran Samsung Magician, which offers no firmware upgrade.

I confirm that the disk works without issue under Windows. The issue comes from the Linux kernel.

Firmware Version: CXM14M1Q

I discovered that this firmware code is "blacklisted" (I do not know what this term entails) in the libata source code: https://elixir.bootlin.com/linux/v6.6.2 … re.c#L4154

Please note that the disk model associated with the firmware in libata-core.c does not match my disk model.

What can I expect about the Linux kernel support? Is there something I can test? Should I contact the libata devs?

Thanks in advance.

Offline

#3 2023-11-26 17:51:53

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

I'd rather look into https://wiki.archlinux.org/title/Solid_ … ted_errors
Did you get such errors when installing linux from the iso?

Offline

#4 2023-11-27 21:30:58

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Hi Seth, thanks for your input.

seth wrote:

Did you get such errors when installing linux from the iso?

Sorry, I was not clear: I am getting these errors from the live USB. I cannot even install Linux on the SSD. I can go through the partitioning (with a lot of patience, because the I/O regularly hangs), but pacstrap doesn't like the frequent timeouts and fails.

Thanks for the wiki page. I skimmed through it. It's very frustrating to see NCQ errors that are so similar to the ones I get. I tested disabling NCQ via sysfs without rebooting, but that doesn't fix the issue either.
I don't think ALPM is related to my issue: to my knowledge there is no power saving daemon running in the live environment.
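
For anyone wanting to repeat these two checks, here is a rough sketch of how they can be done from a root shell. It assumes the disk is sda and that the usual sysfs files (queue_depth under the block device, link_power_management_policy under each SCSI host) are present. The function names and the SYSFS_ROOT variable are made up for this example; SYSFS_ROOT only exists so the helpers can be tried against a scratch directory.

```shell
# Hypothetical override; on a real system this stays at /sys.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Print the current queue depth of a disk, e.g. "ncq_depth sda".
# A depth of 1 means commands are no longer queued.
ncq_depth() {
    cat "$SYSFS_ROOT/block/$1/device/queue_depth"
}

# Turn NCQ off at runtime by forcing the queue depth to 1 (needs root).
disable_ncq() {
    echo 1 > "$SYSFS_ROOT/block/$1/device/queue_depth"
}

# Print the link power management (ALPM) policy of every SATA host.
alpm_policies() {
    for f in "$SYSFS_ROOT"/class/scsi_host/host*/link_power_management_policy; do
        if [ -r "$f" ]; then
            host=${f%/link_power_management_policy}
            printf '%s: %s\n' "${host##*/}" "$(cat "$f")"
        fi
    done
}
```

If a host reports max_performance, link power management should not be what puts the link to sleep, whether or not a power saving daemon is running.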

Offline

#5 2023-11-27 21:54:08

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Try to add "pcie_aspm=off maxcpus=1" to the kernel parameters.

Offline

#6 2023-11-28 07:59:35

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

This does not fix the issue. I tried the parameters you suggest, with and without libata.force=noncq, and in both cases the I/O errors are still here.

Offline

#7 2023-11-28 14:15:25

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Can you post a system journal after such a freeze?

sudo journalctl -b | curl -F 'file=@-' 0x0.st
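
Side note for readers: the full journal is long, so a small filter can help when looking at it locally. The sketch below assumes the default short journal format seen in this thread ("... archiso kernel: ata1: ..."); the name `ata_lines` and the exact pattern are only an illustration and may miss relevant lines.

```shell
# Keep only kernel lines that mention an ATA port, the ahci driver,
# a SCSI disk, or a block-layer I/O error. Reads from standard input.
ata_lines() {
    grep -E 'kernel: (ata[0-9]+(\.[0-9]+)?: |ahci |sd [0-9]+:|I/O error)'
}

# Example use:
#   journalctl -b -k | ata_lines
```

The unfiltered journal is still the right thing to upload, since other lines (ACPI, PCI, power management) may matter too.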

Offline

#8 2023-11-28 20:15:30

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

seth wrote:

Can you post a system journal after such a freeze?

There you go: http://0x0.st/Hxz0.txt
Thanks for your time!

Offline

#9 2023-11-28 21:29:51

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Nov 28 20:54:51 archiso kernel: ata1.00: exception Emask 0x0 SAct 0x40000 SErr 0x50000 action 0x6 frozen
Nov 28 20:54:51 archiso kernel: ata1: SError: { PHYRdyChg CommWake }
Nov 28 20:54:51 archiso kernel: ata1.00: failed command: READ FPDMA QUEUED
Nov 28 20:54:51 archiso kernel: ata1.00: cmd 60/08:90:00:40:63/00:00:07:00:00/40 tag 18 ncq dma 4096 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 28 20:54:51 archiso kernel: ata1.00: status: { DRDY }
Nov 28 20:54:51 archiso kernel: ata1: hard resetting link
Nov 28 20:54:51 archiso kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 28 20:54:51 archiso kernel: ata1.00: configured for UDMA/133
Nov 28 20:54:51 archiso kernel: ata1.00: device reported invalid CHS sector 0
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 Sense Key : Illegal Request [current] 
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 Add. Sense: Unaligned write command
Nov 28 20:54:51 archiso kernel: sd 0:0:0:0: [sda] tag#18 CDB: Read(10) 28 00 07 63 40 00 00 00 08 00
Nov 28 20:54:51 archiso kernel: I/O error, dev sda, sector 123944960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Nov 28 20:54:51 archiso kernel: ata1: EH complete
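
Reading aid for the log above: the CDB line and the "I/O error ... sector" line describe the same request. In a READ(10) command, bytes 2 to 5 hold the starting LBA and bytes 7 to 8 the number of sectors, both big-endian. A minimal sketch to decode it (the function name `read10_decode` is made up for this example):

```shell
# read10_decode "28 00 07 63 40 00 00 00 08 00"
# Prints "<start LBA> <sector count>" for a 10-byte READ(10) CDB
# given as space-separated hex bytes.
read10_decode() {
    set -- $1                       # split the string into ten bytes
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a READ(10) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))           # bytes 2..5: logical block address
    count=$(( 0x$8$9 ))             # bytes 7..8: transfer length in sectors
    printf '%s %s\n' "$lba" "$count"
}

read10_decode "28 00 07 63 40 00 00 00 08 00"
```

This prints 123944960 8, which matches the sector in the I/O error line; 8 sectors of 512 bytes are the 4096 bytes shown in the failed command.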

Does it help to reduce the link speed?

libata.force=3.0Gbps

Edit: ftr, "libata.force=noncq" doesn't show up in the journal
Another thing you might try is to change the SATA config in your UEFI (eg. from "ahci" to "compatible", the names might differ)

Last edited by seth (2023-11-28 21:33:42)

Offline

#10 2023-12-04 20:28:46

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Sorry for the delay.

seth wrote:

ftr, "libata.force=noncq" doesn't show up in the journal

Indeed, I tested with and without, and I posted the result for the default boot configuration.

seth wrote:

Does it help to reduce the link speed?

No, it does not.

Here you will find a system journal with all the suggestions you kindly offered "pcie_aspm=off maxcpus=1 libata.force=noncq libata.force=3.0Gbps": http://0x0.st/Hx03.txt

seth wrote:

Another thing you might try is to change the SATA config in your UEFI (eg. from "ahci" to "compatible", the names might differ)

The Surface Pro 2 does not have a UEFI interface unfortunately. What I have is barebones, Microsoft calls it standard BIOS. It does not include any SATA config.

However, I have access to the "UEFI Shell" from the Arch ISO. If it is possible to change the SATA config through this shell, please tell me how to do so, as I couldn't find a relevant command in the cmd help.

Thanks again for your help.

Offline

#11 2023-12-04 21:08:26

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

ncq is still enabled in that boot; "libata.force=noncq,3.0Gbps libata.noacpi"
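
Note for readers: the previous boot passed libata.force= twice, and NCQ stayed enabled, which suggests that only one of the two occurrences took effect. Putting all values into a single comma-separated libata.force= avoids the question. The helper below is just an illustration of that merge; the name `merge_libata_force` is made up, and it moves the merged parameter to the end of the line.

```shell
# merge_libata_force "kernel command line"
# Collapses every libata.force=... word into one comma-separated parameter
# and prints the resulting command line.
merge_libata_force() {
    merged=
    rest=
    for word in $1; do
        case $word in
            libata.force=*)
                merged=${merged:+$merged,}${word#libata.force=} ;;
            *)
                rest=${rest:+$rest }$word ;;
        esac
    done
    if [ -n "$merged" ]; then
        rest=${rest:+$rest }libata.force=$merged
    fi
    printf '%s\n' "$rest"
}

merge_libata_force "pcie_aspm=off maxcpus=1 libata.force=noncq libata.force=3.0Gbps"
```

For the parameters used in the previous boot this prints pcie_aspm=off maxcpus=1 libata.force=noncq,3.0Gbps.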

Offline

#12 2023-12-04 21:35:26

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Still an error, but now the failing ata command is "READ DMA", not queued, which is probably expected.

Full system log: http://0x0.st/HxGX.txt

Offline

#13 2023-12-05 07:40:04

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

You could also disable DMA, but the critical problem is

ata1: SError: { PHYRdyChg CommWake }

- the device doesn't wake up and respond.
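
For readers wondering where those two names come from: they are bits of the SErr value in the exception line (SErr 0x50000). Bit 16 flags a change of the PHY ready state and bit 18 flags a detected wake signal on the link. The sketch below decodes a value into names; the table only covers bits 16 to 23, the labels are written from memory to resemble the kernel's, and any other bit is printed as bitN.

```shell
# Map one SErr bit number to a short label (partial table, see note above).
serr_bit_name() {
    case $1 in
        16) echo PHYRdyChg ;;
        17) echo PHYInt ;;
        18) echo CommWake ;;
        19) echo 10B8B ;;
        20) echo Dispar ;;
        21) echo BadCRC ;;
        22) echo Handshk ;;
        23) echo LinkSeq ;;
        *)  echo "bit$1" ;;
    esac
}

# decode_serr 0x50000
# Prints the labels of all bits set in the given SErr value.
decode_serr() {
    value=$(( $1 ))
    names=
    bit=0
    while [ "$bit" -lt 32 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            names=${names:+$names }$(serr_bit_name "$bit")
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$names"
}

decode_serr 0x50000
```

This prints PHYRdyChg CommWake, the same pair the kernel reports for every failure in this thread.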

Let's try this the other way round:

initcall_blacklist=ahci_init module_blacklist=ahci,libahci

Offline

#14 2023-12-05 20:39:51

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

seth wrote:

Let's try this the other way round:

initcall_blacklist=ahci_init module_blacklist=ahci,libahci

Still no luck: http://0x0.st/Hx7J.txt

Offline

#15 2023-12-05 21:09:31

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

ahci still loads, try

initcall_blacklist=ahci_init_controller

Offline

#16 2023-12-05 21:37:36

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

seth wrote:

ahci still loads, try

initcall_blacklist=ahci_init_controller

ahci doesn't load now, but the issue is still here: http://0x0.st/Hxhz.txt

Offline

#17 2023-12-06 07:35:35

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Yes, it does.

Dec 05 21:31:15 archiso kernel: ahci 0000:00:1f.2: version 3.0
Dec 05 21:31:15 archiso kernel: ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 3 ports 6 Gbps 0x1 impl SATA mode
Dec 05 21:31:15 archiso kernel: ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst 
Dec 05 21:31:15 archiso kernel: scsi host0: ahci
Dec 05 21:31:15 archiso kernel: scsi host1: ahci
Dec 05 21:31:15 archiso kernel: scsi host2: ahci

wtf can't there just be a normal ahci_init…

initcall_blacklist=ahci_init,ahci_init_one,ahci_pci_init_controller,ahci_init_controller,ahci_init_msi

Offline

#18 2023-12-06 09:11:50

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

seth wrote:

initcall_blacklist=ahci_init,ahci_init_one,ahci_pci_init_controller,ahci_init_controller,ahci_init_msi

Sorry to report that ahci still loads... :( http://0x0.st/HxC5.txt

Offline

#19 2023-12-06 13:25:16

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

That's gonna be fun :(
Try to add ",ahci_port_init,ahci_enable_ahci" to the list

Offline

#20 2023-12-08 11:36:53

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

seth wrote:

Try to add ",ahci_port_init,ahci_enable_ahci" to the list

Still not enough :( :( http://0x0.st/H3TO.txt

Offline

#21 2023-12-08 15:16:25

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

"initcall_blacklist=ahci_pci_driver_init"
Otherwise we'll have to go for "initcall_debug" (that does not belong in the initcall_blacklist list but is a parameter by itself) - it'll probably blow up the journal quite significantly.

Offline

#22 2023-12-10 10:49:04

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

EDITED

seth wrote:

"initcall_blacklist=ahci_pci_driver_init"

This removed the errors... because the disk is not detected anymore (at least it doesn't show up with "fdisk -l").

Last edited by Itms (2023-12-10 10:57:42)

Offline

#23 2023-12-10 10:58:38

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

I first posted a celebration message when seeing the messages had disappeared, but it was too soon. I edited my previous post.

Offline

#24 2023-12-10 20:50:07

seth
Member
Registered: 2012-09-03
Posts: 52,545

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Well, we learned how to block ahci - and that the SFB2 actually uses it :(

seth wrote:

You could also disable DMA

libata.dma=0

Just to make sure we're not chasing a wild goose with a broken disk:
1. do you get the same error w/ other live systems like grml, knoppix or ubuntu?
2. does windows actually still work on that device (can you re-install it or at least boot a windows install medium)?

If you just ignore the errors and install at least a bare minimum system onto the disk and then boot from the disk, does the error remain?

Offline

#25 2024-04-08 13:58:01

Itms
Member
Registered: 2023-11-25
Posts: 17

Re: [SOLVED] libata errors when accessing the SSD on a Surface Pro 2

Hi Seth, sorry for the extended absence. I have had no time to dedicate to this investigation.

seth wrote:

1. do you get the same error w/ other live systems like grml, knoppix or ubuntu?

Yes, I get the same error with Ubuntu. I have not yet tried other smaller live CD systems such as Knoppix.

seth wrote:

2. does windows actually still work on that device (can you re-install it or at least boot a windows install medium)?

Yes, Windows works very well, I can fully re-install without issues. I also discovered that BSD kernels work: I can install a GhostBSD system without any disk issue (alas, the network card is not recognized, so the system is not usable).

seth wrote:

If you just ignore the errors and install at least a bare minimum system onto the disk and then boot from the disk, does the error remain?

The errors block the disk for tens of seconds every time, so it is not possible to install the system. pacstrap gives up during the process.

seth wrote:

You could also disable DMA

This seems to alleviate the issue but errors remain: http://0x0.st/XiyQ.txt

Nov 01 06:57:01 archiso kernel: ata1: limiting SATA link speed to 1.5 Gbps
Nov 01 06:57:01 archiso kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x50000 action 0x6 frozen
Nov 01 06:57:01 archiso kernel: ata1: SError: { PHYRdyChg CommWake }
Nov 01 06:57:01 archiso kernel: ata1.00: failed command: READ MULTIPLE
Nov 01 06:57:01 archiso kernel: ata1.00: cmd c4/00:00:00:00:00/00:00:00:00:00/e0 tag 18 pio 131072 in
                                         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov 01 06:57:01 archiso kernel: ata1.00: status: { DRDY }
Nov 01 06:57:01 archiso kernel: ata1: hard resetting link
Nov 01 06:57:01 archiso kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 01 06:57:01 archiso kernel: ata1.00: configured for PIO4
Nov 01 06:57:01 archiso kernel: ata1.00: device reported invalid CHS sector 0
Nov 01 06:57:01 archiso kernel: ata1: EH complete

Offline
