Hi there! My old laptop acts as a server for a few things. I noticed plenty of read errors, so I booted into an Arch ISO to run `badblocks` and force the drive to swap in spare sectors. There I'm seeing what I think has been the cause all along:
```
ata1.00: failed command: READ FPDMA QUEUED
cmd 60/08:78:00:a5:b3/00:00:71:00:00/40 tag 15 ncq dma 4096 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: reset failed, giving up
I/O error, dev sda, sector 1907598592 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
I/O error, dev sda, sector 1907599840 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
...
Buffer I/O error on dev dm-2, logical block 2128880, async page read
...
```
What's the likely cause? This is on an HP by-0062st. Do I need to replace the drive, and is there a temporary fix in the meantime?
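To see whether the failing reads cluster in one region of the disk, the sector numbers can be pulled out of a saved kernel log. This is only a minimal Python sketch: the function name `failing_sectors` is made up for illustration, and the sample lines are copied from the excerpt above.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   I/O error, dev sda, sector 1907598592 op 0x0:(READ) flags 0x800 ...
IO_ERROR = re.compile(r"I/O error, dev (\S+), sector (\d+)")

def failing_sectors(log_text):
    """Return {device: sorted list of distinct failing sector numbers}."""
    found = defaultdict(set)
    for match in IO_ERROR.finditer(log_text):
        device, sector = match.group(1), int(match.group(2))
        found[device].add(sector)
    return {dev: sorted(sectors) for dev, sectors in found.items()}

# Sample input: the two I/O error lines quoted in this post.
sample = """\
I/O error, dev sda, sector 1907598592 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
I/O error, dev sda, sector 1907599840 op 0x0:(READ) flags 0x800 phys_seg 1 prio class 0
"""

for dev, sectors in failing_sectors(sample).items():
    span = sectors[-1] - sectors[0]
    print(f"{dev}: {len(sectors)} failing sectors, spread over {span} sectors")
```

Feeding it the full saved `dmesg` text instead of `sample` would show how widely the errors are spread, which may help when describing the problem.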
Offline
Can you provide the full `dmesg` output from a boot with the issue, and also run a S.M.A.R.T. self-test on the drive? When the test has completed, please post the output of `smartctl -a /dev/sda`.
Offline
How do we upload files here?
Offline
See the examples from Pastebin without a dedicated client.
Last edited by loqs (2024-05-29 00:50:36)
Offline
[dmesg](http://0x0.st/XNAg.txt)
smartctl fails:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.2-arch2-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/sda failed: INQUIRY failed
Offline
Wonder if the ST1000LM035-1RK1 needs a quirk like the ST1000LM024 HN-M101MBB https://github.com/torvalds/linux/blob/ … re.c#L4126
Offline
I mean, this system has been operating fine for years, would it really suddenly happen like this if that's the issue?
Offline
> I mean, this system has been operating fine for years, would it really suddenly happen like this if that's the issue?

Yes, if the system had just been updated to a Linux 6.9-based kernel.
Offline
I have no idea if it has been updated as I'm just replicating this issue on a live USB... is there an older Arch ISO version I could use with 6.8 to test maybe?
Offline
Offline
Which version should I use? Anything from before May?
Offline
Wait, this system already is 6.8 right? The logs show "Linux version 6.8.2-arch2-1 (linux@archlinux)"
Offline
Which logs show what, when, and where?
Try `uname -a`…
Offline
The dmesg output sent in #5: http://0x0.st/XNAg.txt
The first line shows the kernel version as 6.8. I also know for a fact the non-live system wasn't updated that recently.
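For anyone wanting to check a saved log the same way, the kernel version can be read from the `Linux version` banner and compared against 6.9. A small Python sketch; the helper name `kernel_version` is hypothetical, and the banner string is the one quoted from the log above.

```python
import re

def kernel_version(dmesg_text):
    """Return (major, minor) from the 'Linux version X.Y...' banner, or None."""
    match = re.search(r"Linux version (\d+)\.(\d+)", dmesg_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# Banner as quoted from the posted dmesg output.
banner = "Linux version 6.8.2-arch2-1 (linux@archlinux)"
version = kernel_version(banner)
print(version, "predates 6.9:", version < (6, 9))
```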
Offline
That log is from the April install ISO on 6.8, yes.
Can you post "lspci -tvnn"? There's certainly some bus issue (see the rtw_8821ce noise)
Offline
That noise in the logs is probably caused by my rtl8821ce network card, which is not supported directly in the kernel. I hardwire this system, so I don't usually bother installing the driver. Here are the results of `lspci -tvnn`: http://0x0.st/XN3g.txt
Offline
[ 690.803423] pcieport 0000:00:1c.5: AER: Multiple Correctable error message received from 0000:02:00.0
[ 690.803460] rtw_8821ce 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[ 690.803470] rtw_8821ce 0000:02:00.0: device [10ec:c821] error status/mask=00000001/0000e000
[ 690.803480] rtw_8821ce 0000:02:00.0: [ 0] RxErr (First)
These are not driver issues; there's interference on the bus signal.
Can you disable the wifi chip?
Your NVMe is on the same bus and the ATA controller is adjacent.
Have you ever opened the case and fumbled around inside?
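To check whether only the wifi chip reports these errors, the `PCIe Bus Error` lines in a saved log can be tallied per device. A rough Python sketch under the assumption that the lines look like the ones quoted above; the name `count_aer_errors` is invented for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   rtw_8821ce 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
AER_LINE = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=([^,]+), type=([^,]+)"
)

def count_aer_errors(log_text):
    """Count PCIe AER reports per (driver, PCI address, severity, type)."""
    return Counter(m.groups() for m in AER_LINE.finditer(log_text))

# Sample input: two of the lines quoted in this post.
sample = """\
[ 690.803423] pcieport 0000:00:1c.5: AER: Multiple Correctable error message received from 0000:02:00.0
[ 690.803460] rtw_8821ce 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
"""

for (driver, address, severity, kind), count in count_aer_errors(sample).items():
    print(f"{address} ({driver}): {count} x {severity} / {kind}")
```

If other devices on the same bus show up in the tally as well, that would fit a signal problem better than a single misbehaving card.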
Offline
The last time I opened the case was to install the NVME disk and upgrade the RAM. Should I open it up and reseat something?
This is an HP by-0062st if that helps
Offline
Have you tried disabling the wifi NIC?
Has the system ever worked properly after installing the NVMe (which is on the same bus as the error-yelling wifi and doesn't properly show up itself either)?
Offline
Yes, the system has worked fine for years. This install was made about a year ago.
No, I have not tried to disable anything like that. I also do not remember ever seeing these floods of errors in the syslogs in the past; I feel I would have noticed.
Here's how the problem showed up:
- A few days ago, the system remounted read-only (uptime at that point was around 30 days). Status monitoring caught all of the resulting outages. I could not easily read historical logs in dmesg because I wasn't sure how to turn down the kernel loglevel so that my pager would work properly, so I synced and then reset via SysRq.
- The problem occurred again the next day, again with just a wall of I/O errors. On my next attempt to boot, GRUB failed with some sort of allocator magic error.
- I then booted into an Arch USB and pulled the dmesg logs you're seeing.
Offline
I'd have a look at the hardware/cables/connectors again, possibly remove the NVMe, disable/remove the wifi chip (if possible), and test the ATA drive in a different host. The degradation and GRUB issues suggest it's the hardware, but the error messages and SMART failure seem more like a problem w/ the bus than the disk.
Offline
The disk appeared to load successfully in another system, but timeouts began occurring when attempting to shut it down: something about the FLUSH command.
After removing the CD drive, the NIC, and the M.2 SSD, the error still occurs. dmesg: http://0x0.st/XN7h.txt
Offline
> something about the FLUSH command.
Do you have those?
This looks increasingly like the disk is dying.
Offline
Here's what I was able to capture from putting the disk into another machine (Linux 6.8.8-arch1-1 #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux): http://imgur.com/a/XbiNKy4
Should I try removing the HDD and booting with just the SSD in? I can reinstall; I have very strong backups so it'd be some work to restore but possible.
Offline
Because imgur happened to go down the second I sent that:
[ OK ] Stopped User Login Management.
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata3.00: failed command: FLUSH CACHE EXT
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 14
res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ *] A stop job is running for User Manager for UID 1000 (43s / 2min)
Last edited by 0xlogn (2024-06-01 15:46:03)
Offline