After booting (and only after booting) my system sometimes hangs when opening e.g. a browser or when clicking on a mail for the first time in Thunderbird. Hang means I can move the mouse pointer around but can do nothing else: I can't click on program starters and keyboard shortcuts don't work. This situation lasts some seconds. After that the system works normally, without any noticeable delay. Looking at dmesg after the hang:
[ 80.212749] ata3.00: failed command: WRITE FPDMA QUEUED
[ 80.212750] ata3.00: cmd 61/e8:f8:20:87:8a/00:00:05:00:00/40 tag 31 ncq dma 118784 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 80.212752] ata3.00: status: { DRDY }
[... some more of this ...]
[ 80.212754] ata3: hard resetting link
[ 80.675968] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 80.683338] ata3.00: configured for UDMA/133
[ 80.693637] ahci 0000:01:00.1: port does not support device sleep
[ 80.693748] sd 2:0:0:0: [sdb] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=34s
[ 80.693751] sd 2:0:0:0: [sdb] tag#5 CDB: Read(10) 28 00 02 f2 38 00 00 06 78 00
[ 80.693753] I/O error, dev sdb, sector 49428480 op 0x0:(READ) flags 0x80700 phys_seg 19 prio class 0
[ 80.693766] sd 2:0:0:0: [sdb] tag#6 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=34s
[ 80.693767] sd 2:0:0:0: [sdb] tag#6 CDB: Read(10) 28 00 02 f2 3f 78 00 02 88 00
[ 80.693768] I/O error, dev sdb, sector 49430392 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[... some more of this ...]
[ 80.693790] ata3: EH complete
[ 87.019492] logitech-hidpp-device 0003:046D:2008.0006: HID++ 1.0 device connected.
I ran a long smartctl self-test, but at the end it says:
$:> sudo smartctl -H /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.1-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
The output of "smartctl -x" can be found here
lsblk:
$:> lsblk
NAME   LABEL         FSTYPE   SIZE FSAVAIL MOUNTPOINT       UUID
sda                         232,9G
├─sda1 Windows       ntfs    39,1G                          D4104CE9104CD460
├─sda2                          1K
├─sda5 VMs           ext3    94,2G   69,1G /run/media/VMs   43baff62-585e-4802-b71b-3598be0bb2fb
├─sda6               swap     2,5G                          a6691214-03d7-4792-a31e-e6f49d7d35ab
└─sda7 Daten         ntfs      97G   24,6G /run/media/Daten 4688B73734B455D1
sdb                         447,1G
├─sdb1 arch_root     ext4      40G   19,2G /                dfb61aac-71a9-412e-9528-730c63750354
├─sdb2 arch_swap     swap       4G         [SWAP]           676f657f-55cc-4423-a2ae-6d6b2bda8f62
├─sdb3 arch_home     ext4     100G   18,6G /home            e46de322-93a7-43d2-adeb-49031a288e28
└─sdb4 daten_neu     ext4   303,1G                          1bbf4bfb-d079-438f-a9fa-7ba935123c9f
sdc                         119,2G
├─sdc1 arch_root_new ext4    19,6G                          f4633465-f70c-4b86-9e95-88ab9cac5773
└─sdc2 arch_home_new ext4    58,6G                          cc75b52d-0ac0-4240-b459-80cc905ae795
sdd                             0B
sde                             0B
sr0                          1024M
fstab:
$:> cat /etc/fstab
#
# /etc/fstab: static file system information
#
# <file system> <dir> <type> <options> <dump> <pass>
# root /dev/sdb1
UUID=dfb61aac-71a9-412e-9528-730c63750354 / ext4 rw,noatime,discard 0 1
# /dev/sdb2
UUID=676f657f-55cc-4423-a2ae-6d6b2bda8f62 none swap defaults 0 0
# Datenpartition /dev/sda7
UUID=4688B73734B455D1 /run/media/Daten ntfs defaults,uid=1000,utf8 0 0
# home /dev/sdb3
UUID=e46de322-93a7-43d2-adeb-49031a288e28 /home ext4 defaults,noatime,discard 0 2
# Virtuelle Mashinen /dev/sda5
UUID=43baff62-585e-4802-b71b-3598be0bb2fb /run/media/VMs ext3 defaults 0 2
Should I trust smartctl?
Last edited by quiqueck (2025-06-01 08:33:34)
Offline
Man, your output looks very bad. The device stops answering after a while, and what's worse is that the journal confirms the driver is OK:
[ 80.693748] sd 2:0:0:0: [sdb] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=34s
So, I'm not a hard drive specialist, but I would not trust SMART in this case. Especially if you bought this drive for a strangely low price.
I once bought a very cheap SSD from an online shop. It was supposed to have 2 TB but didn't even have 128 GB. After 2 months it stopped working. Not even Windows SMART tools like DiskGenius could tell whether the SSD was dying or completely OK; even after tests and benchmarks, the disk simply looked fine. But then one day I left it running a very large download (about 100 GB) and the next day it didn't turn on. It wasn't recognized by Linux or by Windows on another computer; I changed the SATA port and still nothing.
If you still have time to look at these outputs, you'd better make a backup.
But again, it's better to look into this with more experienced people; I'm not sure I'm right. Maybe something in the kernel is wrong, so try testing with another OS or another installation.
Another time, my system started showing a lot of I/O errors and I had to reinstall Arch to get it working. Later I discovered that the installation ISO on my live USB was corrupted. Maybe it's something like this.
Last edited by John-Something-Something (2024-12-08 12:23:13)
Offline
Below are the important numbers from the smartctl output.
I'm not great at interpreting those numbers, but Total_Bad_Block & Unexpect_Power_Loss_Ct look worrisome.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
5 Reallocated_Sector_Ct -O--CK 100 100 000 - 0
9 Power_On_Hours -O--CK 100 100 000 - 14351
12 Power_Cycle_Count -O--CK 100 100 000 - 4044
165 Total_Write/Erase_Count -O--CK 100 100 000 - 2751
166 Min_W/E_Cycle -O--CK 100 100 --- - 8
167 Min_Bad_Block/Die -O--CK 100 100 --- - 0
168 Maximum_Erase_Cycle -O--CK 100 100 --- - 19
169 Total_Bad_Block -O--CK 100 100 --- - 787
170 Unknown_Marvell_Attr -O--CK 100 100 --- - 0
171 Program_Fail_Count -O--CK 100 100 000 - 0
172 Erase_Fail_Count -O--CK 100 100 000 - 0
173 Avg_Write/Erase_Count -O--CK 100 100 000 - 8
174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 132
184 End-to-End_Error -O--CK 100 100 --- - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 --- - 0
194 Temperature_Celsius -O---K 080 054 000 - 20 (Min/Max 7/54)
199 SATA_CRC_Error -O--CK 100 100 --- - 0
230 Perc_Write/Erase_Count -O--CK 100 100 000 - 1878 316 1878
232 Perc_Avail_Resrvd_Space PO--CK 100 100 005 - 100
233 Total_NAND_Writes_GiB -O--CK 100 100 --- - 3847
234 Perc_Write/Erase_Ct_BC -O--CK 100 100 000 - 30943
241 Total_Writes_GiB ----CK 100 100 000 - 12319
242 Total_Reads_GiB ----CK 100 100 000 - 9492
244 Thermal_Throttle -O--CK 000 100 --- - 0
Offline
[ 80.212750] ata3.00: cmd 61/e8:f8:20:87:8a/00:00:05:00:00/40 tag 31 ncq dma 118784 out
https://wiki.archlinux.org/title/Solid_ … NCQ_errors
Also see https://wiki.archlinux.org/title/Power_ … Management - set that to "medium_power"
Then see whether the issue returns.
In the meantime, post the actual "[... some more of this ...]"
but at the end it says:
That assertion in smartctl and an empty bag are worth slightly less than the bag.
Total bad blocks looks scary, but you've only written ~4 TB (Total_NAND_Writes_GiB), and while SanDisk doesn't seem to guarantee any TBW for that model (only 3 years), their other 3-year models range from 40 TBW to 100 TBW.
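For anyone who wants to redo that arithmetic with their own SMART values, here is a minimal sketch. The function name tbw_used is made up for illustration, and the 40 and 100 TBW figures are only the range quoted above for other 3-year models, not a rating for this drive.

```shell
#!/bin/sh
# Convert a SMART "GiB written" raw value to decimal TB and compare it with
# an endurance rating. The rating passed in is an assumption: this model has
# no published TBW figure according to the post above.

# usage: tbw_used GIB_WRITTEN RATED_TBW
tbw_used() {
    # LC_ALL=C keeps the decimal point a "." even on locales that use ","
    LC_ALL=C awk -v gib="$1" -v tbw="$2" 'BEGIN {
        tb = gib * 1024 * 1024 * 1024 / 1e12   # GiB -> TB (10^12 bytes)
        printf "%.2f TB written, %.1f%% of %d TBW\n", tb, tb / tbw * 100, tbw
    }'
}
```

With the Total_NAND_Writes_GiB value from the table, `tbw_used 3847 40` prints `4.13 TB written, 10.3% of 40 TBW`, so even against the lowest of those ratings the drive would be far from worn out.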
Offline
All errors from the new boot are here
For the NCQ errors I have added the kernel option libata.force=3.00:noncq. Hopefully I interpreted the values correctly.
Added a udev rule as suggested in https://wiki.archlinux.org/title/Power_ … Management
In the meantime I updated my rescue USB stick, and yes, I made a backup just yesterday.
Will running fsck manually make a difference?
The SSD has an unused partition of 300 GB, will it be safe to use this partition for my /home?
Offline
https://pastebin.mozilla.org/zh7OTmDb is 404 - "libata.force=3.00:noncq" looks ok.
Did the errors return despite that value?
Will running fsck manually make a difference?
fsck can sanitize the filesystem if it has been corrupted as a consequence of the IO errors; it cannot do anything about the IO errors themselves.
Those happen on a lower level.
The SSD has an unused partition of 300 GB, will it be safe to use this partition for my /home?
Impossible to say.
1. Is there a structural problem w/ the drive in the first place, or is this just bogus NCQ/ALPM?
2. If yes, is it sector degradation or does the controller give up?
A mild hint would be whether always the same sectors are affected (then you're probably facing bad sectors/blocks indeed).
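A rough way to check that is to pull the sector numbers out of the kernel log and count repeats. This is only a sketch: sector_hits is a made-up name, it assumes the "I/O error, dev ..., sector ..." line format posted in this thread, and it only compares the start sector of each failed request, so overlapping requests that begin elsewhere are not matched.

```shell
#!/bin/sh
# Read kernel log text on stdin and count how often each sector appears in
# "I/O error, dev <disk>, sector <n>" lines for the given disk.
# Output: one "<sector> x<count>" line per sector, sorted by sector number.
# usage: dmesg | sector_hits sdb
sector_hits() {
    sed -n "s|.*I/O error, dev $1, sector \([0-9][0-9]*\).*|\1|p" |
        sort -n | uniq -c | awk '{ print $2 " x" $1 }'
}
```

Feeding it several boots' worth of logs (for example from the journal) makes the comparison more meaningful; counts above 1 on the same sectors across boots would point towards bad blocks, while sectors scattered everywhere would not.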
Offline
Sorry, forgot to change the time, new pastebin: https://pastebin.mozilla.org/cZnN1MaH
Edit:
rsynced the old /home to the other partition. I will see if i use this partition in fstab as /home.
Last edited by quiqueck (2024-12-08 19:46:38)
Offline
The sectors are all over the place, errors are mostly read (NAND typically loses the ability to write, not read) - but NCQ is still active.
This was before you disabled it?
Offline
The sectors are all over the place, errors are mostly read (NAND typically loses the ability to write, not read) - but NCQ is still active.
This was before you disabled it?
Yes, you asked for that
In the meantime, post the actual "[... some more of this ...]"
After booting this morning (NCQ not used) all seems fine:
[ 1.903897] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1.904368] ata3.00: FORCE: modified (noncq)
[ 1.904584] ata3.00: ATA-9: SanDisk SSD PLUS 480GB, UG5000RL, max UDMA/133
[ 1.904588] ata3.00: 937721856 sectors, multi 1: LBA48 NCQ (not used)
[ 1.905412] ata3.00: Features: Dev-Sleep
[ 1.908922] ata3.00: configured for UDMA/133
and:
$:> sudo dmesg | grep sdb
[ 1.920173] sd 2:0:0:0: [sdb] 937721856 512-byte logical blocks: (480 GB/447 GiB)
[ 1.920192] sd 2:0:0:0: [sdb] Write Protect is off
[ 1.920194] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 1.920208] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.920229] sd 2:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
[ 1.972498] sdb: sdb1 sdb2 sdb3 sdb4
[ 1.972666] sd 2:0:0:0: [sdb] Attached SCSI disk
[ 5.705156] EXT4-fs (sdb1): mounted filesystem dfb61aac-71a9-412e-9528-730c63750354 r/w with ordered data mode. Quota mode: none.
[ 6.666982] EXT4-fs (sdb1): re-mounted dfb61aac-71a9-412e-9528-730c63750354 r/w. Quota mode: none.
[ 7.785018] Adding 4194300k swap on /dev/sdb2. Priority:-2 extents:1 across:4194300k SS
[ 7.987072] EXT4-fs (sdb3): mounted filesystem e46de322-93a7-43d2-adeb-49031a288e28 r/w with ordered data mode. Quota mode: none.
Many thanks, seth, for the hints! I guess I have to keep an eye on this...
Checking my journal, I found that those errors had already appeared in September:
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: exception Emask 0x0 SAct 0x80c007fe SErr 0x40000 action 0x6 frozen
Sep 06 20:02:10 kaputtnik-arch kernel: ata3: SError: { CommWake }
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: cmd 60/08:08:f8:1f:e6/00:00:08:00:00/40 tag 1 ncq dma 4096 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: status: { DRDY }
The closest system upgrade was on the same day, about 8 hours before this happened for the first time. Could the kernel upgrade have caused this?
[2024-09-06T11:40:05+0200] [ALPM] upgraded linux (6.10.7.arch1-1 -> 6.10.8.arch1-1)
[2024-09-06T11:40:07+0200] [ALPM] upgraded linux-headers (6.10.7.arch1-1 -> 6.10.8.arch1-1)
Not sure if I should mark this as solved right now.
Offline
Forgot to mention that I still use the old /home partition.
Offline
Yes, you asked for that
Yes, just wanted to make sure the ncq-disabling parameter actually applied.
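One way to make that check on any boot is to look for the NCQ state libata prints when it identifies the device. A minimal sketch, assuming the log line format shown in this thread ("... sectors, multi 1: LBA48 NCQ (not used)"); the function name ncq_status is made up for illustration.

```shell
#!/bin/sh
# Read kernel log text on stdin and print, per ATA device, the NCQ state
# that libata reported at identify time, e.g. from a line such as
#   ata3.00: 937721856 sectors, multi 1: LBA48 NCQ (not used)
# Devices whose identify line does not mention NCQ produce no output.
# usage: dmesg | ncq_status
ncq_status() {
    sed -n 's/.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): .* sectors, .*\(NCQ.*\)$/\1 \2/p'
}
```

For the boot log posted above this prints `ata3.00 NCQ (not used)`. As far as I know the kernel reports a queue depth in that place when NCQ is in use, so anything other than "(not used)" for the forced device suggests the parameter did not apply.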
I guess i have to keep an eye on this...
Yup. You could set up a cron job or systemd timer that scans the journal of the past 24h once a day and uses notify-send to send you a message in case something shows up.
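A minimal sketch of such a check, under a few assumptions: the function names are made up, the grep pattern only covers the error lines posted in this thread, and notify-send needs access to your desktop session bus, which is usually easier from a systemd user timer than from a system cron job.

```shell
#!/bin/sh
# Daily check sketch: scan the last 24h of kernel messages in the journal
# for the libata / block-layer errors seen in this thread and raise a
# desktop notification if any turn up.

# Keep only lines that look like the errors posted above (reads stdin).
ata_errors() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (failed command|hard resetting link|exception Emask)|I/O error, dev sd[a-z]+'
}

# Run the scan; meant to be called once a day by cron or a user timer.
check_journal() {
    # _TRANSPORT=kernel selects kernel messages without limiting the
    # search to the current boot.
    hits=$(journalctl _TRANSPORT=kernel --since=-24h --no-pager | ata_errors)
    if [ -n "$hits" ]; then
        count=$(printf '%s\n' "$hits" | wc -l)
        notify-send -u critical "Disk errors in journal" \
            "$count suspicious kernel line(s) in the last 24h"
    fi
}
```

To use it, save the functions in a script that ends with a call to `check_journal` and schedule that script daily. Adjust the pattern if your errors look different from the ones in this thread.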
Offline
Although I haven't noticed any errors since applying the changes above, a second hard drive also showed some strange smartctl results, so I bought a new 1 TB SSD and replaced the questionable drives. I also removed the NCQ-related kernel parameter and the power management settings.
All seems fine now.
Thanks again for the comprehensive help!
Solved
Offline