Thanks for all the tips. I think I'm gonna give libata.noacpi a try.
Inofficial first vice preseident of the Rust Evangelism Strike Force
System 23, whose SSD I manually sent to sleep, tripped again just now:
Aug 19 21:49:32 23 nginx[369]: 2022/08/19 21:49:32 [error] 369#369: *38 open() "/usr/share/application-html/pacman-lockfile" failed (2: No such file or directory), client: 127.0.0.1, server: loc>
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 01 78 ca 28 00 00 18 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693288 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086664)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958405
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958406
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958407
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 01 78 cb a8 00 00 10 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693672 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086711)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958453
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958454
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 01 78 cb 88 00 00 08 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693640 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086706)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958449
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 01 78 ca e0 00 00 08 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693472 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086685)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958428
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 01 78 ca 48 00 00 10 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693320 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086667)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958409
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958410
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 1c 1e 78 c8 00 00 08 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 471759048 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9306257 starting block 58969882)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 58841625
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 0e d8 84 f8 00 00 28 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 249070840 op 0x1:(WRITE) flags 0x800 phys_seg 5 prio class 0
Aug 19 21:50:18 23 kernel: Aborting journal on device sda2-8.
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #9308401: comm CacheThread_Blo: mark_inode_dirty error
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #9308401: comm Chrome_IOThread: mark_inode_dirty error
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm chromium: Detected aborted journal
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs (sda2): Remounting filesystem read-only
System 1284 on the other hand is still running with libata.noacpi.
journal hdparm
I will add libata.noacpi to 23 and monitor their further behaviour over the weekend.
Last edited by schard (2022-08-19 20:21:38)
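As a cross-check on the log above: the sector in each blk_update_request line can be recomputed from the CDB bytes of the failed write. A minimal sketch, assuming the standard SCSI WRITE(10) layout (opcode 0x2a, bytes 2-5 = LBA, bytes 7-8 = transfer length in blocks); the function name decode_write10 is made up for illustration:

```sh
#!/bin/sh
# Decode the CDB of a SCSI WRITE(10) as printed by the kernel ("2a 00 ...").
# Assumed layout: byte 0 = opcode, bytes 2-5 = LBA (big endian),
# bytes 7-8 = transfer length in logical blocks (big endian).
decode_write10() {
    # Split the space-separated hex bytes into positional parameters.
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "2a" ]; then
        echo "not a WRITE(10) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))   # bytes 2-5
    len=$(( 0x$8$9 ))       # bytes 7-8
    echo "lba=$lba blocks=$len"
}

# CDB of the tag#31 entry in the log above.
decode_write10 "2a 00 01 78 ca 28 00 00 18 00"   # prints: lba=24693288 blocks=24
```

The result matches the sector 24693288 reported for that write, so the kernel's sector numbers and the CDBs agree with each other.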
Looks essentially the same (if we strip the humongous amount of userspace errors)
Aug 20 05:06:52 1283 kernel: ata2.00: exception Emask 0x0 SAct 0x7ff8000 SErr 0x40000 action 0x6 frozen
Aug 20 05:07:02 1283 kernel: ata2: SError: { CommWake }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:78:d0:4a:3e/00:00:00:00:00/40 tag 15 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/10:80:a0:4b:3e/00:00:00:00:00/40 tag 16 ncq dma 8192 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:88:18:48:3e/00:00:00:00:00/40 tag 17 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:90:70:48:3e/00:00:00:00:00/40 tag 18 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:98:88:48:3e/00:00:00:00:00/40 tag 19 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/10:a0:a8:48:3e/00:00:00:00:00/40 tag 20 ncq dma 8192 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:a8:00:49:3e/00:00:00:00:00/40 tag 21 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/30:b0:b0:a3:da/00:00:0e:00:00/40 tag 22 ncq dma 24576 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:b8:a8:ac:ef/00:00:1a:00:00/40 tag 23 ncq dma 4096 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/18:c0:e0:c7:ef/00:00:1a:00:00/40 tag 24 ncq dma 12288 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:c8:08:c8:ef/00:00:1a:00:00/40 tag 25 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:d0:e8:50:04/00:00:03:00:00/40 tag 26 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2: hard resetting link
Aug 20 05:07:02 1283 kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 20 05:07:02 1283 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 20 05:07:02 1283 kernel: ata2.00: configured for UDMA/133
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2: EH complete
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=47s
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#15 CDB: Write(10) 2a 00 0e da ec 20 00 00 30 00
Aug 20 05:51:38 1283 kernel: blk_update_request: I/O error, dev sda, sector 249228320 op 0x1:(WRITE) flags 0x800 phys_seg 6 prio class 0
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=45s
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#13 CDB: Write(10) 2a 00 03 04 51 e0 00 00 08 00
Aug 20 05:51:38 1283 kernel: blk_update_request: I/O error, dev sda, sector 50614752 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 20 05:51:38 1283 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 3278818 starting block 6326845)
Aug 20 05:51:38 1283 kernel: Buffer I/O error on device sda2, logical block 6198588
Aug 20 05:51:38 1283 kernel: Aborting journal on device sda2-8.
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm nginx: Detected aborted journal
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #3277017: comm chromium: mark_inode_dirty error
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #3277017: comm ThreadPoolForeg: mark_inode_dirty error
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #3277017: comm HangWatcher: mark_inode_dirty error
Aug 20 05:51:38 1283 kernel: EXT4-fs (sda2): Remounting filesystem read-only
And doesn't have libata.noacpi…
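Side note on reading these libata exceptions: SAct is the SATA SActive bitmask, with one bit per outstanding NCQ tag, so it shows how many queued commands were in flight when the port froze. A small sketch in plain POSIX sh (sact_tags is a hypothetical helper name, not an existing tool):

```sh
#!/bin/sh
# Print the NCQ tags whose bits are set in an SAct bitmask (bit n = tag n).
sact_tags() {
    mask=$(( $1 ))
    tag=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            # Append the tag, space-separated.
            out="${out:+$out }$tag"
        fi
        mask=$(( mask >> 1 ))
        tag=$(( tag + 1 ))
    done
    echo "$out"
}

# SAct value from the exception line in the log above.
sact_tags 0x7ff8000
```

For 0x7ff8000 this yields tags 15 through 26, which lines up with the twelve failed WRITE FPDMA QUEUED entries (tag 15 to tag 26) in the log.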
Sorry, I wasn't clear. By "variant" I meant that the "failed command: WRITE FPDMA QUEUED" message occurs multiple times on this system, whereas it only ever occurred once on the others.
I added libata.noacpi and rebooted it. So far, so good.
So. All systems where I added libata.noacpi are still running. However:
Aug 19 22:14:27 archlinux kernel: Booting kernel: `' invalid for parameter `libata.noacpi'
So it's a classic case of placebo effect.
I'll try again with libata.noacpi=1
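To avoid a second placebo round, it may help to check after boot that the parameter was passed with a value and what the module actually ended up with. A rough sketch: cmdline_param is a made-up helper, and the sysfs path in the comments is an assumption based on the usual /sys/module/<module>/parameters/ layout.

```sh
#!/bin/sh
# Print the value of a kernel command line parameter.
# Prints "(no value)" if it is present without "=value"; returns 1 if absent.
#   $1 = parameter name, $2 = command line string
cmdline_param() {
    for word in $2; do
        case "$word" in
            "$1="*) echo "${word#*=}"; return 0 ;;
            "$1")   echo "(no value)"; return 0 ;;
        esac
    done
    return 1
}

# On a live system:
#   cmdline_param libata.noacpi "$(cat /proc/cmdline)"
#   cat /sys/module/libata/parameters/noacpi   # assumed path, see lead-in
```

A bare libata.noacpi shows up here as "(no value)", which is the case the kernel rejected in the boot message above.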
The issue happened again on system 1283 with libata.noacpi=1: journal
I now added both, libata.noacpi=1 and pcie_aspm=off, but I am running out of options.
Did you disable drive sleep on any of the failing systems ("hdparm -S 0 /dev/sda")?
Not yet. Should I do this manually?
What about doing this permanently?
Is a systemd-unit suitable?
PS:
0 ✓ 1283 ~ $ hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ADATA_IMSS316-256GD
Serial Number: 2L0929CK42XT
Firmware Revision: S180718b
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 500118192
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 244198 MBytes
device size with M = 1000*1000: 256060 MBytes (256 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* NOP cmd
* DOWNLOAD_MICROCODE
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* Device automatic Partial to Slumber transitions
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Device-initiated interface power management
* Software settings preservation
Device Sleep (DEVSLP)
* SMART Command Transport (SCT) feature set
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a1cd79
NAA : 5
IEEE OUI : 707c18
Unique ID : 100a1cd79
Device Sleep:
DEVSLP Exit Timeout (DETO): 100 ms (drive)
Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct
0 ✓ 1283 ~ $ smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.61-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ADATA_IMSS316-256GD
Serial Number: 2L0929CK42XT
LU WWN Device Id: 5 707c18 100a1cd79
Firmware Version: S180718b
User Capacity: 256.060.514.304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Aug 24 08:17:39 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 33) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 8537
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 16
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 098 098 010 Pre-fail Always - 9437202
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 21478572063
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 100 100 020 Pre-fail Always - 10040
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 12
194 Temperature_Celsius 0x0022 047 047 030 Old_age Always - 53 (Min/Max 22/65)
231 Unknown_SSD_Attribute 0x0033 099 099 005 Pre-fail Always - 1
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 14140108672
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1908391373120
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 10824013926
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 9534931110
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8480 -
# 2 Extended offline Completed without error 00% 8310 -
# 3 Extended offline Completed without error 00% 8142 -
# 4 Extended offline Completed without error 00% 7974 -
# 5 Extended offline Completed without error 00% 7806 -
# 6 Extended offline Completed without error 00% 7638 -
# 7 Extended offline Completed without error 00% 7470 -
# 8 Extended offline Completed without error 00% 7302 -
# 9 Extended offline Completed without error 00% 7134 -
#10 Extended offline Completed without error 00% 6966 -
#11 Extended offline Completed without error 00% 6798 -
#12 Extended offline Completed without error 00% 6631 -
#13 Extended offline Completed without error 00% 6464 -
#14 Extended offline Completed without error 00% 6297 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
0 ✓ 1283 ~ $ hdparm -S 0 /dev/sda
/dev/sda:
setting standby to 0 (off)
0 ✓ 1283 ~ $ hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: ADATA_IMSS316-256GD
Serial Number: 2L0929CK42XT
Firmware Revision: S180718b
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 500118192
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 244198 MBytes
device size with M = 1000*1000: 256060 MBytes (256 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* NOP cmd
* DOWNLOAD_MICROCODE
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* Device automatic Partial to Slumber transitions
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Device-initiated interface power management
* Software settings preservation
Device Sleep (DEVSLP)
* SMART Command Transport (SCT) feature set
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a1cd79
NAA : 5
IEEE OUI : 707c18
Unique ID : 100a1cd79
Device Sleep:
DEVSLP Exit Timeout (DETO): 100 ms (drive)
Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct
Last edited by schard (2022-08-24 06:19:45)
I tried a udev rule with hdparm -S 0 /dev/sda to no avail:
https://srv.richard-neumann.de/23.journal.4.txt
$ cat /usr/lib/udev/rules.d/99-ADATA_IMSS316-256GD.rules
ACTION=="add", SUBSYSTEM=="block", ATTRS{model}=="ADATA_IMSS316-25", RUN{program}+="/usr/bin/hdparm -S 0 /dev/$name"
Last edited by schard (2022-08-26 07:58:52)
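One thing worth verifying with a rule like this is that the match string equals what sysfs reports. The SCSI model attribute is limited to 16 characters, which would explain why the rule matches ADATA_IMSS316-25 rather than the full model name. A sketch under that assumption (scsi_model is a hypothetical helper; the udevadm calls in the comments are the usual way to inspect a live device):

```sh
#!/bin/sh
# Truncate a drive model name to the 16 characters that the sysfs
# "model" attribute (and therefore ATTRS{model}) can hold.
scsi_model() {
    printf '%.16s\n' "$1"
}

# Full model name as reported by hdparm -I earlier in the thread.
scsi_model "ADATA_IMSS316-256GD"   # prints: ADATA_IMSS316-25

# On a live system, compare against what udev sees and dry-run the rules:
#   udevadm info --attribute-walk --name=/dev/sda | grep -i model
#   udevadm test "$(udevadm info --query=path --name=/dev/sda)"
```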
Any chance that some userspace daemon sends it to sleep?
"hdparm -B 254/255"?
-B Get/set Advanced Power Management feature, if the drive supports it. A low value
means aggressive power management and a high value means better performance.
Possible settings range from values 1 through 127 (which permit spin-down), and
values 128 through 254 (which do not permit spin-down). The highest degree of power
management is attained with a setting of 1, and the highest I/O performance with a
setting of 254. A value of 255 tells hdparm to disable Advanced Power Management
altogether on the drive (not all drives support disabling it, but most do).
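Before relying on -B, it may be worth checking whether the drive advertises APM at all. A minimal sketch that classifies the feature from hdparm -I output, assuming hdparm lists it under the exact name "Advanced Power Management feature set" and marks enabled features with a leading asterisk (apm_state is a made-up helper name):

```sh
#!/bin/sh
# Classify APM support from `hdparm -I` output read on stdin.
apm_state() {
    line=$(grep 'Advanced Power Management feature set' || true)
    if [ -z "$line" ]; then
        # Feature not listed at all.
        echo "unsupported"
    elif printf '%s\n' "$line" | grep -q '^[[:space:]]*\*'; then
        # Listed with a leading "*": currently enabled.
        echo "enabled"
    else
        echo "supported, disabled"
    fi
}

# On a live system:
#   hdparm -I /dev/sda | apm_state
```

The hdparm -I output posted earlier only lists "Power Management feature set", so this would report "unsupported" for that drive.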
I'm not sure whether that worked
Aug 27 16:49:25 23 systemd-udevd[245]: sda: '/usr/bin/hdparm -B 254 /dev/sda'(out) ' setting Advanced Power Management level to 0xfe (254)'
Aug 27 16:49:25 23 systemd-udevd[234]: tty49: Device ready for processing (SEQNUM=2011, ACTION=add)
Aug 27 16:49:25 23 systemd-udevd[237]: tty48: /usr/lib/udev/rules.d/50-udev-default.rules:24 MODE 0620
Aug 27 16:49:25 23 systemd-udevd[246]: tty46: sd-device-monitor: Passed 210 byte to netlink monitor
Aug 27 16:49:25 23 systemd-udevd[245]: sda: '/usr/bin/hdparm -B 254 /dev/sda'(out) ' APM_level = not supported'
Same with 255 btw.
Update
Well, no, it didn't:
Aug 27 17:44:33 23 kernel: perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79400
Aug 27 17:45:00 23 kernel: ata2.00: exception Emask 0x0 SAct 0x7ff80 SErr 0x40000 action 0x6 frozen
Aug 27 17:45:00 23 kernel: ata2: SError: { CommWake }
Aug 27 17:45:00 23 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 27 17:45:00 23 kernel: ata2.00: cmd 61/08:38:08:96:ef/00:00:00:00:00/40 tag 7 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 27 17:45:00 23 kernel: ata2.00: status: { DRDY }
Aug 27 17:45:00 23 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 27 17:45:00 23 kernel: ata2.00: cmd 61/08:40:30:97:ef/00:00:00:00:00/40 tag 8 ncq dma 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Last edited by schard (2022-08-27 22:11:14)
This has happened to a regular, ordinary PC with a complete Arch installation: the root filesystem gets mounted ro, and of course nothing works because almost everything depends on being able to write. I can change the files on that drive while running another Arch-based system, Manjaro. I have added the rw option in fstab and it still boots with the filesystem read-only. Where else could something be causing it to mount that way? I have Arch on another computer and it runs very well, so something must be different between them, or missing on the one that doesn't work.
Last edited by DeTechTive (2022-09-19 14:41:21)
In the case of this topic, this is down to I/O errors which cause the FS to be remounted ro to prevent unreliable write operations.
I think we still assume that the cause is a failure of the drive to wake up from sleeping (not the OS, just the drive).
If you get similar errors in your dmesg, you need to check the disk integrity to make sure the HW isn't defective (smartctl -a), and otherwise you may likewise try to prevent the drive from resting.
However, the present topic is a rather special case; the vast majority of ro-remounts are caused by static bus or media errors (a bad cable, or the disk falling apart).
Changing the options in fstab was the answer to the problem on the ordinary PC. Everything is mounted rw now.
> Update
> Well, no, it didn't:
Richard, did you make any progress with this? I see something similar (inconsistent SATA timeout failures, "failed command: FLUSH CACHE EXT" errors, partitions remounted read-only). The problem is limited to a particular M.2 SSD (Apacer 512GB).
I would recommend simply replacing the drives, but of course we have several hundred in the field already!
Thanks,
Tim
Unfortunately no.
We're currently using a dirty workaround on the affected systems which periodically checks whether the FS is mounted ro and, if so, reboots the system.
But the underlying issue still persists.
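A workaround of the kind described above could look roughly like the following. This is only a sketch under assumptions, not the script deployed on those systems: root_is_ro is a hypothetical helper that parses a /proc/mounts-style file, and the reboot call in the comment is one possible way to trigger it from a timer or cron job.

```sh
#!/bin/sh
# Return 0 if the last entry for mount point "/" in a /proc/mounts-style
# file lists "ro" among its options, 1 otherwise.
#   $1 = file to read (defaults to /proc/mounts)
root_is_ro() {
    awk '$2 == "/" { opts = $4 }
         END {
             n = split(opts, o, ",")
             for (i = 1; i <= n; i++)
                 if (o[i] == "ro") exit 0
             exit 1
         }' "${1:-/proc/mounts}"
}

# Hypothetical periodic check:
#   root_is_ro && systemctl reboot
```

Comparing whole option fields means an rw mount with errors=remount-ro is not mistaken for a read-only one.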
> Unfortunately no.
> We're currently using a dirty workaround on the affected systems which periodically checks the FS whether it is mounted ro and reboots the system.
> But the underlying issue still persists.
Thanks for the update! Our SSD vendor (Apacer) have volunteered a firmware update so I'm crossing my fingers that it resolves our issue. It's a nasty one.
I'm also seeing corrupted filesystems when this event occurs, so a straight reboot doesn't necessarily help us.
> I'm also seeing corrupted filesystems when this event occurs, so a straight reboot doesn't necessarily help us.
We use the kernel parameter fsck.repair=yes which has proven to be quite robust and successfully repairs the fs (ext4) in like 99% of the cases.
Nearly forgot to conclude this.
After the hardware manufacturer for a long time denied all responsibility, they recently admitted that the SSDs were in fact defective.
I don't have any further details, but from the hearsay that reached me it was some kind of firmware bug that caused this on this specific type of SSD (ADATA_IMSS316-256GD).