
#26 2022-08-19 14:16:23

schard
Forum Moderator
From: Hannover
Registered: 2016-05-06
Posts: 2,074
Website

Re: [SOLVED] Spontaneous ro root file system on some machines

Thanks for all the tips. I think I'm gonna give libata.noacpi a try.


Where do you want to go today?


#27 2022-08-19 20:12:40

schard

System 23, whose SSD I manually sent to sleep, tripped again just now:

Aug 19 21:49:32 23 nginx[369]: 2022/08/19 21:49:32 [error] 369#369: *38 open() "/usr/share/application-html/pacman-lockfile" failed (2: No such file or directory), client: 127.0.0.1, server: loc>
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 01 78 ca 28 00 00 18 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693288 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086664)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958405
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958406
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958407
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#3 CDB: Write(10) 2a 00 01 78 cb a8 00 00 10 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693672 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086711)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958453
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958454
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 01 78 cb 88 00 00 08 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693640 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086706)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958449
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 01 78 ca e0 00 00 08 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693472 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086685)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958428
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=41s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 01 78 ca 48 00 00 10 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 24693320 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9308401 starting block 3086667)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958409
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 2958410
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#4 CDB: Write(10) 2a 00 1c 1e 78 c8 00 00 08 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 471759048 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 19 21:50:18 23 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 9306257 starting block 58969882)
Aug 19 21:50:18 23 kernel: Buffer I/O error on device sda2, logical block 58841625
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Aug 19 21:50:18 23 kernel: sd 1:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 0e d8 84 f8 00 00 28 00
Aug 19 21:50:18 23 kernel: blk_update_request: I/O error, dev sda, sector 249070840 op 0x1:(WRITE) flags 0x800 phys_seg 5 prio class 0
Aug 19 21:50:18 23 kernel: Aborting journal on device sda2-8.
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #9308401: comm CacheThread_Blo: mark_inode_dirty error
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #9308401: comm Chrome_IOThread: mark_inode_dirty error
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm chromium: Detected aborted journal
Aug 19 21:50:18 23 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 19 21:50:18 23 kernel: EXT4-fs (sda2): Remounting filesystem read-only

journal hdparm
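
A side note on reading these lines: the CDB bytes printed by the kernel can be decoded by hand. In a Write(10) command, bytes 2 to 5 hold the starting LBA and bytes 7 to 8 the transfer length, both big-endian, which is why the first CDB above lines up with "sector 24693288" in the blk_update_request line that follows it. A minimal sketch in POSIX shell (the function name decode_write10 is made up for illustration):

```sh
# Decode the LBA and sector count from a Write(10) CDB as printed in the kernel log.
# decode_write10 is a hypothetical helper, not part of any existing tool.
decode_write10() (
    set -- $1                     # split the CDB string into its ten bytes
    if [ "$#" -ne 10 ] || [ "$1" != 2a ]; then
        echo "not a Write(10) CDB" >&2
        exit 1
    fi
    lba=$(( 0x$3$4$5$6 ))         # bytes 2..5: logical block address
    count=$(( 0x$8$9 ))           # bytes 7..8: transfer length in sectors
    echo "lba=$lba sectors=$count"
)

decode_write10 "2a 00 01 78 ca 28 00 00 18 00"   # prints: lba=24693288 sectors=24
```

Applied to the tag#7 CDB (2a 00 0e d8 84 f8 00 00 28 00), the same decoding gives lba=249070840, which matches the last I/O error line before the journal abort.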

System 1284 on the other hand is still running with libata.noacpi.
journal hdparm


I will add libata.noacpi to 23 and monitor their further behaviour over the weekend.

Last edited by schard (2022-08-19 20:21:38)



#28 2022-08-21 15:39:35

schard

I now have a new variant of the error on another system: journal



#29 2022-08-21 15:55:14

seth
Member
Registered: 2012-09-03
Posts: 56,092

Looks essentially the same (if we strip the humongous amount of userspace errors)

Aug 20 05:06:52 1283 kernel: ata2.00: exception Emask 0x0 SAct 0x7ff8000 SErr 0x40000 action 0x6 frozen
Aug 20 05:07:02 1283 kernel: ata2: SError: { CommWake }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:78:d0:4a:3e/00:00:00:00:00/40 tag 15 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/10:80:a0:4b:3e/00:00:00:00:00/40 tag 16 ncq dma 8192 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:88:18:48:3e/00:00:00:00:00/40 tag 17 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:90:70:48:3e/00:00:00:00:00/40 tag 18 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:98:88:48:3e/00:00:00:00:00/40 tag 19 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/10:a0:a8:48:3e/00:00:00:00:00/40 tag 20 ncq dma 8192 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:a8:00:49:3e/00:00:00:00:00/40 tag 21 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/30:b0:b0:a3:da/00:00:0e:00:00/40 tag 22 ncq dma 24576 out
                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:b8:a8:ac:ef/00:00:1a:00:00/40 tag 23 ncq dma 4096 out
                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/18:c0:e0:c7:ef/00:00:1a:00:00/40 tag 24 ncq dma 12288 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:c8:08:c8:ef/00:00:1a:00:00/40 tag 25 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 20 05:07:02 1283 kernel: ata2.00: cmd 61/08:d0:e8:50:04/00:00:03:00:00/40 tag 26 ncq dma 4096 out
                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 20 05:07:02 1283 kernel: ata2.00: status: { DRDY }
Aug 20 05:07:02 1283 kernel: ata2: hard resetting link
Aug 20 05:07:02 1283 kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 20 05:07:02 1283 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 20 05:07:02 1283 kernel: ata2.00: configured for UDMA/133
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2.00: device reported invalid CHS sector 0
Aug 20 05:07:02 1283 kernel: ata2: EH complete
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=47s
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#15 CDB: Write(10) 2a 00 0e da ec 20 00 00 30 00
Aug 20 05:51:38 1283 kernel: blk_update_request: I/O error, dev sda, sector 249228320 op 0x1:(WRITE) flags 0x800 phys_seg 6 prio class 0
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=45s
Aug 20 05:51:38 1283 kernel: sd 1:0:0:0: [sda] tag#13 CDB: Write(10) 2a 00 03 04 51 e0 00 00 08 00
Aug 20 05:51:38 1283 kernel: blk_update_request: I/O error, dev sda, sector 50614752 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 20 05:51:38 1283 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 3278818 starting block 6326845)
Aug 20 05:51:38 1283 kernel: Buffer I/O error on device sda2, logical block 6198588
Aug 20 05:51:38 1283 kernel: Aborting journal on device sda2-8.
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm nginx: Detected aborted journal
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #3277017: comm chromium: mark_inode_dirty error
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #3277017: comm ThreadPoolForeg: mark_inode_dirty error
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5937: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5740: Journal has aborted
Aug 20 05:51:38 1283 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5936: inode #3277017: comm HangWatcher: mark_inode_dirty error
Aug 20 05:51:38 1283 kernel: EXT4-fs (sda2): Remounting filesystem read-only

And doesn't have libata.noacpi…
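
For anyone comparing such dumps: SAct in the exception line is the bitmask of NCQ tags that were still outstanding when the port froze, so it should correspond to the tags of the failed commands listed after it. A small sketch to expand the mask (decode_sact is a hypothetical helper name, not an existing tool):

```sh
# Expand an SAct bitmask into the list of NCQ tags whose bits are set.
decode_sact() (
    mask=$(( $1 ))
    tag=0
    tags=""
    while [ "$tag" -lt 32 ]; do
        if [ $(( ($mask >> $tag) & 1 )) -eq 1 ]; then
            tags="${tags:+$tags }$tag"
        fi
        tag=$(( $tag + 1 ))
    done
    echo "$tags"
)

decode_sact 0x7ff8000   # prints: 15 16 17 18 19 20 21 22 23 24 25 26
```

That is the same set of tags (15 through 26) as in the failed WRITE FPDMA QUEUED commands quoted above.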


#30 2022-08-21 18:49:41

schard

Sorry, I wasn't clear. By "variant" I meant that the "failed command: WRITE FPDMA QUEUED" message occurs multiple times on this system, whereas it only ever occurred once on the others.
I added libata.noacpi and rebooted it. So far, so good.



#31 2022-08-22 08:12:26

schard

So. All systems where I added libata.noacpi are still running. However:

Aug 19 22:14:27 archlinux kernel: Booting kernel: `' invalid for parameter `libata.noacpi'

So it's a classic case of placebo effect.
I'll try again with libata.noacpi=1
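
The warning above suggests that libata.noacpi expects an explicit value, so a bare "libata.noacpi" is rejected and changes nothing. One way to tell the two cases apart is to look at how the parameter appears on the kernel command line. A minimal sketch (cmdline_param is a hypothetical helper, and the sample command lines are invented):

```sh
# Report how parameter $2 appears on the kernel command line $1:
#   "value:<v>" for name=<v>, "bare" for the name without a value, "absent" otherwise.
cmdline_param() (
    set -f                        # no pathname expansion while splitting into words
    name=$2
    state=absent
    for word in $1; do
        case "$word" in
            "$name"=*) state="value:${word#*=}" ;;
            "$name")   state=bare ;;
        esac
    done
    echo "$state"
)

cmdline_param "root=/dev/sda2 rw libata.noacpi" libata.noacpi     # prints: bare
cmdline_param "root=/dev/sda2 rw libata.noacpi=1" libata.noacpi   # prints: value:1
```

On a running system, pass "$(cat /proc/cmdline)" as the first argument. Assuming the running kernel exposes the parameter in sysfs, /sys/module/libata/parameters/noacpi should show the value that was actually applied.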



#32 2022-08-24 06:10:15

schard

The issue happened again on system 1283 with libata.noacpi=1: journal
I have now added both libata.noacpi=1 and pcie_aspm=off, but I am running out of options.



#33 2022-08-24 06:17:21

seth

Did you disable drive sleep on any of the failing systems ("hdparm -S 0 /dev/sda")?


#34 2022-08-24 06:18:45

schard

Not yet. Should I do this manually?
What about doing this permanently?
Is a systemd-unit suitable?

PS:

0 ✓ 1283 ~ $ hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       ADATA_IMSS316-256GD                     
	Serial Number:      2L0929CK42XT        
	Firmware Revision:  S180718b
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Supported: 9 8 7 6 5 
	Likely used: 9
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:    16514064
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:   500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Form Factor: 2.5 inch
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	   *	48-bit Address feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	Device automatic Partial to Slumber transitions
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	   *	DMA Setup Auto-Activate optimization
	   *	Device-initiated interface power management
	   *	Software settings preservation
	    	Device Sleep (DEVSLP)
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a1cd79
	NAA		: 5
	IEEE OUI	: 707c18
	Unique ID	: 100a1cd79
Device Sleep:
	DEVSLP Exit Timeout (DETO): 100 ms (drive)
	Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct
0 ✓ 1283 ~ $ smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.61-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA_IMSS316-256GD
Serial Number:    2L0929CK42XT
LU WWN Device Id: 5 707c18 100a1cd79
Firmware Version: S180718b
User Capacity:    256.060.514.304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 24 08:17:39 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   33) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   2) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0039)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       8537
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       16
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   098   098   010    Pre-fail  Always       -       9437202
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       21478572063
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   020    Pre-fail  Always       -       10040
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       12
194 Temperature_Celsius     0x0022   047   047   030    Old_age   Always       -       53 (Min/Max 22/65)
231 Unknown_SSD_Attribute   0x0033   099   099   005    Pre-fail  Always       -       1
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       14140108672
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1908391373120
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       10824013926
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       9534931110

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8480         -
# 2  Extended offline    Completed without error       00%      8310         -
# 3  Extended offline    Completed without error       00%      8142         -
# 4  Extended offline    Completed without error       00%      7974         -
# 5  Extended offline    Completed without error       00%      7806         -
# 6  Extended offline    Completed without error       00%      7638         -
# 7  Extended offline    Completed without error       00%      7470         -
# 8  Extended offline    Completed without error       00%      7302         -
# 9  Extended offline    Completed without error       00%      7134         -
#10  Extended offline    Completed without error       00%      6966         -
#11  Extended offline    Completed without error       00%      6798         -
#12  Extended offline    Completed without error       00%      6631         -
#13  Extended offline    Completed without error       00%      6464         -
#14  Extended offline    Completed without error       00%      6297         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
0 ✓ 1283 ~ $ hdparm -S 0 /dev/sda

/dev/sda:
 setting standby to 0 (off)
0 ✓ 1283 ~ $ hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       ADATA_IMSS316-256GD                     
	Serial Number:      2L0929CK42XT        
	Firmware Revision:  S180718b
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Supported: 9 8 7 6 5 
	Likely used: 9
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:    16514064
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:   500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Form Factor: 2.5 inch
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	   *	48-bit Address feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	Device automatic Partial to Slumber transitions
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	   *	DMA Setup Auto-Activate optimization
	   *	Device-initiated interface power management
	   *	Software settings preservation
	    	Device Sleep (DEVSLP)
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a1cd79
	NAA		: 5
	IEEE OUI	: 707c18
	Unique ID	: 100a1cd79
Device Sleep:
	DEVSLP Exit Timeout (DETO): 100 ms (drive)
	Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct

Last edited by schard (2022-08-24 06:19:45)



#35 2022-08-26 07:53:35

schard

I tried a udev rule with hdparm -S 0 /dev/sda to no avail:
https://srv.richard-neumann.de/23.journal.4.txt

$ cat /usr/lib/udev/rules.d/99-ADATA_IMSS316-256GD.rules 
ACTION=="add", SUBSYSTEM=="block", ATTRS{model}=="ADATA_IMSS316-25", RUN{program}+="/usr/bin/hdparm -S 0 /dev/$name"
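
Note that the rule matches on "ADATA_IMSS316-25" rather than the full model name. The SCSI layer exposes the model through a 16-character field, so the sysfs attribute holds only the first 16 characters of "ADATA_IMSS316-256GD". A one-line sketch of that truncation (scsi_model is a made-up name):

```sh
# Truncate an ATA model string to the 16 characters visible in the SCSI "model" attribute.
scsi_model() {
    printf '%.16s\n' "$1"
}

scsi_model "ADATA_IMSS316-256GD"   # prints: ADATA_IMSS316-25
```

The value udev actually sees can be checked with "udevadm info --attribute-walk --name=/dev/sda".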

Last edited by schard (2022-08-26 07:58:52)



#36 2022-08-26 11:24:24

seth

Any chance that some userspace daemon sends it to sleep?
"hdparm -B 254/255"?

-B     Get/set Advanced Power Management feature, if the drive supports it. A low value
              means aggressive power management and a high value means better performance.
              Possible settings range from values 1 through 127 (which permit spin-down), and
              values 128 through 254 (which do not permit spin-down). The highest degree of
              power management is attained with a setting of 1, and the highest I/O performance
              with a setting of 254. A value of 255 tells hdparm to disable Advanced Power
              Management altogether on the drive (not all drives support disabling it, but most do).


#37 2022-08-27 15:15:22

schard

I'm not sure whether that worked

Aug 27 16:49:25 23 systemd-udevd[245]: sda: '/usr/bin/hdparm -B 254 /dev/sda'(out) ' setting Advanced Power Management level to 0xfe (254)'
Aug 27 16:49:25 23 systemd-udevd[234]: tty49: Device ready for processing (SEQNUM=2011, ACTION=add)
Aug 27 16:49:25 23 systemd-udevd[237]: tty48: /usr/lib/udev/rules.d/50-udev-default.rules:24 MODE 0620
Aug 27 16:49:25 23 systemd-udevd[246]: tty46: sd-device-monitor: Passed 210 byte to netlink monitor
Aug 27 16:49:25 23 systemd-udevd[245]: sda: '/usr/bin/hdparm -B 254 /dev/sda'(out) ' APM_level        = not supported'

Same with 255 btw.
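
The "not supported" reply is consistent with the hdparm -I output in #34, which lists the "Power Management feature set" but no "Advanced Power Management feature set". A sketch for checking this before trying -B, assuming hdparm prints exactly that feature name when APM is available (supports_apm is a hypothetical helper):

```sh
# Succeed if the "hdparm -I" text on stdin lists the APM feature set.
# Assumption: hdparm names the feature "Advanced Power Management feature set".
supports_apm() {
    grep -q 'Advanced Power Management feature set'
}

# Intended use on a real system (not run here):
#   hdparm -I /dev/sda | supports_apm && echo "APM available"
```

The plain "Power Management feature set" line does not match, because the pattern requires the word "Advanced" in front of it.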

Update
Well, no, it didn't:

Aug 27 17:44:33 23 kernel: perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79400
Aug 27 17:45:00 23 kernel: ata2.00: exception Emask 0x0 SAct 0x7ff80 SErr 0x40000 action 0x6 frozen
Aug 27 17:45:00 23 kernel: ata2: SError: { CommWake }
Aug 27 17:45:00 23 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 27 17:45:00 23 kernel: ata2.00: cmd 61/08:38:08:96:ef/00:00:00:00:00/40 tag 7 ncq dma 4096 out
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 27 17:45:00 23 kernel: ata2.00: status: { DRDY }
Aug 27 17:45:00 23 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Aug 27 17:45:00 23 kernel: ata2.00: cmd 61/08:40:30:97:ef/00:00:00:00:00/40 tag 8 ncq dma 4096 out
                                    res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Last edited by schard (2022-08-27 22:11:14)



#38 2022-09-19 14:18:44

DeTechTive
Member
Registered: 2022-09-19
Posts: 4

This has happened on an ordinary PC with a complete Arch installation: everything is in place, but the root file system gets mounted ro, and of course nothing can be done because almost everything depends on being able to write. I can change the files on that drive while running another Arch-based system, Manjaro. I have added the rw option in fstab and it still boots with the file system read-only. What else could be causing it to mount that way? I have Arch on another computer and it runs very well, so there must be some difference between them, something missing on the one that doesn't work.

Last edited by DeTechTive (2022-09-19 14:41:21)


#39 2022-09-19 14:55:28

seth

In the case of this topic, this is down to I/O errors, which cause the FS to be remounted ro to prevent unreliable write operations.
I think we still assume that the cause is a failure of the drive to wake up from sleep (not the OS, just the drive).

If you get similar errors in your dmesg, you need to check the disk integrity to make sure the HW isn't defective (smartctl -a); otherwise you may likewise try to prevent the drive from going to sleep.
However, the present case is rather special; the vast majority of ro remounts are caused by static bus or media errors (the cable, or the disk falling apart).
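
The error signatures quoted in this thread can be searched for in the kernel log. A rough sketch that counts matching lines (the pattern list is taken from the logs in this topic and is not exhaustive; count_disk_errors is a made-up name):

```sh
# Count kernel log lines on stdin that match the failure signatures seen in this thread.
count_disk_errors() {
    # grep -c still prints 0 when nothing matches; "|| true" hides its non-zero status
    grep -cE 'failed command: |hostbyte=DID_TIME_OUT|I/O error, dev |Remounting filesystem read-only' || true
}

# Intended use on a real system (not run here):
#   journalctl -k -b | count_disk_errors
```

A non-zero count for the current boot would be the cue to look at "smartctl -a" for the affected disk.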


#40 2022-09-20 18:15:09

DeTechTive

Changing the options in fstab was the answer to the problem on the ordinary PC. Everything is mounted rw now.


#41 2023-05-19 15:11:30

timmins
Member
Registered: 2023-05-19
Posts: 2

schard wrote:

Update
Well, no, it didn't:

Richard, did you make any progress with this? I see something similar (inconsistent SATA timeout failures, "failed command: FLUSH CACHE EXT" errors, partitions remounted read-only). The problem is limited to a particular M.2 SSD (Apacer 512GB).

I would recommend simply replacing the drives, but of course we have several hundred in the field already!

Thanks,
Tim


#42 2023-05-19 17:17:01

schard

Unfortunately no.
We're currently using a dirty workaround on the affected systems which periodically checks whether the FS is mounted ro and, if so, reboots the system.
But the underlying issue still persists.
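
For illustration only, such a check could look roughly like the following. This is a sketch under assumptions, not the script actually deployed here: the function names are made up, and rebooting via "systemctl reboot" is just one possible choice.

```sh
# Succeed if mount point $1 carries the "ro" option in mounts table $2
# (default: /proc/self/mounts). The last entry for the mount point wins.
mount_is_ro() (
    target=$1
    table=${2:-/proc/self/mounts}
    result=1
    while read -r dev mnt fstype opts rest; do
        [ "$mnt" = "$target" ] || continue
        case ",$opts," in
            *,ro,*) result=0 ;;
            *)      result=1 ;;
        esac
    done < "$table"
    [ "$result" -eq 0 ]
)

# Reboot only if the root file system has gone read-only.
# Meant to be called periodically, e.g. from a systemd timer or cron job.
reboot_if_root_ro() {
    if mount_is_ro / "${1:-/proc/self/mounts}"; then
        systemctl reboot
    fi
}
```

Options such as errors=remount-ro do not trigger a false positive, because only a standalone "ro" entry in the comma-separated option list is matched.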



#43 2023-05-22 21:16:05

timmins

schard wrote:

Unfortunately no.
We're currently using a dirty workaround on the affected systems which periodically checks the FS whether it is mounted ro and reboots the system.
But the underlying issue still persists.

Thanks for the update! Our SSD vendor (Apacer) have volunteered a firmware update so I'm crossing my fingers that it resolves our issue. It's a nasty one.

I'm also seeing corrupted filesystems when this event occurs, so a straight reboot doesn't necessarily help us.


#44 2023-05-22 22:14:14

schard

timmins wrote:

I'm also seeing corrupted filesystems when this event occurs, so a straight reboot doesn't necessarily help us.

We use the kernel parameter fsck.repair=yes which has proven to be quite robust and successfully repairs the fs (ext4) in like 99% of the cases.



#45 2023-08-31 20:50:30

schard

Nearly forgot to conclude this.
After denying all responsibility for a long time, the hardware manufacturer recently admitted that the SSDs were in fact defective.
I don't have any further details, but from the hearsay that reached me it was some kind of firmware bug that caused this on this specific type of SSD (ADATA_IMSS316-256GD).

