[SOLVED] Spontaneous ro root file system on some machines

schard · 2022-07-13 14:06:00

I have the issue that on some machines, the root file system is re-mountet read-only after some time.
The systems are digital signage systems with industry hardware, that run 24/7.
As an example I picked out one of the candidates:

0 ✓ 1261 ~ $ mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
dev on /dev type devtmpfs (rw,nosuid,relatime,size=1930676k,nr_inodes=482669,mode=755,inode64)
run on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755,inode64)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
/dev/sda2 on / type ext4 (ro,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11553)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=1955248k,nr_inodes=1048576,inode64)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
none on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/sda1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=391048k,nr_inodes=97762,mode=700,inode64)

0 ✓ 1261 ~ $ cat /etc/fstab 
# /dev/sda2 UUID=a1982c7a-261b-451b-8dce-570e69bcf568
LABEL=root          	/         	ext4      	rw,relatime	0 1

# /dev/sda1 UUID=2290-764C
LABEL=EFI           	/boot     	vfat      	rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro	0 2

0 ✓ 1261 ~ $ LANG=C df -h
Filesystem      Size  Used Avail Use% Mounted on
dev             1.9G     0  1.9G   0% /dev
run             1.9G  820K  1.9G   1% /run
/dev/sda2       234G  7.0G  215G   4% /
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G     0  1.9G   0% /tmp
/dev/sda1       499M   53M  447M  11% /boot
tmpfs           382M     0  382M   0% /run/user/0

dmesg
journalctl -b

The interesting part seems to be:

Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 0e f3 48 28 00 00 18 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 250824744 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 00 e2 ea e0 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871264 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858909)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730652
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 00 e2 ea 18 00 00 18 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871064 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858886)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730627
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730628
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730629
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 00 e2 e9 08 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14870792 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858850)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730593
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#23 CDB: Write(10) 2a 00 00 e2 eb 88 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871432 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858930)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730673
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 00 e2 eb 00 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871296 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858913)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730656
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 e2 e9 d0 00 00 10 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14870992 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858876)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730618
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730619
Jun 22 22:20:28 1261 kernel: Aborting journal on device sda2-8.
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5731: Journal has aborted
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5927: inode #524975: comm chromium: mark_inode_dirty error
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5928: Journal has aborted
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Jun 22 22:20:28 1261 kernel: EXT4-fs (sda2): Remounting filesystem read-only
Jun 22 22:21:06 1261 sshd[73397]: error: kex_exchange_identification: read: Connection reset by peer

Does anybody have an idea why this is happening?
According to smartctl, the disk is fine:

$ smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.45-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA_IMSS316-256GD
Serial Number:    2K422L127DKD
LU WWN Device Id: 5 707c18 100985dfd
Firmware Version: S180718b
User Capacity:    256.060.514.304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jul 13 16:05:38 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249)	Self-test routine in progress...
					90% of test remaining.
Total time to complete Offline 
data collection: 		(   33) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   2) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0039)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       8442
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       9
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   099   099   010    Pre-fail  Always       -       4194312
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       12885229571
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   020    Pre-fail  Always       -       10120
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   055   055   030    Old_age   Always       -       45 (Min/Max 21/61)
231 Unknown_SSD_Attribute   0x0033   100   100   005    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       1514614464
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3033051297024
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       542263652
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       238273046

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8384         -
# 2  Extended offline    Completed without error       00%      8206         -
# 3  Extended offline    Completed without error       00%      8028         -
# 4  Extended offline    Completed without error       00%      7844         -
# 5  Extended offline    Completed without error       00%      7676         -
# 6  Extended offline    Completed without error       00%      7508         -
# 7  Extended offline    Completed without error       00%      7341         -
# 8  Extended offline    Completed without error       00%      7173         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Last edited by schard (2023-08-31 20:50:45)

ewaller · 2022-07-13 14:19:32

The change to read only is usually an indication of a I/O error, which it appears has happened in your case.
Are these SSDs or rotating disks? Are these indoor or outdoor displays? I understand ambient temperatures have been near 30C in (what I assume) is your part of the planet. With solar loading and normal operating temperature rise you could well be bumping your head on the maximum operating temperatures for commercial rated equipment. Heck, around here with solar loading our passively cooled systems can blow through the operating temperature limits of industrial rated drives

Last edited by ewaller (2022-07-13 14:19:57)

schard · 2022-07-13 14:24:50

The systems are mounted indoors.
They are tenant information systems (digital black boards) mounted on the walls near the entry of stair cases in apartment buildings.

seth · 2022-07-13 14:49:20

* cable
* cable
* power supply
* cable
* power saving
* cable

If this happens somewhat predictably, can you prevent it by keeping the drive busy, eg. frequently read random files (to avoid the cache) into /dev/null or so?
Or, delay the low-power (hdparm -S 251, check the manpage on the parameter) or maybe vv. trigger the behavior by sending the drive to sleep.
Also keep in mind any power management daemons - and aspm might spoil you if it's the bus.

Did I mention the cable?

ewaller · 2022-07-13 15:18:09

Seth, I agree about the cable. What I am hung up on is that this is multiple machines. That leads me to want to look for something systemic.

Schard, how long have these been fielded? Were they all fielded at the same time, or where they installed progressively over time?

schard · 2022-07-14 08:32:11

@seth I assume with "cable" you are referring to the connection between the mainboard and the SSD?
The SSD is an mSATA SSD and is thusly directly plugged into a socket on the mainboard.
While it is unlikely that dust has come between the contacts and plug, I will check this out once the device has been brought in.
I will also check any possible power saving settings.
However, we have hundreds of identical devices (all running Arch Linux) out in the field and this happens only on a tiny fraction of those.

Another problem is, that after issuing a reboot, the devices do not boot anymore, but usually hang with this error message.

@ewaller
The systems have been out in the field for different amounts of time. Some are new, some are older. The system in question is rather new and has been deployed for about 2 1/2 years.

Last edited by schard (2022-07-14 08:33:13)

ewaller · 2022-07-14 14:11:39

This is a clear case of bitrot

What fixes them? Reinstalling? Reformatting? Replacing the drive?
My hunch remains is that it is systemic and is environmental, though I will never discount cables. I should mention, cables are the bane of every engineer's existence, myself included.
If is is not thermal, perhaps EMC/EMI? Unlikely to be shock, vibration, humidity, sand, dust, fungal growth or corrosion due to salt fog.

What do the power supplies look like? I am sure you are using stuff that meet the CE directives for conducted susceptibility for low voltage systems -- but mains power can get ugly this time of year too. Any signs other systems are spontaneously rebooting? Any evidence of power failures across the network?

seth · 2022-07-14 14:28:14

The SSD is an mSATA SSD and is thusly directly plugged into a socket on the mainboard.
While it is unlikely that dust has come between the contacts and plug, I will check this out once the device has been brought in.

For mSATAs, the demon is usually the securing screw/s
They can end up bending the card out of the slot and certainly get it under a lot of tension. You might even have lost the slot, by the card ripping it off the mainboard.
If all devices are the same, this could come down to how tight the screw was fixed - or whether it's canted.

However, we have hundreds of identical devices (all running Arch Linux) out in the field and this happens only on a tiny fraction of those.

Suggests it's the HW, not the SW.

schard · 2022-07-15 08:29:35

ewaller wrote:

What fixes them? Reinstalling? Reformatting? Replacing the drive?

Indeed in many cases a re-installation of the whole OS fixes the issue and the systems run again fine with the same disks.
This it what baffles me. Of course also the hardware manufacturer uses this fact as an excuse that the error must be software-related and not related to their hardware.

ewaller wrote:

What do the power supplies look like?

The power supplies are built into the systems and are not accessible from the access cover for the main board.
I'll have to ask the hardware manufacturer for details.

ewaller wrote:

Any signs other systems are spontaneously rebooting?

None that I know of. The systems run fine 24/7 without reboots.
The only reboots are done on the more or less regular "patch days".

ewaller wrote:

Any evidence of power failures across the network?

Unfortunately it is hard to come by this kind of information.
The systems are spread all over the country and power grid failures are a fairly seldom occasion here in Germany and mostly a local occurence.
I know that we had down times in the past due to power failures in at least two cities, but none of them resulted in defects like this.
The systems just booted up without any issues once power was restored.

seth wrote:

For mSATAs, the demon is usually the securing screw/s

I will definitely check this out.
I expect the system to be brought in within the upcoming week.

schard · 2022-08-10 16:44:01

So. After analyzing a faulty system this week, I found out, by just swapping around SSDs, that the SSD of the faulty system is defective. It produces the same errors when plugged into another system and the faulty system runs fine with another SSD.

ewaller · 2022-08-10 17:17:24

Glad you got it sorted, but I thought you had said 'some' machines? That led me to look for something more systemic.

schard · 2022-08-10 22:16:19

Yes there are more, but I doubt I will get to have a look at them, because, well, costs.

schard · 2022-08-12 15:14:11

I have to amend, that this topic is not yet solved.
We decided to test the disk further.
On the very same system, Windows 10 runs seamlessly.
Chkdsk does not detect any errors with the disk during an extended check (chkdsk /R).
Also the ADATA SSD ToolBox finds nothing wrong with the drive in either the short or extended test.
I am starting to consider a kernel bug.

seth · 2022-08-12 15:26:27

seth wrote:

Also keep in mind any power management daemons - and aspm might spoil you if it's the bus.

Tried "pcie_aspm=off" ?

schard · 2022-08-12 16:09:42

Not yet, thanks for the reminder. Will try on Monday.

schard · 2022-08-15 20:55:48

I tested the SSD with

# shred -z /dev/sda

from a live system, which terminated after 2 1/2 hours or so without any errors (exit code 0, no output)
No I/O errors so far.
Some more details:
dmesg
journalctl -b
/proc/cmdline
lspci -nnk
hdparm -I /dev/sda
smartctl -a /dev/sda

Update
I reinstalled our most recent system image.
It ran for over 2 hours without any issues.
I tried downgrading to kernel 5.15.45-1-lts, with which we initially had the issue.
So far the system even runs on this kernel without any issues: http://ix.io/47L8
I am absolutely baffled as to why the I/O errors suddenly disappeared.

Update 2
I got the same general issue on three other systems.
They have an additional error message before the I/O error occurs:

Aug 16 06:33:11 1269 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Aug 16 06:33:13 1269 kernel: ata2: SError: { CommWake }
Aug 16 06:33:13 1269 kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 16 06:33:13 1269 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 16 06:33:13 1269 kernel: ata2.00: status: { DRDY }
Aug 16 06:33:13 1269 kernel: ata2: hard resetting link
Aug 16 06:33:13 1269 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 16 06:33:13 1269 kernel: ata2.00: configured for UDMA/133
Aug 16 06:33:13 1269 kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 16 06:33:13 1269 kernel: ata2: EH complete
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=49s
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 0e f1 af 10 00 00 58 00
Aug 16 07:09:35 1269 kernel: blk_update_request: I/O error, dev sda, sector 250720016 op 0x1:(WRITE) flags 0x800 phys_seg 11 prio class 0
Aug 16 07:09:35 1269 kernel: Aborting journal on device sda2-8.
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_do_update_inode:5127: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm Chrome_IOThread: mark_inode_dirty error
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: IO failure
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm MemoryInfra: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm chromium: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm chromium: Detected aborted journal
Aug 16 07:09:36 1269 kernel: EXT4-fs (sda2): Remounting filesystem read-only

If I interpret this correctly, there is some issue with waking up the disk from a suspended state.
Time to finally try Seth's advise.

Last edited by schard (2022-08-16 13:58:26)

schard · 2022-08-18 09:05:01

So, after a few more tests, I found out, that the issue only occurs on systems with SSDs of the type ADATA_IMSS316-256GD.
In some cases, before the I/O errors happen, there are the aforementioned ata2 errors:

Aug 16 06:33:11 1269 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Aug 16 06:33:13 1269 kernel: ata2: SError: { CommWake }
Aug 16 06:33:13 1269 kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 16 06:33:13 1269 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 16 06:33:13 1269 kernel: ata2.00: status: { DRDY }
Aug 16 06:33:13 1269 kernel: ata2: hard resetting link
Aug 16 06:33:13 1269 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 16 06:33:13 1269 kernel: ata2.00: configured for UDMA/133
Aug 16 06:33:13 1269 kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 16 06:33:13 1269 kernel: ata2: EH complete

however, on other system, those do not occur and the journal only shows the I/O errors with the subsequent ro-remounts:

Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=49s
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 0e f1 af 10 00 00 58 00
Aug 16 07:09:35 1269 kernel: blk_update_request: I/O error, dev sda, sector 250720016 op 0x1:(WRITE) flags 0x800 phys_seg 11 prio class 0
Aug 16 07:09:35 1269 kernel: Aborting journal on device sda2-8.
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_do_update_inode:5127: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm Chrome_IOThread: mark_inode_dirty error
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: IO failure
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm MemoryInfra: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm chromium: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm chromium: Detected aborted journal
Aug 16 07:09:36 1269 kernel: EXT4-fs (sda2): Remounting filesystem read-only

In all cases I could salvage the FS and prevent a hangup upon (re-)boot via fsck.
An example (with German locales, sorry) here:

$ fsck -y /dev/sda2
fsck von util-linux 2.38.1
e2fsck 1.46.5 (30-Dec-2021)
root: Journal wird wiederhergestellt
root enthält ein fehlerhaftes Dateisystem, Prüfung erzwungen.
Durchgang 1: Inodes, Blöcke und Größen werden geprüft
Inodes wurden gefunden, die Teil einer defekten verketteten Liste von
verwaisten Inodes waren.  Reparieren? ja

Inode 15204591 war Teil der Liste verwaister Inodes.  REPARIERT.
dtime für gelöschten Inode 15204653 ist Null.  Reparieren? ja

Durchgang 2: Verzeichnisstruktur wird geprüft
Durchgang 3: Verzeichnisverknüpfungen werden geprüft
Durchgang 4: Referenzzähler werden überprüft
Durchgang 5: Zusammengefasste Gruppeninformation wird geprüft
Unterschiede in der Block-Bitmap:  -(6877184--6878207) -48955393
Reparieren? ja

Die Anzahl freier Blöcke in Gruppe #209 ist falsch (24781, gezählt=25805).
Reparieren? ja

Die Anzahl freier Blöcke in Gruppe #1494 ist falsch (32757, gezählt=32758).
Reparieren? ja

Die Anzahl freier Blöcke ist falsch (59706261, gezählt=59655030).
Reparieren? ja

Unterschiede in der Inode-Bitmap:  -15204591 -15204653
Reparieren? ja

Die Anzahl freier Inodes für Gruppe #1856 ist falsch (4416, gezählt=4418).
Reparieren? ja

Die Anzahl der Verzeichnisse für Gruppe #1856 ist falsch (568, gezählt=567).
Reparieren? ja

Die Anzahl freier Inodes ist falsch (15505893, gezählt=15505798).
Reparieren? ja


root: ***** DATEISYSTEM WURDE VERÄNDERT *****
root: ***** LINUX MUSS NEU GESTARTET WERDEN *****
root: 91770/15597568 Dateien (0.2% nicht zusammenhängend), 2731483/62386513 Blöcke

On two systems, the errors disappeared with the new kernel parameter pcie_aspm=off.
On another system the I/O errors without the preceding ata2 errors recurred after adding the parameter.
I am at a loss as to what causes the Linux LTS kernel and that SSD model to hate each other in some cases.
Also note that on other, identical systems, the same software and SSD model run flawlessly together.

Last edited by schard (2022-08-18 09:06:38)

seth · 2022-08-18 14:38:13

Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=49s
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 0e f1 af 10 00 00 58 00
Aug 16 07:09:35 1269 kernel: blk_update_request: I/O error, dev sda, sector 250720016 op 0x1:(WRITE) flags 0x800 phys_seg 11 prio class 0

Is there absolutely nothing (ata or not ata) preceeding this? It just times out?
Did you smartctl check those drives?

You wrote:

It ran for over 2 hours without any issues.
I tried downgrading to kernel 5.15.45-1-lts, with which we initially had the issue.

And you before wrote:

On the very same system, Windows 10 runs seamlessly.

Obvious question reg. windows (no, not that one): how long did it run seamlessly and also how long did it take before the error occurred on 5.15.45-1-lts and have you ever encountered it on any other kernel version?

schard · 2022-08-18 15:36:44

seth wrote:

Is there absolutely nothing (ata or not ata) preceeding this? It just times out?

Nope. In one case the previous journal entry was the log from the iperf3 server, but nothing disk i/o-related

seth wrote:

Did you smartctl check those drives?

Yes, no problems detected on either of them. That is what adds to my amazement.
I ran several short and long test. Nothing. Nada. All ok according to smartctl.

seth wrote:

You wrote:
It ran for over 2 hours without any issues.
I tried downgrading to kernel 5.15.45-1-lts, with which we initially had the issue.
And you before wrote:
On the very same system, Windows 10 runs seamlessly.
Obvious question reg. windows (no, not that one): how long did it run seamlessly and also how long did it take before the error occurred on 5.15.45-1-lts and have you ever encountered it on any other kernel version?

Windows ran for approximately two days.
The errors occurred after different intervals on Linux.
On some systems it took several weeks.
When I tested them in the office, one did only take slightly more than one hour to fail (I did a complete reinstall there on the SSD).
Regarding the specific kernel version, I do not recall the exact time, but the issue manifested on several releases of the 5.15.xx series.

Last edited by schard (2022-08-18 15:38:07)

schard · 2022-08-19 07:31:53

Malfunctioning System #1 (1284)
It happened on another machine tonight:
journal

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.55-2-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA_IMSS316-256GD
Serial Number:    2L0929C9422U
LU WWN Device Id: 5 707c18 100a1cd7f
Firmware Version: S180718b
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Aug 19 09:24:59 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   33) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   2) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0039)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       8263
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       17
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   098   098   010    Pre-fail  Always       -       7340046
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       12885491718
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   020    Pre-fail  Always       -       10072
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       11
194 Temperature_Celsius     0x0022   049   049   030    Old_age   Always       -       51 (Min/Max 22/63)
231 Unknown_SSD_Attribute   0x0033   100   100   005    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       2630788800
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1965500795840
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1431578857
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       1359211899

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8158         -
# 2  Extended offline    Completed without error       00%      7990         -
# 3  Extended offline    Completed without error       00%      7822         -
# 4  Extended offline    Completed without error       00%      7654         -
# 5  Extended offline    Completed without error       00%      7486         -
# 6  Extended offline    Completed without error       00%      7318         -
# 7  Extended offline    Completed without error       00%      7150         -
# 8  Extended offline    Completed without error       00%      6982         -
# 9  Extended offline    Completed without error       00%      6814         -
#10  Extended offline    Completed without error       00%      6647         -
#11  Extended offline    Completed without error       00%      6481         -
#12  Extended offline    Completed without error       00%      6313         -
#13  Extended offline    Completed without error       00%      6146         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sda:

ATA device, with non-removable media
	Model Number:       ADATA_IMSS316-256GD                     
	Serial Number:      2L0929C9422U        
	Firmware Revision:  S180718b
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Supported: 9 8 7 6 5 
	Likely used: 9
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:    16514064
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:   500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Form Factor: 2.5 inch
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	   *	48-bit Address feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	Device automatic Partial to Slumber transitions
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	   *	DMA Setup Auto-Activate optimization
	   *	Device-initiated interface power management
	   *	Software settings preservation
	    	Device Sleep (DEVSLP)
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a1cd7f
	NAA		: 5
	IEEE OUI	: 707c18
	Unique ID	: 100a1cd7f
Device Sleep:
	DEVSLP Exit Timeout (DETO): 100 ms (drive)
	Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct

PS
Funny thing, the wakeup error occured first some days ago without subsequent i/o errors:

Aug 13 07:43:55 1284 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Aug 13 07:44:13 1284 kernel: ata2: SError: { CommWake }
Aug 13 07:44:13 1284 kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 13 07:44:13 1284 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 23
                                      res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 13 07:44:13 1284 kernel: ata2.00: status: { DRDY }
Aug 13 07:44:13 1284 kernel: ata2: hard resetting link
Aug 13 07:44:13 1284 kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 13 07:44:13 1284 kernel: ata2: COMRESET failed (errno=-16)
Aug 13 07:44:13 1284 kernel: ata2: hard resetting link
Aug 13 07:44:13 1284 kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 13 07:44:13 1284 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 13 07:44:13 1284 kernel: ata2.00: configured for UDMA/133
Aug 13 07:44:13 1284 kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 13 07:44:13 1284 kernel: ata2: EH complete

Only the night before last, after more than 15 minutes of doing nothing, the first i/o errors happened:

Aug 18 01:06:06 1284 html5ds[45729]: [45729:45729:0818/010606.290100:INFO:CONSOLE(265)] "################ Application started after: 0.825 secs ################", source: http://localhost/javascript/Console.js (265)
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 0b 63 c8 68 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088744 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886094)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757837
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 0b 63 c8 18 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088664 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886084)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757827
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 0b 63 cc 58 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191089752 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886220)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757963
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 0b 63 cc 08 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191089672 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886210)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757953
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#18 CDB: Write(10) 2a 00 0b 63 cb f0 00 00 10 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191089648 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886208)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757950
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757951
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#12 CDB: Write(10) 2a 00 0b 63 c8 90 00 00 18 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088784 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886101)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757842
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757843
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757844
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 0b 63 c8 80 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088768 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886097)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757840
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#30 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#30 CDB: Write(10) 2a 00 0c 8f aa 80 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 210741888 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214480, lost async page write
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#29 CDB: Write(10) 2a 00 0c 8f a9 a8 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 210741672 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214453, lost async page write
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#28 CDB: Write(10) 2a 00 0c 8f a9 10 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 210741520 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214434, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214416, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 25396382, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 1069, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 13, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 0, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 1, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214558, lost async page write
Aug 18 01:23:22 1284 kernel: EXT4-fs error (device sda2): ext4_check_bdev_write_error:217: comm nginx: Error while async write back metadata
Aug 18 01:23:22 1284 kernel: EXT4-fs (sda2): previous I/O error to superblock detected
Aug 18 02:37:30 1284 html5ds[45729]: [45729:45756:0818/023730.373484:ERROR:zygote_host_impl_linux.cc(263)] Failed to adjust OOM score of renderer with pid 64298: Keine Berechtigung (13)

Last evening, the kernel finally remounted the root file system read-only, but for no apparent reason:

Aug 18 20:37:35 1284 html5ds[65569]: [65569:65596:0818/203735.458803:ERROR:zygote_host_impl_linux.cc(263)] Failed to adjust OOM score of renderer with pid 66133: Keine Berechtigung (13)
Aug 18 22:48:28 1284 kernel: Aborting journal on device sda2-8.
Aug 18 22:48:28 1284 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Aug 18 22:48:28 1284 kernel: EXT4-fs (sda2): Remounting filesystem read-only
Aug 19 00:00:03 1284 systemd[1]: Started Verify integrity of password and group files.

WTF is going on here?

PPS
After a reboot:
journal

Malfunctioning System #2 (23)
Another system failed over night.
This is one of those, where I added your proposed kernel parameter pcie_aspm=off:
journal

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.61-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA_IMSS316-256GD
Serial Number:    2L2329QK59HE
LU WWN Device Id: 5 707c18 100a69f6b
Firmware Version: S180718b
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Aug 19 10:06:39 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(   33) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   2) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0039)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       6843
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       6
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   098   098   010    Pre-fail  Always       -       10485780
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       17180655622
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   020    Pre-fail  Always       -       10024
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       5
194 Temperature_Celsius     0x0022   054   054   030    Old_age   Always       -       46 (Min/Max 17/53)
231 Unknown_SSD_Attribute   0x0033   100   100   005    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       4394782784
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       422034482560
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       1939798177
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       717736864

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6777         -
# 2  Short offline       Completed without error       00%      6777         -
# 3  Extended offline    Completed without error       00%      6747         -
# 4  Extended offline    Completed without error       00%      6574         -
# 5  Extended offline    Completed without error       00%      6574         -
# 6  Extended offline    Completed without error       00%      6391         -
# 7  Extended offline    Completed without error       00%      6222         -
# 8  Extended offline    Completed without error       00%      6055         -
# 9  Extended offline    Completed without error       00%      5887         -
#10  Extended offline    Completed without error       00%      5718         -
#11  Extended offline    Completed without error       00%      5550         -
#12  Extended offline    Completed without error       00%      5382         -
#13  Extended offline    Completed without error       00%      5214         -
#14  Extended offline    Completed without error       00%      5046         -
#15  Extended offline    Completed without error       00%      4878         -
#16  Extended offline    Completed without error       00%      4710         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sda:

ATA device, with non-removable media
	Model Number:       ADATA_IMSS316-256GD                     
	Serial Number:      2L2329QK59HE        
	Firmware Revision:  S180718b
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Supported: 9 8 7 6 5 
	Likely used: 9
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:    16514064
	LBA    user addressable sectors:   268435455
	LBA48  user addressable sectors:   500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Form Factor: 2.5 inch
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	   *	48-bit Address feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	Device automatic Partial to Slumber transitions
	   *	READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
	   *	DMA Setup Auto-Activate optimization
	   *	Device-initiated interface power management
	   *	Software settings preservation
	    	Device Sleep (DEVSLP)
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management TRIM supported (limit 8 blocks)
	   *	Deterministic read ZEROs after TRIM
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a69f6b
	NAA		: 5
	IEEE OUI	: 707c18
	Unique ID	: 100a69f6b
Device Sleep:
	DEVSLP Exit Timeout (DETO): 100 ms (drive)
	Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct

Additional info
Fortunately, I could prevent the systems from hanging at boot with the prompt to run an fsck, by providing the kernel with the parameter fsck.repair=yes.
Here's the second (23) system's journal after such a reboot:
journal

Last edited by schard (2022-08-19 08:26:57)

seth · 2022-08-19 08:26:40

They *all* have the commwake error (but no ata bus error), even if it's turning fatal on delay, there's some™ issue w/ that…

hdparm -S 0 /dev/sda

?

schard · 2022-08-19 08:29:45

Last login: Fri Aug 19 10:25:23 2022 from fdbc:83e9:4512:ea57::1
0 ✓ 1284 ~ $ hdparm -S 0 /dev/sda

/dev/sda:
 setting standby to 0 (off)

Last login: Fri Aug 19 10:17:54 2022 from fdbc:83e9:4512:ea57::1
0 ✓ 23 ~ $ hdparm -S 0 /dev/sda

/dev/sda:
 setting standby to 0 (off)

And now we wait (I guess?).

Last edited by schard (2022-08-19 08:30:31)

seth · 2022-08-19 08:41:26

You could as a cross-test try to -Y the device, wait a bit and see whether subsequent attempts to access it immediately trigger the error.
The temperatures are all over the place (otherwise I'd have asked you whether you posses a hair-dryer )

schard · 2022-08-19 08:53:54

What do you know. I had journalctl -b --follow running, then issued hdparm -Y /dev/sda and instantly:

Aug 19 10:52:40 23 kernel: ata2.00: exception Emask 0x0 SAct 0x8000 SErr 0x50000 action 0x6
Aug 19 10:52:41 23 kernel: ata2.00: waking up from sleep
Aug 19 10:52:41 23 kernel: ata2: SError: { PHYRdyChg CommWake }
Aug 19 10:52:41 23 kernel: ata2: hard resetting link
Aug 19 10:52:41 23 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 19 10:52:41 23 kernel: ata2.00: configured for UDMA/133
Aug 19 10:52:41 23 kernel: ata2: EH complete

I accidentally issued the command a second time and got a weird log message:

Aug 19 10:56:27 23 kernel: ata2.00: exception Emask 0x0 SAct 0xff0000 SErr 0x50000 action 0x6
Aug 19 10:56:27 23 kernel: ata2.00: waking up from sleep, waking up from sleep, waking up from sleep, waking up fro
Aug 19 10:56:27 23 kernel: ata2: SError: { PHYRdyChg CommWake }
Aug 19 10:56:27 23 kernel: ata2: hard resetting link
Aug 19 10:56:27 23 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 19 10:56:27 23 kernel: ata2.00: configured for UDMA/133
Aug 19 10:56:27 23 kernel: ata2: EH complete

The "waking up from sleep" line seems to be repeated multiple times.

Last edited by schard (2022-08-19 09:43:43)

seth · 2022-08-19 14:03:48

It's maybe flushing some stacked events.(?)

It very much looks like the device supports

Aug 18 09:54:36 archlinux kernel: ata2.00: Features: Dev-Sleep

more in theory than in practice.
Waiting whether the no-sleep instance acts up again is the only thing I can suggest now - idk whether there's any libata setting/quirk to apply here, but I guess a udev rule would hopefully suffice…

Arch Linux

#1 2022-07-13 14:06:00

[SOLVED] Spontaneous ro root file system on some machines

#2 2022-07-13 14:19:32

Re: [SOLVED] Spontaneous ro root file system on some machines

#3 2022-07-13 14:24:50

Re: [SOLVED] Spontaneous ro root file system on some machines

#4 2022-07-13 14:49:20

Re: [SOLVED] Spontaneous ro root file system on some machines

#5 2022-07-13 15:18:09

Re: [SOLVED] Spontaneous ro root file system on some machines

#6 2022-07-14 08:32:11

Re: [SOLVED] Spontaneous ro root file system on some machines

#7 2022-07-14 14:11:39

Re: [SOLVED] Spontaneous ro root file system on some machines

#8 2022-07-14 14:28:14

Re: [SOLVED] Spontaneous ro root file system on some machines

#9 2022-07-15 08:29:35

Re: [SOLVED] Spontaneous ro root file system on some machines

#10 2022-08-10 16:44:01

Re: [SOLVED] Spontaneous ro root file system on some machines

#11 2022-08-10 17:17:24

Re: [SOLVED] Spontaneous ro root file system on some machines

#12 2022-08-10 22:16:19

Re: [SOLVED] Spontaneous ro root file system on some machines

#13 2022-08-12 15:14:11

Re: [SOLVED] Spontaneous ro root file system on some machines

#14 2022-08-12 15:26:27

Re: [SOLVED] Spontaneous ro root file system on some machines

#15 2022-08-12 16:09:42

Re: [SOLVED] Spontaneous ro root file system on some machines

#16 2022-08-15 20:55:48

Re: [SOLVED] Spontaneous ro root file system on some machines

#17 2022-08-18 09:05:01

Re: [SOLVED] Spontaneous ro root file system on some machines

#18 2022-08-18 14:38:13

Re: [SOLVED] Spontaneous ro root file system on some machines

#19 2022-08-18 15:36:44

Re: [SOLVED] Spontaneous ro root file system on some machines

#20 2022-08-19 07:31:53

Re: [SOLVED] Spontaneous ro root file system on some machines

#21 2022-08-19 08:26:40

Re: [SOLVED] Spontaneous ro root file system on some machines

#22 2022-08-19 08:29:45

Re: [SOLVED] Spontaneous ro root file system on some machines

#23 2022-08-19 08:41:26

Re: [SOLVED] Spontaneous ro root file system on some machines

#24 2022-08-19 08:53:54

Re: [SOLVED] Spontaneous ro root file system on some machines

#25 2022-08-19 14:03:48

Re: [SOLVED] Spontaneous ro root file system on some machines

Board footer