You are not logged in.
I have the issue that on some machines, the root file system is re-mountet read-only after some time.
The systems are digital signage systems with industry hardware, that run 24/7.
As an example I picked out one of the candidates:
0 ✓ 1261 ~ $ mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
dev on /dev type devtmpfs (rw,nosuid,relatime,size=1930676k,nr_inodes=482669,mode=755,inode64)
run on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755,inode64)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
/dev/sda2 on / type ext4 (ro,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11553)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=1955248k,nr_inodes=1048576,inode64)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
none on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/sda1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=391048k,nr_inodes=97762,mode=700,inode64)
0 ✓ 1261 ~ $ cat /etc/fstab
# /dev/sda2 UUID=a1982c7a-261b-451b-8dce-570e69bcf568
LABEL=root / ext4 rw,relatime 0 1
# /dev/sda1 UUID=2290-764C
LABEL=EFI /boot vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro 0 2
0 ✓ 1261 ~ $ LANG=C df -h
Filesystem Size Used Avail Use% Mounted on
dev 1.9G 0 1.9G 0% /dev
run 1.9G 820K 1.9G 1% /run
/dev/sda2 234G 7.0G 215G 4% /
tmpfs 1.9G 0 1.9G 0% /dev/shm
tmpfs 1.9G 0 1.9G 0% /tmp
/dev/sda1 499M 53M 447M 11% /boot
tmpfs 382M 0 382M 0% /run/user/0
The interesting part seems to be:
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 0e f3 48 28 00 00 18 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 250824744 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 00 e2 ea e0 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871264 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858909)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730652
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 00 e2 ea 18 00 00 18 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871064 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858886)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730627
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730628
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730629
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 00 e2 e9 08 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14870792 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858850)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730593
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#23 CDB: Write(10) 2a 00 00 e2 eb 88 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871432 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858930)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730673
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 00 e2 eb 00 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871296 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858913)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730656
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 e2 e9 d0 00 00 10 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14870992 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858876)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730618
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730619
Jun 22 22:20:28 1261 kernel: Aborting journal on device sda2-8.
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5731: Journal has aborted
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5927: inode #524975: comm chromium: mark_inode_dirty error
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5928: Journal has aborted
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Jun 22 22:20:28 1261 kernel: EXT4-fs (sda2): Remounting filesystem read-only
Jun 22 22:21:06 1261 sshd[73397]: error: kex_exchange_identification: read: Connection reset by peer
Does anybody have an idea why this is happening?
According to smartctl, the disk is fine:
$ smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.45-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ADATA_IMSS316-256GD
Serial Number: 2K422L127DKD
LU WWN Device Id: 5 707c18 100985dfd
Firmware Version: S180718b
User Capacity: 256.060.514.304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Jul 13 16:05:38 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 33) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 8442
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 9
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 099 099 010 Pre-fail Always - 4194312
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 12885229571
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 100 100 020 Pre-fail Always - 10120
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 055 055 030 Old_age Always - 45 (Min/Max 21/61)
231 Unknown_SSD_Attribute 0x0033 100 100 005 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 1514614464
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 3033051297024
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 542263652
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 238273046
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8384 -
# 2 Extended offline Completed without error 00% 8206 -
# 3 Extended offline Completed without error 00% 8028 -
# 4 Extended offline Completed without error 00% 7844 -
# 5 Extended offline Completed without error 00% 7676 -
# 6 Extended offline Completed without error 00% 7508 -
# 7 Extended offline Completed without error 00% 7341 -
# 8 Extended offline Completed without error 00% 7173 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Last edited by schard (2023-08-31 20:50:45)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
The change to read only is usually an indication of a I/O error, which it appears has happened in your case.
Are these SSDs or rotating disks? Are these indoor or outdoor displays? I understand ambient temperatures have been near 30C in (what I assume) is your part of the planet. With solar loading and normal operating temperature rise you could well be bumping your head on the maximum operating temperatures for commercial rated equipment. Heck, around here with solar loading our passively cooled systems can blow through the operating temperature limits of industrial rated drives
Last edited by ewaller (2022-07-13 14:19:57)
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
The systems are mounted indoors.
They are tenant information systems (digital black boards) mounted on the walls near the entry of stair cases in apartment buildings.
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
* cable
* cable
* power supply
* cable
* power saving
* cable
If this happens somewhat predictably, can you prevent it by keeping the drive busy, eg. frequently read random files (to avoid the cache) into /dev/null or so?
Or, delay the low-power (hdparm -S 251, check the manpage on the parameter) or maybe vv. trigger the behavior by sending the drive to sleep.
Also keep in mind any power management daemons - and aspm might spoil you if it's the bus.
Did I mention the cable?
Offline
Seth, I agree about the cable. What I am hung up on is that this is multiple machines. That leads me to want to look for something systemic.
Schard, how long have these been fielded? Were they all fielded at the same time, or where they installed progressively over time?
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
@seth I assume with "cable" you are referring to the connection between the mainboard and the SSD?
The SSD is an mSATA SSD and is thusly directly plugged into a socket on the mainboard.
While it is unlikely that dust has come between the contacts and plug, I will check this out once the device has been brought in.
I will also check any possible power saving settings.
However, we have hundreds of identical devices (all running Arch Linux) out in the field and this happens only on a tiny fraction of those.
Another problem is, that after issuing a reboot, the devices do not boot anymore, but usually hang with this error message.
@ewaller
The systems have been out in the field for different amounts of time. Some are new, some are older. The system in question is rather new and has been deployed for about 2 1/2 years.
Last edited by schard (2022-07-14 08:33:13)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
This is a clear case of bitrot
What fixes them? Reinstalling? Reformatting? Replacing the drive?
My hunch remains is that it is systemic and is environmental, though I will never discount cables. I should mention, cables are the bane of every engineer's existence, myself included.
If is is not thermal, perhaps EMC/EMI? Unlikely to be shock, vibration, humidity, sand, dust, fungal growth or corrosion due to salt fog.
What do the power supplies look like? I am sure you are using stuff that meet the CE directives for conducted susceptibility for low voltage systems -- but mains power can get ugly this time of year too. Any signs other systems are spontaneously rebooting? Any evidence of power failures across the network?
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
The SSD is an mSATA SSD and is thusly directly plugged into a socket on the mainboard.
While it is unlikely that dust has come between the contacts and plug, I will check this out once the device has been brought in.
For mSATAs, the demon is usually the securing screw/s
They can end up bending the card out of the slot and certainly get it under a lot of tension. You might even have lost the slot, by the card ripping it off the mainboard.
If all devices are the same, this could come down to how tight the screw was fixed - or whether it's canted.
However, we have hundreds of identical devices (all running Arch Linux) out in the field and this happens only on a tiny fraction of those.
Suggests it's the HW, not the SW.
Offline
What fixes them? Reinstalling? Reformatting? Replacing the drive?
Indeed in many cases a re-installation of the whole OS fixes the issue and the systems run again fine with the same disks.
This it what baffles me. Of course also the hardware manufacturer uses this fact as an excuse that the error must be software-related and not related to their hardware.
What do the power supplies look like?
The power supplies are built into the systems and are not accessible from the access cover for the main board.
I'll have to ask the hardware manufacturer for details.
Any signs other systems are spontaneously rebooting?
None that I know of. The systems run fine 24/7 without reboots.
The only reboots are done on the more or less regular "patch days".
Any evidence of power failures across the network?
Unfortunately it is hard to come by this kind of information.
The systems are spread all over the country and power grid failures are a fairly seldom occasion here in Germany and mostly a local occurence.
I know that we had down times in the past due to power failures in at least two cities, but none of them resulted in defects like this.
The systems just booted up without any issues once power was restored.
For mSATAs, the demon is usually the securing screw/s
I will definitely check this out.
I expect the system to be brought in within the upcoming week.
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
So. After analyzing a faulty system this week, I found out, by just swapping around SSDs, that the SSD of the faulty system is defective. It produces the same errors when plugged into another system and the faulty system runs fine with another SSD.
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
Glad you got it sorted, but I thought you had said 'some' machines? That led me to look for something more systemic.
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
Yes there are more, but I doubt I will get to have a look at them, because, well, costs.
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
I have to amend, that this topic is not yet solved.
We decided to test the disk further.
On the very same system, Windows 10 runs seamlessly.
Chkdsk does not detect any errors with the disk during an extended check (chkdsk /R).
Also the ADATA SSD ToolBox finds nothing wrong with the drive in either the short or extended test.
I am starting to consider a kernel bug.
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
Also keep in mind any power management daemons - and aspm might spoil you if it's the bus.
Tried "pcie_aspm=off" ?
Offline
Not yet, thanks for the reminder. Will try on Monday.
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
I tested the SSD with
# shred -z /dev/sda
from a live system, which terminated after 2 1/2 hours or so without any errors (exit code 0, no output)
No I/O errors so far.
Some more details:
dmesg
journalctl -b
/proc/cmdline
lspci -nnk
hdparm -I /dev/sda
smartctl -a /dev/sda
Update
I reinstalled our most recent system image.
It ran for over 2 hours without any issues.
I tried downgrading to kernel 5.15.45-1-lts, with which we initially had the issue.
So far the system even runs on this kernel without any issues: http://ix.io/47L8
I am absolutely baffled as to why the I/O errors suddenly disappeared.
Update 2
I got the same general issue on three other systems.
They have an additional error message before the I/O error occurs:
Aug 16 06:33:11 1269 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Aug 16 06:33:13 1269 kernel: ata2: SError: { CommWake }
Aug 16 06:33:13 1269 kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 16 06:33:13 1269 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 16 06:33:13 1269 kernel: ata2.00: status: { DRDY }
Aug 16 06:33:13 1269 kernel: ata2: hard resetting link
Aug 16 06:33:13 1269 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 16 06:33:13 1269 kernel: ata2.00: configured for UDMA/133
Aug 16 06:33:13 1269 kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 16 06:33:13 1269 kernel: ata2: EH complete
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=49s
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 0e f1 af 10 00 00 58 00
Aug 16 07:09:35 1269 kernel: blk_update_request: I/O error, dev sda, sector 250720016 op 0x1:(WRITE) flags 0x800 phys_seg 11 prio class 0
Aug 16 07:09:35 1269 kernel: Aborting journal on device sda2-8.
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_do_update_inode:5127: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm Chrome_IOThread: mark_inode_dirty error
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: IO failure
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm MemoryInfra: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm chromium: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm chromium: Detected aborted journal
Aug 16 07:09:36 1269 kernel: EXT4-fs (sda2): Remounting filesystem read-only
If I interpret this correctly, there is some issue with waking up the disk from a suspended state.
Time to finally try Seth's advise.
Last edited by schard (2022-08-16 13:58:26)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
So, after a few more tests, I found out, that the issue only occurs on systems with SSDs of the type ADATA_IMSS316-256GD.
In some cases, before the I/O errors happen, there are the aforementioned ata2 errors:
Aug 16 06:33:11 1269 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Aug 16 06:33:13 1269 kernel: ata2: SError: { CommWake }
Aug 16 06:33:13 1269 kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 16 06:33:13 1269 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 16 06:33:13 1269 kernel: ata2.00: status: { DRDY }
Aug 16 06:33:13 1269 kernel: ata2: hard resetting link
Aug 16 06:33:13 1269 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 16 06:33:13 1269 kernel: ata2.00: configured for UDMA/133
Aug 16 06:33:13 1269 kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 16 06:33:13 1269 kernel: ata2: EH complete
however, on other system, those do not occur and the journal only shows the I/O errors with the subsequent ro-remounts:
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=49s
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 0e f1 af 10 00 00 58 00
Aug 16 07:09:35 1269 kernel: blk_update_request: I/O error, dev sda, sector 250720016 op 0x1:(WRITE) flags 0x800 phys_seg 11 prio class 0
Aug 16 07:09:35 1269 kernel: Aborting journal on device sda2-8.
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_do_update_inode:5127: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm Chrome_IOThread: mark_inode_dirty error
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: IO failure
Aug 16 07:09:35 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm MemoryInfra: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5726: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5922: inode #2491101: comm chromium: mark_inode_dirty error
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5923: Journal has aborted
Aug 16 07:09:36 1269 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm chromium: Detected aborted journal
Aug 16 07:09:36 1269 kernel: EXT4-fs (sda2): Remounting filesystem read-only
In all cases I could salvage the FS and prevent a hangup upon (re-)boot via fsck.
An example (with German locales, sorry) here:
$ fsck -y /dev/sda2
fsck von util-linux 2.38.1
e2fsck 1.46.5 (30-Dec-2021)
root: Journal wird wiederhergestellt
root enthält ein fehlerhaftes Dateisystem, Prüfung erzwungen.
Durchgang 1: Inodes, Blöcke und Größen werden geprüft
Inodes wurden gefunden, die Teil einer defekten verketteten Liste von
verwaisten Inodes waren. Reparieren? ja
Inode 15204591 war Teil der Liste verwaister Inodes. REPARIERT.
dtime für gelöschten Inode 15204653 ist Null. Reparieren? ja
Durchgang 2: Verzeichnisstruktur wird geprüft
Durchgang 3: Verzeichnisverknüpfungen werden geprüft
Durchgang 4: Referenzzähler werden überprüft
Durchgang 5: Zusammengefasste Gruppeninformation wird geprüft
Unterschiede in der Block-Bitmap: -(6877184--6878207) -48955393
Reparieren? ja
Die Anzahl freier Blöcke in Gruppe #209 ist falsch (24781, gezählt=25805).
Reparieren? ja
Die Anzahl freier Blöcke in Gruppe #1494 ist falsch (32757, gezählt=32758).
Reparieren? ja
Die Anzahl freier Blöcke ist falsch (59706261, gezählt=59655030).
Reparieren? ja
Unterschiede in der Inode-Bitmap: -15204591 -15204653
Reparieren? ja
Die Anzahl freier Inodes für Gruppe #1856 ist falsch (4416, gezählt=4418).
Reparieren? ja
Die Anzahl der Verzeichnisse für Gruppe #1856 ist falsch (568, gezählt=567).
Reparieren? ja
Die Anzahl freier Inodes ist falsch (15505893, gezählt=15505798).
Reparieren? ja
root: ***** DATEISYSTEM WURDE VERÄNDERT *****
root: ***** LINUX MUSS NEU GESTARTET WERDEN *****
root: 91770/15597568 Dateien (0.2% nicht zusammenhängend), 2731483/62386513 Blöcke
On two systems, the errors disappeared with the new kernel parameter pcie_aspm=off.
On another system the I/O errors without the preceding ata2 errors recurred after adding the parameter.
I am at a loss as to what causes the Linux LTS kernel and that SSD model to hate each other in some cases.
Also note that on other, identical systems, the same software and SSD model run flawlessly together.
Last edited by schard (2022-08-18 09:06:38)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=49s
Aug 16 07:09:35 1269 kernel: sd 1:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 0e f1 af 10 00 00 58 00
Aug 16 07:09:35 1269 kernel: blk_update_request: I/O error, dev sda, sector 250720016 op 0x1:(WRITE) flags 0x800 phys_seg 11 prio class 0
Is there absolutely nothing (ata or not ata) preceeding this? It just times out?
Did you smartctl check those drives?
It ran for over 2 hours without any issues.
I tried downgrading to kernel 5.15.45-1-lts, with which we initially had the issue.
On the very same system, Windows 10 runs seamlessly.
Obvious question reg. windows (no, not that one): how long did it run seamlessly and also how long did it take before the error occurred on 5.15.45-1-lts and have you ever encountered it on any other kernel version?
Offline
Is there absolutely nothing (ata or not ata) preceeding this? It just times out?
Nope. In one case the previous journal entry was the log from the iperf3 server, but nothing disk i/o-related
Did you smartctl check those drives?
Yes, no problems detected on either of them. That is what adds to my amazement.
I ran several short and long test. Nothing. Nada. All ok according to smartctl.
You wrote:It ran for over 2 hours without any issues.
I tried downgrading to kernel 5.15.45-1-lts, with which we initially had the issue.And you before wrote:On the very same system, Windows 10 runs seamlessly.
Obvious question reg. windows (no, not that one): how long did it run seamlessly and also how long did it take before the error occurred on 5.15.45-1-lts and have you ever encountered it on any other kernel version?
Windows ran for approximately two days.
The errors occurred after different intervals on Linux.
On some systems it took several weeks.
When I tested them in the office, one did only take slightly more than one hour to fail (I did a complete reinstall there on the SSD).
Regarding the specific kernel version, I do not recall the exact time, but the issue manifested on several releases of the 5.15.xx series.
Last edited by schard (2022-08-18 15:38:07)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
Malfunctioning System #1 (1284)
It happened on another machine tonight:
journal
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.55-2-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ADATA_IMSS316-256GD
Serial Number: 2L0929C9422U
LU WWN Device Id: 5 707c18 100a1cd7f
Firmware Version: S180718b
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Aug 19 09:24:59 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 33) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 8263
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 17
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 098 098 010 Pre-fail Always - 7340046
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 12885491718
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 100 100 020 Pre-fail Always - 10072
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 11
194 Temperature_Celsius 0x0022 049 049 030 Old_age Always - 51 (Min/Max 22/63)
231 Unknown_SSD_Attribute 0x0033 100 100 005 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 2630788800
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1965500795840
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 1431578857
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 1359211899
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8158 -
# 2 Extended offline Completed without error 00% 7990 -
# 3 Extended offline Completed without error 00% 7822 -
# 4 Extended offline Completed without error 00% 7654 -
# 5 Extended offline Completed without error 00% 7486 -
# 6 Extended offline Completed without error 00% 7318 -
# 7 Extended offline Completed without error 00% 7150 -
# 8 Extended offline Completed without error 00% 6982 -
# 9 Extended offline Completed without error 00% 6814 -
#10 Extended offline Completed without error 00% 6647 -
#11 Extended offline Completed without error 00% 6481 -
#12 Extended offline Completed without error 00% 6313 -
#13 Extended offline Completed without error 00% 6146 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sda:
ATA device, with non-removable media
Model Number: ADATA_IMSS316-256GD
Serial Number: 2L0929C9422U
Firmware Revision: S180718b
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 500118192
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 244198 MBytes
device size with M = 1000*1000: 256060 MBytes (256 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* NOP cmd
* DOWNLOAD_MICROCODE
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* Device automatic Partial to Slumber transitions
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Device-initiated interface power management
* Software settings preservation
Device Sleep (DEVSLP)
* SMART Command Transport (SCT) feature set
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a1cd7f
NAA : 5
IEEE OUI : 707c18
Unique ID : 100a1cd7f
Device Sleep:
DEVSLP Exit Timeout (DETO): 100 ms (drive)
Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct
PS
Funny thing, the wakeup error occured first some days ago without subsequent i/o errors:
Aug 13 07:43:55 1284 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
Aug 13 07:44:13 1284 kernel: ata2: SError: { CommWake }
Aug 13 07:44:13 1284 kernel: ata2.00: failed command: FLUSH CACHE EXT
Aug 13 07:44:13 1284 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 23
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 13 07:44:13 1284 kernel: ata2.00: status: { DRDY }
Aug 13 07:44:13 1284 kernel: ata2: hard resetting link
Aug 13 07:44:13 1284 kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 13 07:44:13 1284 kernel: ata2: COMRESET failed (errno=-16)
Aug 13 07:44:13 1284 kernel: ata2: hard resetting link
Aug 13 07:44:13 1284 kernel: ata2: link is slow to respond, please be patient (ready=0)
Aug 13 07:44:13 1284 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 13 07:44:13 1284 kernel: ata2.00: configured for UDMA/133
Aug 13 07:44:13 1284 kernel: ata2.00: retrying FLUSH 0xea Emask 0x4
Aug 13 07:44:13 1284 kernel: ata2: EH complete
Only the night before last, after more than 15 minutes of doing nothing, the first i/o errors happened:
Aug 18 01:06:06 1284 html5ds[45729]: [45729:45729:0818/010606.290100:INFO:CONSOLE(265)] "################ Application started after: 0.825 secs ################", source: http://localhost/javascript/Console.js (265)
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 0b 63 c8 68 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088744 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886094)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757837
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 0b 63 c8 18 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088664 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886084)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757827
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 0b 63 cc 58 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191089752 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886220)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757963
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 0b 63 cc 08 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191089672 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886210)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757953
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#18 CDB: Write(10) 2a 00 0b 63 cb f0 00 00 10 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191089648 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886208)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757950
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757951
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#12 CDB: Write(10) 2a 00 0b 63 c8 90 00 00 18 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088784 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886101)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757842
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757843
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757844
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=33s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#11 CDB: Write(10) 2a 00 0b 63 c8 80 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 191088768 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 6553960 starting block 23886097)
Aug 18 01:23:22 1284 kernel: Buffer I/O error on device sda2, logical block 23757840
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#30 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#30 CDB: Write(10) 2a 00 0c 8f aa 80 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 210741888 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214480, lost async page write
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#29 CDB: Write(10) 2a 00 0c 8f a9 a8 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 210741672 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214453, lost async page write
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
Aug 18 01:23:22 1284 kernel: sd 1:0:0:0: [sda] tag#28 CDB: Write(10) 2a 00 0c 8f a9 10 00 00 08 00
Aug 18 01:23:22 1284 kernel: blk_update_request: I/O error, dev sda, sector 210741520 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214434, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214416, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 25396382, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 1069, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 13, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 0, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 1, lost async page write
Aug 18 01:23:22 1284 kernel: Buffer I/O error on dev sda2, logical block 26214558, lost async page write
Aug 18 01:23:22 1284 kernel: EXT4-fs error (device sda2): ext4_check_bdev_write_error:217: comm nginx: Error while async write back metadata
Aug 18 01:23:22 1284 kernel: EXT4-fs (sda2): previous I/O error to superblock detected
Aug 18 02:37:30 1284 html5ds[45729]: [45729:45756:0818/023730.373484:ERROR:zygote_host_impl_linux.cc(263)] Failed to adjust OOM score of renderer with pid 64298: Keine Berechtigung (13)
Last evening, the kernel finally remounted the root file system read-only, but for no apparent reason:
Aug 18 20:37:35 1284 html5ds[65569]: [65569:65596:0818/203735.458803:ERROR:zygote_host_impl_linux.cc(263)] Failed to adjust OOM score of renderer with pid 66133: Keine Berechtigung (13)
Aug 18 22:48:28 1284 kernel: Aborting journal on device sda2-8.
Aug 18 22:48:28 1284 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Aug 18 22:48:28 1284 kernel: EXT4-fs (sda2): Remounting filesystem read-only
Aug 19 00:00:03 1284 systemd[1]: Started Verify integrity of password and group files.
WTF is going on here?
PPS
After a reboot:
journal
Malfunctioning System #2 (23)
Another system failed over night.
This is one of those, where I added your proposed kernel parameter pcie_aspm=off:
journal
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.61-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ADATA_IMSS316-256GD
Serial Number: 2L2329QK59HE
LU WWN Device Id: 5 707c18 100a69f6b
Firmware Version: S180718b
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database 7.3/5319
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Aug 19 10:06:39 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 33) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 6843
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 6
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 098 098 010 Pre-fail Always - 10485780
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 17180655622
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 100 100 020 Pre-fail Always - 10024
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 5
194 Temperature_Celsius 0x0022 054 054 030 Old_age Always - 46 (Min/Max 17/53)
231 Unknown_SSD_Attribute 0x0033 100 100 005 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 4394782784
234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 422034482560
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 1939798177
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 717736864
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 6777 -
# 2 Short offline Completed without error 00% 6777 -
# 3 Extended offline Completed without error 00% 6747 -
# 4 Extended offline Completed without error 00% 6574 -
# 5 Extended offline Completed without error 00% 6574 -
# 6 Extended offline Completed without error 00% 6391 -
# 7 Extended offline Completed without error 00% 6222 -
# 8 Extended offline Completed without error 00% 6055 -
# 9 Extended offline Completed without error 00% 5887 -
#10 Extended offline Completed without error 00% 5718 -
#11 Extended offline Completed without error 00% 5550 -
#12 Extended offline Completed without error 00% 5382 -
#13 Extended offline Completed without error 00% 5214 -
#14 Extended offline Completed without error 00% 5046 -
#15 Extended offline Completed without error 00% 4878 -
#16 Extended offline Completed without error 00% 4710 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sda:
ATA device, with non-removable media
Model Number: ADATA_IMSS316-256GD
Serial Number: 2L2329QK59HE
Firmware Revision: S180718b
Transport: Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
Supported: 9 8 7 6 5
Likely used: 9
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 500118192
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 244198 MBytes
device size with M = 1000*1000: 256060 MBytes (256 GB)
cache/buffer size = unknown
Form Factor: 2.5 inch
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
* NOP cmd
* DOWNLOAD_MICROCODE
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* WRITE_UNCORRECTABLE_EXT command
* {READ,WRITE}_DMA_EXT_GPL commands
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Gen3 signaling speed (6.0Gb/s)
* Native Command Queueing (NCQ)
* Phy event counters
* Device automatic Partial to Slumber transitions
* READ_LOG_DMA_EXT equivalent to READ_LOG_EXT
* DMA Setup Auto-Activate optimization
* Device-initiated interface power management
* Software settings preservation
Device Sleep (DEVSLP)
* SMART Command Transport (SCT) feature set
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* Data Set Management TRIM supported (limit 8 blocks)
* Deterministic read ZEROs after TRIM
Security:
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
supported: enhanced erase
2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 5707c18100a69f6b
NAA : 5
IEEE OUI : 707c18
Unique ID : 100a69f6b
Device Sleep:
DEVSLP Exit Timeout (DETO): 100 ms (drive)
Minimum DEVSLP Assertion Time (MDAT): 31 ms (drive)
Checksum: correct
Additional info
Fortunately, I could prevent the systems from hanging at boot with the prompt to run an fsck, by providing the kernel with the parameter fsck.repair=yes.
Here's the second (23) system's journal after such a reboot:
journal
Last edited by schard (2022-08-19 08:26:57)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
They *all* have the commwake error (but no ata bus error), even if it's turning fatal on delay, there's some™ issue w/ that…
hdparm -S 0 /dev/sda
?
Offline
Last login: Fri Aug 19 10:25:23 2022 from fdbc:83e9:4512:ea57::1
0 ✓ 1284 ~ $ hdparm -S 0 /dev/sda
/dev/sda:
setting standby to 0 (off)
Last login: Fri Aug 19 10:17:54 2022 from fdbc:83e9:4512:ea57::1
0 ✓ 23 ~ $ hdparm -S 0 /dev/sda
/dev/sda:
setting standby to 0 (off)
And now we wait (I guess?).
Last edited by schard (2022-08-19 08:30:31)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
You could as a cross-test try to -Y the device, wait a bit and see whether subsequent attempts to access it immediately trigger the error.
The temperatures are all over the place (otherwise I'd have asked you whether you posses a hair-dryer )
Offline
What do you know. I had journalctl -b --follow running, then issued hdparm -Y /dev/sda and instantly:
Aug 19 10:52:40 23 kernel: ata2.00: exception Emask 0x0 SAct 0x8000 SErr 0x50000 action 0x6
Aug 19 10:52:41 23 kernel: ata2.00: waking up from sleep
Aug 19 10:52:41 23 kernel: ata2: SError: { PHYRdyChg CommWake }
Aug 19 10:52:41 23 kernel: ata2: hard resetting link
Aug 19 10:52:41 23 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 19 10:52:41 23 kernel: ata2.00: configured for UDMA/133
Aug 19 10:52:41 23 kernel: ata2: EH complete
I accidentally issued the command a second time and got a weird log message:
Aug 19 10:56:27 23 kernel: ata2.00: exception Emask 0x0 SAct 0xff0000 SErr 0x50000 action 0x6
Aug 19 10:56:27 23 kernel: ata2.00: waking up from sleep, waking up from sleep, waking up from sleep, waking up fro
Aug 19 10:56:27 23 kernel: ata2: SError: { PHYRdyChg CommWake }
Aug 19 10:56:27 23 kernel: ata2: hard resetting link
Aug 19 10:56:27 23 kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 19 10:56:27 23 kernel: ata2.00: configured for UDMA/133
Aug 19 10:56:27 23 kernel: ata2: EH complete
The "waking up from sleep" line seems to be repeated multiple times.
Last edited by schard (2022-08-19 09:43:43)
Inofficial first vice president of the Rust Evangelism Strike Force
Offline
It's maybe flushing some stacked events.(?)
It very much looks like the device supports
Aug 18 09:54:36 archlinux kernel: ata2.00: Features: Dev-Sleep
more in theory than in practice.
Waiting whether the no-sleep instance acts up again is the only thing I can suggest now - idk whether there's any libata setting/quirk to apply here, but I guess a udev rule would hopefully suffice…
Offline