#1 2022-07-13 14:06:00

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Spontaneous ro root file system on some machines

I have the issue that on some machines the root file system gets remounted read-only after some time.
The systems are digital signage systems with industrial hardware that run 24/7.
As an example I picked out one of the candidates:

0 ✓ 1261 ~ $ mount
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
dev on /dev type devtmpfs (rw,nosuid,relatime,size=1930676k,nr_inodes=482669,mode=755,inode64)
run on /run type tmpfs (rw,nosuid,nodev,relatime,mode=755,inode64)
efivarfs on /sys/firmware/efi/efivars type efivarfs (rw,nosuid,nodev,noexec,relatime)
/dev/sda2 on / type ext4 (ro,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
bpf on /sys/fs/bpf type bpf (rw,nosuid,nodev,noexec,relatime,mode=700)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11553)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,size=1955248k,nr_inodes=1048576,inode64)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,nosuid,nodev,noexec,relatime)
none on /run/credentials/systemd-sysusers.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
/dev/sda1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=391048k,nr_inodes=97762,mode=700,inode64)
0 ✓ 1261 ~ $ cat /etc/fstab 
# /dev/sda2 UUID=a1982c7a-261b-451b-8dce-570e69bcf568
LABEL=root          	/         	ext4      	rw,relatime	0 1

# /dev/sda1 UUID=2290-764C
LABEL=EFI           	/boot     	vfat      	rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro	0 2
0 ✓ 1261 ~ $ LANG=C df -h
Filesystem      Size  Used Avail Use% Mounted on
dev             1.9G     0  1.9G   0% /dev
run             1.9G  820K  1.9G   1% /run
/dev/sda2       234G  7.0G  215G   4% /
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G     0  1.9G   0% /tmp
/dev/sda1       499M   53M  447M  11% /boot
tmpfs           382M     0  382M   0% /run/user/0

dmesg
journalctl -b

The interesting part seems to be:

Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 0e f3 48 28 00 00 18 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 250824744 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#20 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#20 CDB: Write(10) 2a 00 00 e2 ea e0 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871264 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858909)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730652
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=40s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#19 CDB: Write(10) 2a 00 00 e2 ea 18 00 00 18 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871064 op 0x1:(WRITE) flags 0x0 phys_seg 2 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858886)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730627
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730628
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730629
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#31 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#31 CDB: Write(10) 2a 00 00 e2 e9 08 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14870792 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858850)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730593
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#23 CDB: Write(10) 2a 00 00 e2 eb 88 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871432 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858930)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730673
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 00 e2 eb 00 00 00 08 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14871296 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858913)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730656
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jun 22 22:20:28 1261 kernel: sd 1:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 e2 e9 d0 00 00 10 00
Jun 22 22:20:28 1261 kernel: blk_update_request: I/O error, dev sda, sector 14870992 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 22 22:20:28 1261 kernel: EXT4-fs warning (device sda2): ext4_end_bio:344: I/O error 10 writing to inode 524975 starting block 1858876)
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730618
Jun 22 22:20:28 1261 kernel: Buffer I/O error on device sda2, logical block 1730619
Jun 22 22:20:28 1261 kernel: Aborting journal on device sda2-8.
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2) in ext4_reserve_inode_write:5731: Journal has aborted
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2): ext4_dirty_inode:5927: inode #524975: comm chromium: mark_inode_dirty error
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2) in ext4_dirty_inode:5928: Journal has aborted
Jun 22 22:20:28 1261 kernel: EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
Jun 22 22:20:28 1261 kernel: EXT4-fs (sda2): Remounting filesystem read-only
Jun 22 22:21:06 1261 sshd[73397]: error: kex_exchange_identification: read: Connection reset by peer
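
As a cross-check on the log above (a minimal sketch, not part of the original post): in a SCSI Write(10) command, bytes 2-5 of the CDB hold the starting LBA and bytes 7-8 the transfer length, both big-endian. The CDB lines should therefore name the same sectors as the blk_update_request lines that follow them. The helper name decode_write10 is made up for illustration.

```shell
# Hypothetical helper: decode the LBA and length of a Write(10) CDB.
decode_write10() {
    # Arguments: the ten CDB bytes as printed by the kernel, e.g. 2a 00 0e f3 ...
    if [ "$#" -ne 10 ] || [ "$1" != "2a" ]; then
        echo "not a Write(10) CDB" >&2
        return 1
    fi
    # Bytes 2-5 (arguments 3-6) are the big-endian logical block address.
    lba=$(( 0x$3$4$5$6 ))
    # Bytes 7-8 (arguments 8-9) are the big-endian transfer length in blocks.
    blocks=$(( 0x$8$9 ))
    echo "lba=$lba blocks=$blocks"
}

# First failed request from the log (tag#21).
decode_write10 2a 00 0e f3 48 28 00 00 18 00
```

For tag#21 this prints lba=250824744 blocks=24, which agrees with "sector 250824744" in the blk_update_request line right after it.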

Does anybody have an idea why this is happening?
According to smartctl, the disk is fine:

$ smartctl -a /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.15.45-1-lts] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ADATA_IMSS316-256GD
Serial Number:    2K422L127DKD
LU WWN Device Id: 5 707c18 100985dfd
Firmware Version: S180718b
User Capacity:    256.060.514.304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jul 13 16:05:38 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249)	Self-test routine in progress...
					90% of test remaining.
Total time to complete Offline 
data collection: 		(   33) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   2) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0039)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       8442
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       9
167 Unknown_Attribute       0x0022   100   100   000    Old_age   Always       -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0013   099   099   010    Pre-fail  Always       -       4194312
173 Unknown_Attribute       0x0012   200   200   000    Old_age   Always       -       12885229571
175 Program_Fail_Count_Chip 0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   020    Pre-fail  Always       -       10120
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   055   055   030    Old_age   Always       -       45 (Min/Max 21/61)
231 Unknown_SSD_Attribute   0x0033   100   100   005    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       1514614464
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       3033051297024
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       542263652
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       238273046

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8384         -
# 2  Extended offline    Completed without error       00%      8206         -
# 3  Extended offline    Completed without error       00%      8028         -
# 4  Extended offline    Completed without error       00%      7844         -
# 5  Extended offline    Completed without error       00%      7676         -
# 6  Extended offline    Completed without error       00%      7508         -
# 7  Extended offline    Completed without error       00%      7341         -
# 8  Extended offline    Completed without error       00%      7173         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Last edited by schard (Yesterday 15:11:46)


Solidarity with Ukraine

#2 2022-07-13 14:19:32

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 18,641

Re: Spontaneous ro root file system on some machines

A switch to read-only is usually an indication of an I/O error, which appears to have happened in your case.
Are these SSDs or rotating disks?  Are these indoor or outdoor displays?  I understand ambient temperatures have been near 30C in what I assume is your part of the planet.  With solar loading and normal operating temperature rise you could well be bumping your head on the maximum operating temperatures for commercial-rated equipment.  Heck, around here with solar loading our passively cooled systems can blow through the operating temperature limits of industrial-rated drives.

Last edited by ewaller (2022-07-13 14:19:57)


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way

#3 2022-07-13 14:24:50

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

The systems are mounted indoors.
They are tenant information systems (digital black boards) mounted on the walls near the entry of stair cases in apartment buildings.


#4 2022-07-13 14:49:20

seth
Member
Registered: 2012-09-03
Posts: 30,881

Re: Spontaneous ro root file system on some machines

* cable
* cable
* power supply
* cable
* power saving
* cable

If this happens somewhat predictably, can you prevent it by keeping the drive busy, eg. frequently read random files (to avoid the cache) into /dev/null or so?
Or delay the low-power state (hdparm -S 251, check the manpage on the parameter), or maybe, vice versa, trigger the behavior by sending the drive to sleep.
Also keep in mind any power management daemons - and aspm might spoil you if it's the bus.
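
A rough sketch of the "keep the drive busy" idea, assuming GNU find, shuf and xargs are available. The function name read_random_files and the choice of /usr as the directory to read from are made up for illustration.

```shell
# Hypothetical helper: read up to COUNT randomly chosen regular files below
# DIR into /dev/null to generate some read traffic on the underlying drive.
read_random_files() {
    dir=$1
    count=${2:-5}
    # -xdev keeps find on one file system; NUL separators survive odd file names.
    find "$dir" -xdev -type f -print0 2>/dev/null \
        | shuf -z -n "$count" \
        | xargs -0 -r cat -- >/dev/null 2>&1 || true
}
```

It could be run in a loop such as `while sleep 60; do read_random_files /usr 5; done`. Picking files at random only makes page-cache hits less likely; it does not rule them out. For the other suggestion, `hdparm -S 251 /dev/sda` (as root) sets the standby timeout; according to the hdparm manpage, values 241 to 251 count in units of 30 minutes, so 251 should mean 5.5 hours. Whether this particular SSD honors the setting is not known.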

Did I mention the cable?

#5 2022-07-13 15:18:09

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 18,641

Re: Spontaneous ro root file system on some machines

Seth, I agree about the cable.  What I am hung up on is that this is multiple machines.  That leads me to want to look for something systemic.

Schard, how long have these been fielded? Were they all fielded at the same time, or were they installed progressively over time?


#6 2022-07-14 08:32:11

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

@seth I assume with "cable" you are referring to the connection between the mainboard and the SSD?
The SSD is an mSATA SSD and is thus plugged directly into a socket on the mainboard.
While it is unlikely that dust has come between the contacts and plug, I will check this out once the device has been brought in.
I will also check any possible power saving settings.
However, we have hundreds of identical devices (all running Arch Linux) out in the field and this happens only on a tiny fraction of those.

Another problem is that, after issuing a reboot, the devices do not boot anymore but usually hang with this error message.

@ewaller
The systems have been out in the field for different amounts of time. Some are new, some are older. The system in question is rather new and has been deployed for about 2 1/2 years.

Last edited by schard (2022-07-14 08:33:13)


#7 2022-07-14 14:11:39

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 18,641

Re: Spontaneous ro root file system on some machines

This is a clear case of bitrot ;)

What fixes them?  Reinstalling? Reformatting? Replacing the drive?
My hunch remains that it is systemic and environmental, though I will never discount cables.  I should mention, cables are the bane of every engineer's existence, myself included.
If it is not thermal, perhaps EMC/EMI?  Unlikely to be shock, vibration, humidity, sand, dust, fungal growth or corrosion due to salt fog.

What do the power supplies look like?  I am sure you are using stuff that meets the CE directives for conducted susceptibility for low voltage systems -- but mains power can get ugly this time of year too.  Any signs other systems are spontaneously rebooting?  Any evidence of power failures across the network?


#8 2022-07-14 14:28:14

seth
Member
Registered: 2012-09-03
Posts: 30,881

Re: Spontaneous ro root file system on some machines

schard wrote:

The SSD is an mSATA SSD and is thus plugged directly into a socket on the mainboard.
While it is unlikely that dust has come between the contacts and plug, I will check this out once the device has been brought in.

For mSATAs, the demon is usually the securing screw(s).
They can end up bending the card out of the slot and certainly put it under a lot of tension. You might even have lost the slot, with the card ripping it off the mainboard.
If all devices are the same, this could come down to how tightly the screw was fastened, or whether it's canted.

However, we have hundreds of identical devices (all running Arch Linux) out in the field and this happens only on a tiny fraction of those.

Suggests it's the HW, not the SW.

#9 2022-07-15 08:29:35

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

ewaller wrote:

What fixes them?  Reinstalling? Reformatting? Replacing the drive?

Indeed in many cases a re-installation of the whole OS fixes the issue and the systems run again fine with the same disks.
This is what baffles me. Of course, the hardware manufacturer also uses this fact as an excuse to claim that the error must be software-related and not related to their hardware.

ewaller wrote:

What do the power supplies look like?

The power supplies are built into the systems and are not accessible from the access cover for the main board.
I'll have to ask the hardware manufacturer for details.

ewaller wrote:

Any signs other systems are spontaneously rebooting?

None that I know of. The systems run fine 24/7 without reboots.
The only reboots are done on the more or less regular "patch days".

ewaller wrote:

Any evidence of power failures across the network?

Unfortunately it is hard to come by this kind of information.
The systems are spread all over the country, and power grid failures are fairly rare here in Germany and mostly a local occurrence.
I know that we had down times in the past due to power failures in at least two cities, but none of them resulted in defects like this.
The systems just booted up without any issues once power was restored.

seth wrote:

For mSATAs, the demon is usually the securing screw/s

I will definitely check this out.
I expect the system to be brought in within the upcoming week.


#10 2022-08-10 16:44:01

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

So. After analyzing a faulty system this week, I found out, by just swapping around SSDs, that the SSD of the faulty system is defective. It produces the same errors when plugged into another system and the faulty system runs fine with another SSD.


#11 2022-08-10 17:17:24

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 18,641

Re: Spontaneous ro root file system on some machines

Glad you got it sorted, but I thought you had said 'some' machines?  That led me to look for something more systemic.


#12 2022-08-10 22:16:19

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

Yes there are more, but I doubt I will get to have a look at them, because, well, costs.


#13 Yesterday 15:14:11

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

I have to amend this: the topic is not yet solved.
We decided to test the disk further.
On the very same system, Windows 10 runs seamlessly.
Chkdsk does not detect any errors with the disk during an extended check (chkdsk /R).
Also the ADATA SSD ToolBox finds nothing wrong with the drive in either the short or extended test.
I am starting to consider a kernel bug.


#14 Yesterday 15:26:27

seth
Member
Registered: 2012-09-03
Posts: 30,881

Re: Spontaneous ro root file system on some machines

seth wrote:

Also keep in mind any power management daemons - and aspm might spoil you if it's the bus.

Tried "pcie_aspm=off" ?
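
A small sketch for checking whether the parameter is actually active on a given box. The helper name has_kernel_param is made up, and the sysfs policy file is only present when the kernel was built with PCIe ASPM support.

```shell
# Hypothetical helper: succeed if PARAM appears as a whole word in a kernel
# command line string (defaults to the command line of the running kernel).
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline 2>/dev/null)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_kernel_param pcie_aspm=off; then
    echo "pcie_aspm=off is active"
else
    echo "pcie_aspm=off is not set"
fi

# Current ASPM policy, if exposed (the active one is shown in brackets).
cat /sys/module/pcie_aspm/parameters/policy 2>/dev/null || true
```

The parameter itself has to be added to the kernel command line in the boot loader configuration and only takes effect after a reboot.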

#15 Yesterday 16:09:42

schard
Member
From: Hannover
Registered: 2016-05-06
Posts: 1,406

Re: Spontaneous ro root file system on some machines

Not yet, thanks for the reminder. Will try on Monday.

