[RESOLVED] `shutdown now` frequently causes journal/fsck errors

swolebro · 2022-10-24 22:48:29

Hey guys, I'm having some problems with my first installation of Arch Linux on an x86_64 system, and I'm not sure where they're originating. I've used Arch lightly before, but only on ARM SoCs, which haven't required much management. My other machines run Gentoo, so they work... differently.

I did the install a couple days ago, using the current ISO (2022.10.01), following the wiki, and keeping it vanilla. It has one SSD with a GPT and two partitions: BIOS boot and ext4 /. No swap, no LVM, no LUKS, no desktop environment (headless 2U server). The problem is, 75% of the time I boot the machine, after the grub selection screen, I have journal errors that require me to run fsck from the local terminal. Not only is this annoying for a headless machine, but it feels like I'm playing Russian roulette waiting for my filesystem to be irreparably corrupted.

Transcribing the error here... (please pardon any minor typos)

Starting version 251.6-2-arch
/dev/sda2: Superblock needs_recovery flag is clear, but journal has data.
/dev/sda2: Run journal anyway

/dev/sda2: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY
          (i.e., without -a or -p options)
ERROR Bailing out. Run `fsck /dev/sda2` manually
********** FILESYSTEM CHECK FAILED **********
*  Please run fsck manually. After leaving  *
*  this maintenance shell, the system will  *
*  reboot automatically.                    *
*********************************************

And then I get a shell, run fsck, have it find a handful of errors, exit, reboot, and Arch starts up properly. I've noticed that if I write a file and shutdown (eg. echo test > ~/test.txt && sudo shutdown -r now) that file is very likely to be missing upon reboot. I'm only ever shutting down using shutdown, as that is the command I have used for many years on every distro (Gentoo, Mint, Debian, RHEL,...). I do understand that on Arch, this is really a symlink to systemctl, but I am surprised that it is letting the shutdown happen uncleanly.
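
A minimal sketch of making that write-then-shutdown test repeatable, assuming a POSIX-style shell. The function names and the marker path are hypothetical. It forces an explicit `sync` before the shutdown, so if the marker is still missing after a reboot, the loss most likely happened below the kernel's page cache.

```shell
# Hypothetical helpers for a repeatable write-then-shutdown test.

# Write a token to a marker file and ask the kernel to flush dirty pages.
write_marker() {
    local file="$1" token="$2"
    printf '%s\n' "$token" > "$file"
    sync    # flush cached writes to the block device before shutting down
}

# After the reboot: succeed only if the marker exists and holds the token.
check_marker() {
    local file="$1" token="$2"
    [ -f "$file" ] && [ "$(cat "$file")" = "$token" ]
}
```

Intended use would be something like `write_marker ~/marker.txt run-1 && sudo shutdown -r now`, then `check_marker ~/marker.txt run-1 && echo survived || echo lost` after the machine comes back.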

Looking at the journal, on a reboot where I don't have any fsck errors, I see this:

swolebro@r815$ sudo journalctl -b -3 -n 15
Oct 24 16:43:32 r815 audit: BPF prog-id=0 op=UNLOAD
Oct 24 16:43:32 r815 systemd[1]: Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Oct 24 16:43:32 r815 systemd[1]: Reached target System Shutdown.
Oct 24 16:43:32 r815 systemd[1]: Reached target Late Shutdown Services.
Oct 24 16:43:32 r815 systemd[1]: systemd-poweroff.service: Deactivated successfully.
Oct 24 16:43:32 r815 systemd[1]: Finished System Power Off.
Oct 24 16:43:32 r815 systemd[1]: Reached target System Power Off.
Oct 24 16:43:32 r815 systemd[1]: Shutting down.
Oct 24 16:43:32 r815 audit: BPF prog-id=0 op=UNLOAD
Oct 24 16:43:32 r815 audit: BPF prog-id=0 op=UNLOAD
Oct 24 16:43:32 r815 audit: BPF prog-id=0 op=UNLOAD
Oct 24 16:43:32 r815 systemd-shutdown[1]: Syncing filesystems and block devices.
Oct 24 16:43:33 r815 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Oct 24 16:43:33 r815 systemd-journald[740]: Received SIGTERM from PID 1 (systemd-shutdow).
Oct 24 16:43:33 r815 systemd-journald[740]: Journal stopped

On a reboot where I have errors, the prior boot's final journal lines are arbitrary, benign log messages, entirely unrelated to the shutdown process.
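
The difference between the two cases can be checked mechanically. This is only a sketch: the function name is made up, and the two message strings are copied from the clean-shutdown excerpt above, so they may not match other systemd versions.

```shell
# Read journal text on stdin and succeed only if it contains the two
# messages that closed the clean shutdown shown above.
journal_tail_is_clean() {
    local text
    text="$(cat)"
    printf '%s\n' "$text" | grep -q 'systemd-shutdown\[1\]: Syncing filesystems and block devices' &&
        printf '%s\n' "$text" | grep -q 'systemd-journald\[[0-9]*\]: Journal stopped'
}
```

For example: `sudo journalctl -b -1 -n 20 | journal_tail_is_clean && echo clean || echo "unclean or truncated"`.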

I have tried running the latest of every package (pacman -Syu earlier today, with linux 6.0.2.arch1-1 and systemd 251.6-2) as well as setting my mirrorlist to ALA 2022.10.01 (pacman -Syyuu for linux kernel 5.19, systemd 251.4-1) and both have the same problem. The mkinitcpio.conf hooks include fsck, and the fstab entry for the root partition's UUID has an fs_passno of 1. (I'm happy to have both of these things enabled, since at least I am now aware of the unclean shutdown problems.) I do not believe the problem is the SSD, as it was operating normally in another machine prior to this.
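
For anyone wanting to confirm the same two settings on their own install, here is a rough sketch. The function names are hypothetical; the default paths are the standard `/etc/fstab` and `/etc/mkinitcpio.conf`.

```shell
# Print the fs_passno (6th field) of the fstab entry mounted at /.
# Comment lines are skipped; a missing 6th field is reported as 0.
root_passno() {
    awk '$1 !~ /^#/ && $2 == "/" { print (NF >= 6 ? $6 : 0); exit }' "${1:-/etc/fstab}"
}

# Succeed if the active HOOKS line in mkinitcpio.conf lists the fsck hook.
hooks_have_fsck() {
    grep '^HOOKS=' "${1:-/etc/mkinitcpio.conf}" | grep -qw 'fsck'
}
```

A `root_passno` of 1 together with a successful `hooks_have_fsck` corresponds to the setup described here, where the root filesystem is checked from the initramfs on every boot.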

My Google-fu may be weak, but I have not been able to find any tickets related to this problem either on the Arch bug tracker or the systemd GitHub issues. Anybody got any pointers? Y'all the experts here.

Last edited by swolebro (2023-03-13 21:13:59)

cfr · 2022-10-25 00:48:05

smartctl -x /dev/<specify-disk>
lsblk 
df -h

Post fstab and the complete hooks line for mkinitcpio.conf.

Please also post the journal from a problematic shutdown.

Last edited by cfr (2022-10-25 00:51:04)

ewaller · 2022-10-25 02:32:32

I am wondering if systemd shutdown would ensure the journal was written prior to pulling the plug.

swolebro · 2022-10-25 05:26:30

swolebro@r815$ sudo smartctl -x /dev/sda -d megaraid,0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.0.2-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     OCZ-VERTEX2
Serial Number:    OCZ-43G3AD561I20NDH4
LU WWN Device Id: 5 e83a97 fb27ed20e
Firmware Version: 1.29
User Capacity:    115,033,153,536 bytes [115 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available, deterministic
Device is:        In smartctl database 7.3/5319
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Oct 24 23:54:21 2022 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Write SCT (Get) Feature Control Command failed: ATA return descriptor not supported by controller firmware
Wt Cache Reorder: Unknown (SCT Feature Control command failed)

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7f) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Abort Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  48) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR--   120   120   050    -    0/0
  5 Retired_Block_Count     PO--CK   100   100   003    -    0
  9 Power_On_Hours_and_Msec -O--CK   100   100   000    -    23539h+56m+59.250s
 12 Power_Cycle_Count       -O--CK   100   100   000    -    327
171 Program_Fail_Count      -O--CK   000   000   000    -    0
172 Erase_Fail_Count        -O--CK   000   000   000    -    0
174 Unexpect_Power_Loss_Ct  ----CK   000   000   000    -    114
177 Wear_Range_Delta        ------   000   000   000    -    1
181 Program_Fail_Count      -O--CK   000   000   000    -    0
182 Erase_Fail_Count        -O--CK   000   000   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   030   129   000    -    30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  --SRC-   120   120   000    -    0/0
196 Reallocated_Event_Count PO--CK   100   100   000    -    0
231 SSD_Life_Left           PO--C-   100   100   010    -    0
233 SandForce_Internal      ------   000   000   000    -    6464
234 SandForce_Internal      -O--CK   000   000   000    -    3712
241 Lifetime_Writes_GiB     -O--CK   000   000   000    -    3712
242 Lifetime_Reads_GiB      -O--CK   000   000   000    -    12096
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL,SL  R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xb7       GPL,SL  VS      16  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported

SMART Error Log not supported

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       0 (0x0000)
Device State:                        Active (0)
Current Temperature:                     ? Celsius
Power Cycle Min/Max Temperature:     127/-127 Celsius
Lifetime    Min/Max Temperature:     127/-127 Celsius
Under/Over Temperature Limit Count:   0/0

Write SCT Data Table failed: SMART WRITE LOG SECTOR may cause problems, try with -T permissive to force
Read SCT Temperature History failed

Write SCT (Get) Error Recovery Control Command failed: ATA return descriptor not supported by controller firmware
SCT (Get) Error Recovery Control command failed

Device Statistics (GP/SMART Log 0x04) not supported

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET

swolebro@r815$ echo $?
4

Note that the return code was nonzero, and I needed to pass an option for it to address the first (and, for now, the only active) slot in the PERC H700 hardware RAID controller.
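
On that nonzero return code: smartctl's exit status is a bitmask, so it can be decoded. The sketch below paraphrases the bit meanings from the smartctl man page; the function name is made up.

```shell
# Decode a smartctl exit status (a bitmask) into one line per set bit.
decode_smartctl_status() {
    local status="$1" bit=0 msg
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART or other ATA command failed, or checksum error" \
        "SMART status: DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
}
```

`decode_smartctl_status 4` prints `bit 2: a SMART or other ATA command failed, or checksum error`, which would fit the several "ATA return descriptor not supported by controller firmware" lines in the output above rather than a failing-disk verdict.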

swolebro@r815$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 106.6G  0 disk
├─sda1   8:1    0     2G  0 part
└─sda2   8:2    0 104.6G  0 part /
sr0     11:0    1  1024M  0 rom

swolebro@r815$ df -h
Filesystem      Size  Used Avail Use% Mounted on
dev             126G     0  126G   0% /dev
run             126G  9.5M  126G   1% /run
/dev/sda2       103G  6.0G   92G   7% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           126G     0  126G   0% /tmp
tmpfs            26G  4.0K   26G   1% /run/user/1000

Yes, those tmpfs's are huge - just the defaults. This machine has a lot of RAM.

swolebro@r815$ cat /etc/fstab
# <file system> <dir> <type> <options> <dump> <pass>
# 120GB OCZ SSD root partition, unencrypted
UUID=85b584ae-1f2a-4552-b9c3-dacf9cbdb4b1       /               ext4            rw,relatime     0 1

swolebro@r815$ cat /etc/mkinitcpio.conf | grep "^HOOKS="
HOOKS=(base udev autodetect modconf block lvm2 filesystems keyboard fsck)

I added the lvm2 hook in there (beyond the default settings) since I intend to add additional fstab/automount disks later, some of which have LV's. As lsblk above shows, the current disk doesn't have them.

As for the journalctl log, I just found a fun scenario. On my last boot, I took a look at the active log before restarting and scp'd it back to my laptop. For the sake of brevity, here's an excerpt of the head and tail.

Oct 25 00:05:12 r815 kernel: Linux version 6.0.2-arch1-1 (linux@archlinux) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Sat, 15 Oct 2022 14:00:49 +0000
Oct 25 00:05:12 r815 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=85b584ae-1f2a-4552-b9c3-dacf9cbdb4b1 rw loglevel=3 quiet
Oct 25 00:05:12 r815 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Oct 25 00:05:12 r815 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Oct 25 00:05:12 r815 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Oct 25 00:05:12 r815 kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Oct 25 00:05:12 r815 kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Oct 25 00:05:12 r815 kernel: signal: max sigframe size: 1776
Oct 25 00:05:12 r815 kernel: BIOS-provided physical RAM map:
Oct 25 00:05:12 r815 kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable

[... snip ...]

Oct 25 00:10:20 r815 dbus-daemon[986]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.home1.service': Unit dbus-org.freedesktop.home1.service not found.
Oct 25 00:10:20 r815 sudo[1190]: pam_systemd_home(sudo:account): systemd-homed is not available: Unit dbus-org.freedesktop.home1.service not found.
Oct 25 00:10:20 r815 audit[1190]: USER_ACCT pid=1190 uid=1000 auid=1000 ses=1 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="swolebro" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Oct 25 00:10:20 r815 sudo[1190]:      swolebro : TTY=pts/0 ; PWD=/home/swolebro ; USER=root ; COMMAND=/usr/bin/journalctl -b 0
Oct 25 00:10:20 r815 audit[1190]: CRED_REFR pid=1190 uid=1000 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_env,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Oct 25 00:10:20 r815 sudo[1190]: pam_unix(sudo:session): session opened for user root(uid=0) by swolebro(uid=1000)
Oct 25 00:10:20 r815 audit[1190]: USER_START pid=1190 uid=1000 auid=1000 ses=1 msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Oct 25 00:10:20 r815 kernel: audit: type=1101 audit(1666671020.908:58): pid=1190 uid=1000 auid=1000 ses=1 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="swolebro" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Oct 25 00:10:20 r815 kernel: audit: type=1110 audit(1666671020.908:59): pid=1190 uid=1000 auid=1000 ses=1 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_env,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'
Oct 25 00:10:20 r815 kernel: audit: type=1105 audit(1666671020.908:60): pid=1190 uid=1000 auid=1000 ses=1 msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success'

I can post the full thing if desired. The interesting part is that I was able to reboot without the fsck interruption, but there are clearly files missing, as well as the journalctl log! Here are the last couple of commands on that terminal, where I was ssh'd to the Arch server, with some comments of mine added.

swolebro@r815$ sudo journalctl -b 0 > journal-active.txt # scp'd this to my laptop in another terminal
swolebro@r815$ ls # check current files in home
journal-active.txt  test2.txt  test3.txt  test4.txt
swolebro@r815$ rm *.txt; echo test > test-again.txt; sudo shutdown now
Shutdown scheduled for Tue 2022-10-25 00:11:01 EDT, use 'shutdown -c' to cancel.
swolebro@r815$ Shared connection to r815 closed.
swolebro@laptop$ boot-server # wake-on-lan wrapper script, booted without fsck
swolebro@laptop$ ssh r815 # note that the last login time is correct
Last login: Tue Oct 25 00:05:25 2022 from 192.168.1.171
swolebro@r815$ ls # files not showing the rm/echo from above
journal-active.txt  test2.txt  test3.txt  test4.txt
swolebro@r815$ sudo journalctl --list-boots | tail -n 5  # not showing the prior boot at all!
 -4 3b2016029da74ef7855cc0123e8c2fa6 Mon 2022-10-24 16:25:52 EDT Mon 2022-10-24 16:26:05 EDT
 -3 23e140365d334ea38e275009d8cb0d45 Mon 2022-10-24 16:33:03 EDT Mon 2022-10-24 16:43:33 EDT
 -2 c7c47467f2d844538b9e7fed2c1bf4e9 Mon 2022-10-24 16:46:36 EDT Mon 2022-10-24 16:47:01 EDT
 -1 81c330271d204d8f8f4c785c460e9a36 Mon 2022-10-24 16:50:17 EDT Mon 2022-10-24 16:51:44 EDT
  0 f805939a2ce044a7ae1e286289584f74 Tue 2022-10-25 00:14:02 EDT Tue 2022-10-25 00:25:37 EDT

So the log from that boot is totally missing, as is the boot immediately before it (from about 11:45 PM EDT). It goes straight from the currently running log back to when I was messing around earlier this afternoon before posting this thread. That makes it difficult to post the very last messages from an errant shutdown, hahah.

For the record, I have also managed to get this behavior using `sudo systemctl poweroff`. I have yet to try `systemd shutdown` (I didn't know it was a thing), but I'd prefer to have `shutdown now` work correctly, as I can guarantee that old habits will make me run it on this machine sooner or later, and it's currently a big risk of filesystem corruption.

seth · 2022-10-25 06:31:38

PERC H700 hardware RAID

What if we remove that thing from the equation?
Because it has a cache and a battery, and if the battery doesn't work, fsync syncs into the RAID cache, then the power is gone w/ the shutdown and the RAID cache gets synced into nirvana…

swolebro · 2022-10-25 16:51:40

Got some more testing done this morning.

I don't have a way of testing this server (Dell R815) without the RAID controller, at least not without buying other hardware, since the motherboard has no SATA ports of its own. I did try moving the SSD to a desktop with a SATA interface, and replicated the issue there. This includes more instances where files were missing (even when booting did not require an fsck), or where journalctl dropped entries from --list-boots, or had the boots, but lost the log data. I ended up having to rotate/vacuum the logs in order to clear them and get that working again.

That's about what I expected, since I previously had this server happily running CentOS as a libvirt/VM host, albeit with a different disk (a single WD Blue HDD), and think it would have been obvious if it were corrupting my VM images.

Moving my attention back to the SSD itself, that was recently used for hosting some git and git-annex repos, both of which involve lots of hashes, and neither git-fsck nor git-annex-fsck ever complained at me. To further check the disk, I ran a long SMART test (sudo smartctl -t long -d megaraid,0 /dev/sda) and 48 minutes later, this was the result:

swolebro@r815$ sudo smartctl -A -d megaraid,0 /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.0.2-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   102   050    Pre-fail  Always       -       0/230256202
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       23544h+01m+18.740s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       351
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       120
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       1
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   129   000    Old_age   Always       -       30 (Min/Max 30/30)
195 ECC_Uncorr_Error_Count  0x001c   119   102   000    Old_age   Offline      -       0/230256202
196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       6464
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       3712
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       3712
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       12224

Given the high scores for Raw_Read_Error_Rate and SSD_Life_Left, the fact that Power_On_Hours_and_Msec is only ~3yrs, and that the Lifetime_Writes_GiB is only ~30x the drive size, I think it should be fine.
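
Those two estimates can be reproduced from the raw values in the tables above. This is a small sketch with made-up function names, assuming `awk` is available: it converts the power-on hours to years and divides lifetime writes by the capacity smartctl reports.

```shell
# Convert power-on hours to years, using 365.25-day years.
hours_to_years() {
    awk -v h="$1" 'BEGIN { printf "%.1f\n", h / (24 * 365.25) }'
}

# Lifetime writes (GiB) divided by drive capacity (bytes): full drive writes.
drive_writes() {
    awk -v gib="$1" -v bytes="$2" 'BEGIN { printf "%.1f\n", gib * 1073741824 / bytes }'
}
```

`hours_to_years 23544` gives 2.7 and `drive_writes 3712 115033153536` gives 34.6, so "~3yrs" and "~30x" are both in the right ballpark.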

I guess my next steps will be to slap Linux Mint on it and see if it's still barfing. If it's not, I'll try Arch again (in case I botched the install somehow). If it continues to barf, I'll experiment with another drive or a new RAID controller battery (~$20)... just in case.

Last edited by swolebro (2022-10-25 16:53:57)

cfr · 2022-10-25 18:50:19

That output doesn't show the results of the test or that the test was run.

swolebro · 2022-10-25 19:59:22

It does show a significant increase in the denominator of the Raw_Read_Error_Rate, from 0/0 when I first posted the smartctl output to 0/230256202 now. Unfortunately, this disk does not appear to support the selftest log.

swolebro@r815$ sudo smartctl -l selftest -d megaraid,0 /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.0.2-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test Log not supported

The -H does say it passes.

swolebro@r815$ sudo smartctl -H -d megaraid,0 /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.0.2-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
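
Comparing raw values between two smartctl snapshots, as done above for Raw_Read_Error_Rate, can be scripted. This is a sketch only: the function name is hypothetical, and it assumes the raw value is the 10th column in the default `-A` layout and the 8th in the brief layout printed by `-x`.

```shell
# Read smartctl attribute output on stdin and print the raw value of the
# first attribute whose name matches. Handles the default -A layout
# (FLAG is a hex number) and the brief layout used by -x.
attr_raw() {
    awk -v name="$1" '$2 == name { print ($3 ~ /^0x/ ? $10 : $8); exit }'
}
```

Applied to the two tables posted in this thread, Unexpect_Power_Loss_Ct goes from 114 to 120 while Power_Cycle_Count goes from 327 to 351. Raw values on this drive are vendor-defined, so that may or may not mean anything.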

d.ALT · 2022-10-25 20:15:33

swolebro wrote:

I did try moving the SSD to a desktop with a SATA interface, and replicated the issue there.

If you want / can, try doing some tests with Seagate SeaChest; you'll find it on the AUR.
(It helped me a lot - in the past - with Hitachi and WesternDigital mechanical drives)

WARNING!!!
You're going to run Data Destructive Commands!
WARNING!!!

# SeaChest_SMART -d YOURSSD --captive --dstAndClean

--dstAndClean
Runs DST, then checks for an error and repairs the
error if possible. This continues until all errors
reported by DST are fixed, or when the error limit is
reached. The default limit is 50 errors.

or

# SeaChest_GenericTests -d YOURSSD --longGeneric --repairAtEnd

--longGeneric
This option will run a long generic read test on a
specified device. A long generic read test reads every
LBA on the device and gives a report of error LBAs at
the end of the test, or when the error limit has been
reached. Using the --stopOnError option will make this
test stop on the first read error that occurs.
The default error limit is 50 x number of logical
sectors per physical sector. Example error limits
are as follows:
512L/512P: error limit = 50
4096L/4096P: error limit = 50
512L/4096P: error limit = 400 (50 * 8)

Last edited by d.ALT (2022-10-25 20:15:50)

seth · 2022-10-25 20:45:23

I googled for "OCZ-VERTEX2" and got https://stephane.lesimple.fr/blog/how-o … ata-twice/ (though there're probably horror stories for every drive model)
For a basic integrity test:

LC_ALL=C pacman -Qkk | grep -v ', 0 altered files'

cfr · 2022-10-26 01:16:28

swolebro wrote:

Unfortunately, this disk does not appear to support the selftest log.

That's not what your earlier output showed.

swolebro wrote:

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

But you've now run the test, so it should (hopefully) show you something.

swolebro · 2022-10-26 22:38:44

cfr wrote:

swolebro wrote:
Unfortunately, this disk does not appear to support the selftest log.
That's not what your earlier output showed.
swolebro wrote:
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
But you've now run the test, so it should (hopefully) show you something.

Unless I'm misunderstanding something, you run a test with `-t short` or `-t long`, wait the specified amount of time, then request it with `-l selftest`. I did run both a short and a long test prior to making post #6, and as shown in post #8, when requesting it, it said it doesn't support logging the tests. I'm not sure what other output I would see if the feature was supported.

seth wrote:

I googled for "OCZ-VERTEX2" and got https://stephane.lesimple.fr/blog/how-o … ata-twice/ (though there're probably horror stories for every drive model)
For a basic integrity test:
LC_ALL=C pacman -Qkk | grep -v ', 0 altered files'

Surprisingly, it looks like the only files edited are ones that I touched in /etc. Huh. Good thing to check, though.

I have some sudden, unplanned travel I need to do that's going to take me away from this for 1-4 weeks. Normally I'd just VPN to my home network and WOL the server to keep working, but given the nature of the problem, that's not really an option. Thank you guys for your attention and input so far. I'll report back once I've gotten back to this and done some more experimentation. My apologies for going AWOL, it's not anything I can control.

cfr · 2022-10-26 23:42:42

swolebro wrote:

Unless I'm misunderstanding something, you run a test with `-t short` or `-t long`, wait the specified amount of time, then request it with `-l selftest`. I did run both a short and a long test prior to making post #6, and as shown in post #8, when requesting it, it said it doesn't support logging the tests. I'm not sure what other output I would see if the feature was supported.

Try -a if that's what showed the selftest log information before. See if it still claims no test has been run or if there's something there.
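
A possible explanation for the mismatch, if I'm reading the smartctl man page correctly: `-x` requests `-l xselftest,selftest`, whereas `-a` only requests `-l selftest`. The log directory posted earlier lists 0x07 (Extended self-test log) but not the plain SMART self-test log, so `-l xselftest` may be the option that returns results here. The helper below is hypothetical and only prints the command sequence as a dry run; it does not touch the disk.

```shell
# Print (do not run) the smartctl commands for starting a long self-test
# and then reading back both flavours of the self-test log.
selftest_commands() {
    local dev="$1" dtype="$2"
    echo "smartctl -t long -d $dtype $dev"
    echo "smartctl -l xselftest -d $dtype $dev"
    echo "smartctl -l selftest -d $dtype $dev"
}
```

`selftest_commands /dev/sda megaraid,0` prints the three commands for the setup in this thread; the second one is the one to try after waiting out the recommended polling time.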

swolebro · 2023-03-13 17:52:33

As an overdue follow-up to this post, it appears the problem is specifically with the SSD I was using. With fresh installs of both Arch and Ubuntu Server, working on both the 2U rack server (with the PERC H700 RAID) and on desktop (basic SATA), I was able to frequently replicate filesystem corruption by writing a file immediately prior to powering down. A different SSD did not demonstrate the problem in the same configurations. It's a little surprising (and anticlimactic), given the low lifetime writes and power-on time for the drive, but eh, hardware can be weird like that. I've got a functioning system now and can get back to work.

My apologies for the long delay - I was preoccupied with family matters. Thank you to everyone who took the time to help.

seth · 2023-03-13 20:36:52

\o/

The disk will probably cache the writes (that's normal) and fail to flush the cache before losing power (that's not).
Some devices come w/ an internal battery to allow them to sync the cache after power loss, and if yours does, the battery might be dysfunctional.
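
To see whether the kernel treats a disk as having a volatile write cache, the block layer exposes `queue/write_cache` in sysfs, which reads either `write back` or `write through`. The sketch below uses made-up function names and takes an optional sysfs root as a second argument. Behind a RAID controller the value may describe the controller rather than the drive.

```shell
# Print the kernel's view of a block device's cache mode.
# The second argument overrides the sysfs root (default /sys/block).
write_cache_mode() {
    local dev="$1" sysfs_root="${2:-/sys/block}"
    cat "$sysfs_root/$dev/queue/write_cache"
}

# Succeed if the device is reported as having a volatile (write-back) cache.
volatile_cache_enabled() {
    [ "$(write_cache_mode "$@")" = "write back" ]
}
```

On a plain SATA port, `hdparm -W 0 /dev/sdX` is the usual way to switch the drive's write cache off for a comparison run; whether this particular drive honours it is an open question.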

Please always remember to mark resolved threads by editing your initial post's subject - so others will know that there's no task left, but maybe a solution to find.
Thanks.

swolebro · 2023-03-13 21:16:20

I suspect an internal battery as well. But again, eh, hardware is weird.

I've updated the thread name to mark it as resolved. (I don't see a button or a checkbox for it anywhere, so I assume editing the name is the proper method.)
