
#1 2024-05-10 08:55:30

lepokle
Member
Registered: 2013-04-14
Posts: 10

Several problems after disk failure: systemd units do not start, ...

Hello,

I'm using a setup based on an encrypted zfs partition. This worked many years until the zpool gets somehow corrupted. I've added an additional disk, created a new zpool and copied over the original pool as much as possible. Afterwards I've also recreated my UEFI partition on the new disk, formatted the old one and added the old one as additional vdev to the new pool (smartctl did not report any errors for the old disk). I've also reinstalled all packages.

Since then I'm observing the following issues:

  • The NetworkManager and cups units do not start. They are shown as dead, but starting them manually is no problem:

    ○ cups.service - CUPS Scheduler
         Loaded: loaded (/usr/lib/systemd/system/cups.service; enabled; preset: disabled)
         Active: inactive (dead)
    TriggeredBy: ○ cups.socket
           Docs: man:cupsd(8)
  • /boot is not mounted. I've already found a post stating that the partition was dirty. I've fixed it with dosfsck, but that did not help. Now even the unit boot.mount does not exist anymore.

  • The only failing unit is tpm2-abrmd.service. I've found out that the permissions of /dev/tpm0 are not correct:

    crw-rw---- 1 963 root 10, 224 May 10 10:36 /dev/tpm0

    My tss user has uid 974 and my udev rules are in place. After calling

    udevadm trigger

    the permissions are set correctly, so it seems that even udev is not running as it should.
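For what it's worth, the comparison above can be scripted. This is a small sketch (it assumes a GNU `stat` and that the user in question exists) that checks whether a device node's numeric owner matches a given account:

```shell
# Sketch: compare the numeric owner of a device node against a user's
# uid. Succeeds (exit 0) when they match.
owner_matches() {
  # $1 = path to the device node, $2 = user name
  [ "$(stat -c '%u' "$1")" = "$(id -u "$2")" ]
}

# Example for the case above (hardware-specific, output will vary):
# owner_matches /dev/tpm0 tss && echo "udev rule applied" || echo "stale ownership"
```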

I've looked at

journalctl -xw

and tried different systemctl commands. I've scrolled through

dmesg

but I'm not getting what is going on here.

Any diagnostic help or ideas would be appreciated. Thank you!


#2 2024-05-10 14:21:35

seth
Member
Registered: 2012-09-03
Posts: 52,289

Re: Several problems after disk failure: systemd units do not start, ...

TriggeredBy: ○ cups.socket

The service starts when you try to use it.

Now even the unit boot.mount does not exist anymore.

fstab/lsblk -f ?

my udev rules are in place

Try changing the rule to something more obvious, e.g. RUN+="touch /tmp/tpm.udev", and see whether that file is created (i.e. is the rule not applied, or does the rule use the wrong UID?)
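For reference, such a marker rule could look like this (the file name and path are only illustrative; any rules.d directory works):

```
# /etc/udev/rules.d/99-tpm-debug.rules (illustrative)
KERNEL=="tpm[0-9]*", RUN+="/usr/bin/touch /tmp/tpm.udev"
```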


#3 2024-05-10 14:27:40

lepokle
Member
Registered: 2013-04-14
Posts: 10

Re: Several problems after disk failure: systemd units do not start, ...

Hi,

unfortunately the service no longer starts when I want to print. I have to run

systemctl start cups

first (this was not necessary before).

Here is the output:

leo@lepobookng ~ % lsblk -f
NAME        FSTYPE     FSVER LABEL            UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
nvme1n1                                                                                           
├─nvme1n1p1                                                                                       
├─nvme1n1p2 zfs_member 5000  root_pool_2m4iog 9068665034321893512                                 
└─nvme1n1p3                                                                                       
nvme0n1                                                                                           
├─nvme0n1p1 vfat       FAT32                  91C7-E83D                               3.9G     2% /boot
├─nvme0n1p2 zfs_member 5000  root_pool_2m4iog 9068665034321893512                                 
└─nvme0n1p3                                                                                       
leo@lepobookng ~ % cat /etc/fstab
# Static information about the filesystems.
# See fstab(5) for details.

# <file system> <dir> <type> <options> <dump> <pass>

# ESP/BOOT partition
UUID=91C7-E83D					/boot		vfat   	rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro	0 2

# SWAP space
/dev/mapper/swap				none		swap	defaults												0 0

Running

mount /boot

after boot works without any errors.
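(Editor's note on the symptom: systemd turns /etc/fstab lines into .mount units at boot via systemd-fstab-generator, so a missing boot.mount means the generator never saw the entry. A small sketch of how one might check the generator output — the directory below is systemd's standard runtime generator directory:)

```shell
# Sketch: check whether a generated unit exists in a generator directory.
generated_unit_exists() {
  # $1 = generator directory, $2 = unit name
  [ -e "$1/$2" ]
}

# Example (system-specific, output will vary):
# generated_unit_exists /run/systemd/generator boot.mount \
#   || echo "boot.mount was not generated - fstab not parsed at boot"
```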

The rule contains the right UID:

leo@lepobookng ~ % cat /usr/lib/udev/rules.d/60-tpm-udev.rules
# tpm devices can only be accessed by the tss user but the tss
# group members can access tpmrm devices
KERNEL=="tpm[0-9]*", TAG+="systemd", MODE="0660", OWNER="tss"
KERNEL=="tpmrm[0-9]*", TAG+="systemd", MODE="0660", GROUP="tss"

It seems that some basic things are not run on startup.


#4 2024-05-10 14:32:31

seth
Member
Registered: 2012-09-03
Posts: 52,289

Re: Several problems after disk failure: systemd units do not start, ...

That's not a UID but a user/group name, but since fstab is apparently not parsed (or does the swap activate?)…

Please post your complete system journal for the boot:

sudo journalctl -b | curl -F 'file=@-' 0x0.st


#5 2024-05-10 15:37:07

lepokle
Member
Registered: 2013-04-14
Posts: 10

Re: Several problems after disk failure: systemd units do not start, ...

No, swap is not activated either. I've set up encrypted swap, but it isn't loaded.

root@lepobookng ~ # cat /etc/crypttab 
# Configuration for encrypted block devices.
# See crypttab(5) for details.

# NOTE: Do not list your root (/) partition here, it must be set up
#       beforehand by the initramfs (/etc/mkinitcpio.conf).

# <name>       <device>                                     <password>              <options>
# home         UUID=b8ad5c18-f445-495d-9095-c9ec4f9d2f37    /etc/mypassword1
# data1        /dev/sda3                                    /etc/mypassword2
# data2        /dev/sda5                                    /etc/cryptfs.key
# swap         /dev/sdx4                                    /dev/urandom            swap,cipher=aes-cbc-essiv:sha256,size=256
# vol          /dev/sdb7                                    none

swap /dev/disk/by-id/nvme-SAMSUNG_MZVL22T0HBLB-00B00_S677NF0W500229-part3 /dev/urandom swap,cipher=aes-cbc-essiv:sha256,size=256

Here is the log:
http://0x0.st/X8Ym.txt


#6 2024-05-11 16:12:28

seth
Member
Registered: 2012-09-03
Posts: 52,289

Re: Several problems after disk failure: systemd units do not start, ...

It looks like zed runs very late and you might end up looking at a different FS than the starting system?

What does the system look like if you're only booting the rescue.target?


#7 2024-05-15 04:55:06

lepokle
Member
Registered: 2013-04-14
Posts: 10

Re: Several problems after disk failure: systemd units do not start, ...

Thanks for the tip. I had the following findings after booting into rescue.target:

  • systemd had some initial setup questions -> answered

  • owner/group on /dev/tpm* are correct

  • /etc/fstab is empty

  • /etc/crypttab is empty

  • journalctl -b is available at http://0x0.st/XKD4.txt

  • no failed systemd units

  • /boot not mounted (that should be expected since /etc/fstab was empty, correct?)

After switching to graphical.target I get the old symptoms:

  • no /boot

  • wrong ownership of /dev/tpm* !?!

  • some units are not starting

  • ...

I've rechecked mkinitcpio.conf and merged in the settings from the .pacnew file. I've regenerated initramfs-linux.img and checked its contents:

root@lepobookng /tmp/t # lsinitcpio -x /boot/initramfs-linux.img 
root@lepobookng /tmp/t # ll
total 52
-rw-r--r-- 1 root root     4 May 15 06:47 VERSION
lrwxrwxrwx 1 root root     7 May 15 06:47 bin -> usr/bin
-rw-r--r-- 1 root root  3306 May 15 06:47 buildconfig
-rw-r--r-- 1 root root   122 May 15 06:47 config
-rw-r--r-- 1 root root 10558 May 15 06:47 consolefont.psfu
drwxr-xr-x 2 root root    40 May 15 06:47 dev
-rw-r--r-- 1 root root     2 May 15 06:47 early_cpio
drwxr-xr-x 4 root root   220 May 15 06:47 etc
drwxr-xr-x 2 root root   120 May 15 06:47 hooks
-rwxr-xr-x 1 root root  3325 May 15 06:47 init
-rw-r--r-- 1 root root 15577 May 15 06:47 init_functions
drwxr-xr-x 3 root root    60 May 15 06:47 kernel
-rw-r--r-- 1 root root  2567 May 15 06:47 keymap.bin
-rw-r--r-- 1 root root     0 May 15 06:47 keymap.utf8
lrwxrwxrwx 1 root root     7 May 15 06:47 lib -> usr/lib
lrwxrwxrwx 1 root root     7 May 15 06:47 lib64 -> usr/lib
drwxr-xr-x 2 root root    40 May 15 06:47 new_root
drwxr-xr-x 2 root root    40 May 15 06:47 proc
drwxr-xr-x 2 root root    40 May 15 06:47 run
lrwxrwxrwx 1 root root     7 May 15 06:47 sbin -> usr/bin
drwxr-xr-x 2 root root    40 May 15 06:47 sys
drwxr-xr-x 2 root root    40 May 15 06:47 tmp
drwxr-xr-x 5 root root   140 May 15 06:47 usr
drwxr-xr-x 2 root root    60 May 15 06:47 var
root@lepobookng /tmp/t # cat etc/fstab 
# Static information about the filesystems.
# See fstab(5) for details.

# <file system> <dir> <type> <options> <dump> <pass>

# ESP/BOOT partition
UUID=91C7-E83D					/boot		vfat   	rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro	0 2

# SWAP space
/dev/mapper/swap				none		swap	defaults												0 0

root@lepobookng /tmp/t # 

So fstab seems to be correctly included. I've deleted my UKI files to make sure they get regenerated:

root@lepobookng /tmp/t # ll /boot/EFI/Linux/          
total 150272
-rwxr-xr-x 1 root root 76936192 May  9 14:53 archlinux-debug.efi
-rwxr-xr-x 1 root root 76936192 May  9 14:53 archlinux-linux.efi
root@lepobookng /tmp/t # 

However, I have no idea why the date is the 9th of May and not the current one.
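(Editor's note: a stale timestamp like this can be checked mechanically. A sketch of a helper that confirms the installed UKI was rebuilt after the initramfs — the file paths in the example are the ones from this thread; adjust to your layout:)

```shell
# Sketch: succeed when $1 has a newer modification time than $2.
is_newer() {
  [ "$1" -nt "$2" ]
}

# Example (paths taken from this thread):
# is_newer /boot/EFI/Linux/archlinux-linux.efi /boot/initramfs-linux.img \
#   && echo "UKI is current" || echo "UKI is stale"
```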


I've ensured that the correct, current image is booted:

root@lepobookng /tmp/t # efibootmgr 
BootCurrent: 0001
Timeout: 0 seconds
BootOrder: 0001,001A,001B,001C,001D,001E,001F,0020,0021,0022,0023,0024
Boot0001* Arch Linux (new disk)	HD(1,GPT,8450fbdb-532a-4a6c-bd81-4bc03c18f71d,0x800,0x800800)/EFI\Linux\archlinux-linux.efi
Boot0010  Setup	FvFile(721c8b66-426c-4e86-8e99-3457c46ab0b9)
Boot0011  Boot Menu	FvFile(126a762d-5758-4fca-8531-201a7f57f850)
Boot0012  Diagnostic Splash Screen	FvFile(a7d8d9a6-6ab0-4aeb-ad9d-163e59a7a380)
Boot0013  Lenovo Diagnostics	FvFile(3f7e615b-0d45-4f80-88dc-26b234958560)
...

root@lepobookng /tmp/t # blkid 
/dev/nvme0n1p3: PARTUUID="07e27fe0-1050-435d-9107-5ab644b645fc"
/dev/nvme0n1p1: UUID="9E19-6832" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="9b096c68-7885-498c-a5e7-1f364fad8c4c"
/dev/nvme0n1p2: LABEL="root_pool_2m4iog" UUID="9068665034321893512" UUID_SUB="7445512687706962544" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="41c47b83-6df2-40d0-983f-8fc35a1eca64"
/dev/nvme1n1p2: LABEL="root_pool_2m4iog" UUID="9068665034321893512" UUID_SUB="9685926068953788758" BLOCK_SIZE="4096" TYPE="zfs_member" PARTUUID="9932d973-bfee-4bce-b423-b57f6e60e3d4"
/dev/nvme1n1p3: PARTUUID="11d82a81-50b3-41e7-ab6c-d96706038fe2"
/dev/nvme1n1p1: UUID="91C7-E83D" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="8450fbdb-532a-4a6c-bd81-4bc03c18f71d"

I've even reformatted partition 1 on the old disk (nvme1n1p1) to be sure that no image/kernel from the old partition could be loaded.

The zpool is OK as well:

root@lepobookng /tmp/t # zpool status
  pool: root_pool_2m4iog
 state: ONLINE
  scan: resilvered 676G in 00:17:36 with 0 errors on Fri Mar 29 13:06:28 2024
config:

	NAME              STATE     READ WRITE CKSUM
	root_pool_2m4iog  ONLINE       0     0     0
	 mirror-0        ONLINE       0     0     0
	   nvme1n1p2     ONLINE       0     0     0
	   nvme0n1p2     ONLINE       0     0     0

errors: No known data errors


#8 2024-05-15 20:44:55

seth
Member
Registered: 2012-09-03
Posts: 52,289

Re: Several problems after disk failure: systemd units do not start, ...

seth wrote:

It looks like zed runs very late and you might end up looking at a different FS than the starting system

I can't really explain https://wiki.archlinux.org/title/ZFS#Automatic_Start or what should™ be done here, but if you end up ignoring the fstab from the pre-zed environment, you'd have to update the /boot mountpoint *somehow* after loading the pool that changes the root FS.
I assume a similar issue will affect /dev/tpm/*, because zed will just create/mount a new devfs?

Do you have the zfs hook in your mkinitcpio.conf?
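(Editor's note: that question can be answered with a quick grep. This is a sketch, assuming the standard `HOOKS=(...)` array syntax of mkinitcpio.conf; for a ZFS root the zfs hook must appear before `filesystems`:)

```shell
# Sketch: check whether a given hook appears in the HOOKS line of a
# mkinitcpio.conf-style file.
hook_present() {
  # $1 = path to mkinitcpio.conf, $2 = hook name
  grep -Eq "^HOOKS=.*[( ]$2[ )]" "$1"
}

# Example:
# hook_present /etc/mkinitcpio.conf zfs || echo "zfs hook missing"
```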

