Hi,
I ran into a weird UEFI boot problem on hardware that is a bit exotic for me. More likely than not the problem is upstream rather than in Arch; however, I'm writing here since all the bootloaders involved were installed from Arch packages. It could be a HW issue, too.
Summary:
* HW: LSI SAS/SATA controller (Perc H310 flashed into IT mode) on a PowerEdge R820
* systemd-boot cannot boot any UKI
* rEFInd cannot boot any UKI, corrupts the ESP while trying to write its files, and loses track of the partition!
* Grub works! (yay!)
TL;DR -> Jump to "The problem" section; but read the other sections for the full picture of the situation.
Background:
I acquired this old(ish) server, a Dell PowerEdge R820, for hobbyist/amateur use. It came configured with a PERC H310 SAS/SATA RAID controller. I will not dwell on the reasons, but to stay within a reasonable budget a pair of consumer-grade SSDs (WD Green) was installed in it. The FW on the WD Greens was faulty and was upgraded (on a Windows machine). But the PERC FW or the PowerEdge FW does not like these SSDs (we also suspect the controller is too old to use TRIM or does it "behind the user's back", insists on Dell-branded or certified SSDs only, etc.). Getting the SSDs to work properly with the H/W RAID turned out to be nearly impossible: they kind of worked, and the system was bootable in the end after HW updates, but they were blinking the amber LED and there was a front-panel error stating the disks had failed, despite them seemingly working for at least small writes.
So the controller was flashed into IT mode, along with the UEFI boot ROM (according to the instructions and FW from this page). After this the SSDs work fine; they were partitioned (1 x ESP and 1 x data partition for ZFS on each). Arch was installed more or less according to the wiki https://wiki.archlinux.org/title/Instal … nux_on_ZFS , adapted so that the disks (or rather the data partitions) are in a RAIDZ configuration. I chose systemd-boot as the bootloader and installed it, along with rEFInd for rescue situations.
Problems surface when trying to boot.
Setup so far:
Partition scheme:
Model: ATA WD Green 2.5 480 (scsi)
Disk /dev/sda: 480GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size   File system  Name     Flags
1       1049kB  1000MB  999MB  fat32        EFI      boot, esp
2       1000MB  480GB   479GB               rootZFS  msftdata
Model: ATA WD Green 2.5 480 (scsi)
Disk /dev/sdb: 480GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size   File system  Name     Flags
1       1049kB  1000MB  999MB  fat32                 boot, esp
2       1000MB  480GB   479GB               rootZFS  msftdata
NOTE: The second ESP (on sdb), which is supposed to mirror the first one via sync scripts, was removed for debugging, as I believe some FW and/or bootloaders might get confused by it. The same issues persisted with only one ESP.
The controller info from lspci:
02:00.0 RAID bus controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
Subsystem: Dell SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 41
NUMA node: 0
Region 0: I/O ports at fc00 [size=256]
Region 1: Memory at dd7fc000 (64-bit, non-prefetchable) [size=16K]
Region 3: Memory at dd780000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at dc800000 [disabled] [size=512K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [d0] Vital Product Data
pcilib: sysfs_read_vpd: read failed: No such device
Not readable
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
Vector table: BAR=1 offset=00002000
PBA: BAR=1 offset=00003800
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [138 v1] Power Budgeting <?>
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas
The ESP is mounted at /efi; the relevant hooks in the mkinitcpio config and the preset for linux-lts are configured to generate UKIs (this is probably irrelevant, as the problems arise before either bootloader can boot the UKI). The UKIs seem to be (and, as I later found out, were) correctly installed on the ESP under /EFI/Linux.
The problem:
Systemd-boot cannot boot the UKI. It fails with "Error loading \EFI\Linux\arch-linux-lts.efi". Interestingly, when I retry systemd-bootx64.efi from within rEFInd without a reboot in between, the error message changes: I get various but similar errors, such as "Invalid loader file!" (note: I never get this one on the first attempt).
If I retry loading the UKI or systemd-boot a few (around 3) times, rEFInd seems to lose track of the ESP completely. It cannot find any of the UKIs.
I cannot boot via rEFInd either, even after a fresh reboot (without running systemd-bootx64.efi first). It gives a similar error. Interestingly, rEFInd tries to write its variables (refind-vars) to the ESP, but (partially) corrupts the partition while doing so.
By corruption I mean this (while booted into a live environment):
[root@archiso boot]# ls -l
ls: cannot access 'refind-vars': Input/output error
total 30980
drwxr-xr-x 8 root root 4096 Nov 9 01:53 EFI
-rwxr-xr-x 1 root root 7360512 Aug 8 21:19 intel-ucode.img
drwxr-xr-x 3 root root 4096 Nov 9 02:15 loader
d????????? ? ? ? ? ? refind-vars
-rwxr-xr-x 1 root root 236 Nov 9 01:44 refind_linux.conf
-rwxr-xr-x 1 root root 12790144 Nov 9 01:05 vmlinuz-linux
-rwxr-xr-x 1 root root 11557248 Nov 9 01:00 vmlinuz-linux-lts
This can be fixed by running fsck. At that point the ESP was mounted at /boot; later I changed the setup so that the ESP is at /efi and re-created the filesystem with mkfs.vfat, but I got the same bug from rEFInd again.
Scratching my head and retrying, I just could not boot with either systemd-boot or rEFInd. The issue is 100% reproducible.
Then I realized I can boot from the Arch installation media (GRUB) without issue. I tried chainloading the UKI from GRUB and, lo and behold, the Arch installation boots! I then installed GRUB onto the ESP and can chainload the UKI from there in the same way.
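For anyone wanting to repeat the check/repair step, here is a minimal sketch using the dosfstools commands. It assumes the ESP is /dev/sda1 (as in the partition listing above) and is mounted at /efi; the run helper and the DRY_RUN and ESP_DEV variables are made up for this example. By default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: check, and if needed repair, the FAT filesystem on the ESP.
# Assumptions: the ESP is /dev/sda1 and is mounted at /efi.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
ESP_DEV=${ESP_DEV:-/dev/sda1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"      # show what would be run
    else
        "$@"                    # actually run it (needs root)
    fi
}

run umount /efi
run fsck.fat -n "$ESP_DEV"      # read-only check, writes nothing
run fsck.fat -a "$ESP_DEV"      # repair automatically
# Last resort, destroys everything on the ESP:
# run mkfs.fat -F 32 "$ESP_DEV"
```

Set DRY_RUN=0 and run as root only after checking that the device name is right for your system.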
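For reference, a rough sketch of what such a chainloading entry can look like. The function name grub_uki_entry is made up for this example, the UUID ABCD-1234 is a placeholder for the filesystem UUID of the ESP (e.g. from blkid -s UUID -o value /dev/sda1), and the UKI path is the one from the error message above. The script only prints the entry; it does not modify anything.

```shell
#!/bin/sh
# Sketch only: print a GRUB menu entry that chainloads a UKI from the ESP.
# Assumptions: $1 is the filesystem UUID of the ESP and $2 is the UKI path
# relative to the ESP root. ABCD-1234 below is a placeholder, not a real UUID.

grub_uki_entry() {
    # $1 = ESP filesystem UUID, $2 = UKI path on the ESP
    cat <<EOF
menuentry 'Arch Linux LTS (UKI, chainloaded)' {
    insmod part_gpt
    insmod fat
    insmod chain
    search --no-floppy --fs-uuid --set=root $1
    chainloader $2
}
EOF
}

# Review the output, append it to /etc/grub.d/40_custom, regenerate grub.cfg.
grub_uki_entry ABCD-1234 /EFI/Linux/arch-linux-lts.efi
```

The insmod fat line matters here: GRUB then reads the ESP with its own FAT driver before handing the image over to be started.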
Sidenote: The system has 4 x Xeon E5-4620 and 512 GiB of RAM. As this is a really weird issue, I ran (the Lenovo) diagnostics from the LifeCycle controller. They include some fairly lengthy stress tests for the RAM, which finished without any errors.
My thoughts:
Theory: for some reason rEFInd and systemd-boot read from and write to the wrong parts of the ESP. But why?
1. The UEFI boot ROM on the LSI controller is buggy (but why does GRUB work?!??)
2. systemd-boot and rEFInd share a similar, somewhat obscure bug, which for some reason triggers on this HW.
3. There is faulty RAM somewhere?!??
4. WD Green bugs?!??
Something else?!?? Any thoughts?
I still have concerns about stability, as I don't know why the boot process fails. I would like to make sure the system can be booted reliably with GRUB2 (so far I have booted it several times, but fewer than 10). On the other hand, it will not need many reboots considering the intended use. But if the underlying issue is more severe, it might start to affect the server in the long run (although it's a hobbyist/amateur project, it would still be nice to have it in working order).
I have no idea how to debug this kind of issue in systemd-boot or rEFInd. However, if someone can give pointers on how to get useful information out of the boot process, I can try that (the server does have a serial port, connected to an iDRAC, but I don't know how to use it, as iDRAC is new to me and is disabled for the time being).
Out of curiosity, I could still try creating a manual boot entry for the UKI in the UEFI variables with efibootmgr. Considering this is a server (without many kernels, alternative OSes etc. to begin with), that might have been the most sensible thing to do in the first place, instead of using any kind of bootloader/manager.
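A minimal sketch of what that could look like, assuming the ESP is partition 1 of /dev/sda and the UKI is \EFI\Linux\arch-linux-lts.efi (both taken from the details above), and assuming the kernel command line is embedded in the UKI. The function name uki_entry_cmd and the label are made up for this example. The script only prints the efibootmgr command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch only: print an efibootmgr command that registers the UKI directly
# with the firmware, so that no boot manager is involved.
# Assumed values (adjust before use): ESP = partition 1 of /dev/sda,
# UKI = \EFI\Linux\arch-linux-lts.efi on that partition.

uki_entry_cmd() {
    # $1 = disk, $2 = ESP partition number, $3 = menu label, $4 = loader path
    printf "efibootmgr --create --disk %s --part %s --label '%s' --loader '%s'\n" \
        "$1" "$2" "$3" "$4"
}

# Review the printed command, then run it as root.
uki_entry_cmd /dev/sda 1 'Arch Linux LTS (UKI)' '\EFI\Linux\arch-linux-lts.efi'
```

Running efibootmgr without arguments afterwards should list the new entry together with the current boot order.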
Cheers!
Last edited by Wild Penguin (2023-11-10 15:58:12)
If you add storage of some sort that does not use the LSI controller and move the ESP to that storage, can systemd-boot and rEFInd then boot, or does it require the LSI controller's boot option ROM to be disabled?
The MB does not have any other storage options besides the LSI/RAID controller. There are no SATA ports etc. in there (apart from one for the optical drive). I don't have any other storage handy or easily mountable in there besides USB sticks, but I could test whether the issue persists with the ESP on a USB stick.
What I find suspicious is the "Inherited from firmware" part in the VFAT column in this table. Somehow the VFAT driver from the UEFI can load the bootloader, which then does not play nicely with the UEFI; however, if I interpret the table correctly, GRUB has its own FAT driver which takes over?
EDIT: Some brainfarts. It was not at all clear what I was trying to say.
Last edited by Wild Penguin (2023-11-11 00:24:01)
What I find suspicious is the "Inherited from firmware" part in the VFAT column in this table. Somehow the VFAT driver from the UEFI can load the bootloader, which then does not play nicely with the UEFI; however, if I interpret the table correctly, GRUB has its own FAT driver which takes over?
Yes, both systemd-boot and rEFInd just use whatever the UEFI gives them. GRUB has its own drivers.
This smells like a firmware issue.