I ran into a nasty looking error during a routine system upgrade, then rebooted because (see below). I now am faced with a hanging Dell logo every time I try to boot into linux-zen or linux-lts from Refind. I am utterly terrified that I am going to lose the work I have on this computer, which is not backed up because of reasons too long to get into. I've booted into a live USB, but I have no idea what to do next.
For context, I spent last night at the ER, and I haven't really slept in two. Doing system maintenance is hard right now, as is everything else. Any form of help would be enormously appreciated.
Last edited by Singdra (2022-10-07 17:05:44)
Offline
Mount the partitions, chroot into it and post the outputs of
journalctl -b-1
https://wiki.archlinux.org/title/List_o … n_services FWIW there's a known regression with certain Intel iGPUs in the 5.19.12 kernel, so maybe check whether downgrading the kernel to 5.19.11 helps. However, it would in general be nice to have info on what kind of error you saw exactly. You can also use --since and the like to filter the journal for a specific date. https://wiki.archlinux.org/title/System … ing_output
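The steps above can be sketched roughly as follows. This is only a sketch under assumptions: ROOT_DEV and MNT are placeholders (confirm the real root partition with lsblk -f), arch-chroot from arch-install-scripts is assumed to be on the live medium, and the --since date is just an example. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: read the previous boot's journal from a live USB.
# ROOT_DEV and MNT are assumptions; check the real device with `lsblk -f`.
ROOT_DEV="${ROOT_DEV:-/dev/nvme0n1p2}"
MNT="${MNT:-/mnt}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

inspect_previous_boot() {
    # Mount the installed system's root filesystem.
    run mount "$ROOT_DEV" "$MNT"
    # arch-chroot mounts /proc, /sys and /dev before entering the chroot.
    run arch-chroot "$MNT" journalctl -b -1 --no-pager
    # Alternatively, filter by time instead of by boot (example date).
    run arch-chroot "$MNT" journalctl --since "2022-10-04 06:00" --no-pager
}

inspect_previous_boot
```

If the installation has a separate boot or EFI partition, it would need to be mounted under MNT as well before running package operations in the chroot.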
Offline
Thank you for the response. I was able to mount my linux filesystem (/dev/nvme0n1p2), chroot into it, and run journalctl. I'm not sure how to paste the results, though, especially since it's very long.
Offline
Read the links in my post. As long as you have a connection you can paste to a pastebin service.
Offline
Well, I verified that the files in my home directory are intact, so I'm not having a heart attack anymore. That's good.
I've tried uploading to a few pastebins, but I got this:
curl: (6) could not resolve host:
I'm connected to ethernet, and I'm able to ping 8.8.8.8. Running nmcli yields "Could not create NMClient object: Could not connect: No such file or directory."
Last edited by Singdra (2022-10-04 14:55:17)
Offline
Got it. Forgot to copy my resolv.conf.
Journalctl -b-1 output: https://0x0.st/oJ9Q.txt
Journalctl --since="2022-10-04 6:00" output: http://ix.io/4chg
Pasted the URL since the text of the journalctl log is too long for a forum post, and used two different pastebins since 0x0 took issue with the second one. Also, unless I'm misunderstanding, I don't have a separate home partition.
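For anyone hitting the same "could not resolve host" error inside a plain chroot, the fix and the upload can be sketched like this. It is a sketch under assumptions: MNT is a placeholder for wherever the root filesystem is mounted, and the curl form upload is the usage 0x0.st documents for itself, so check the service's page before relying on it. The script only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: give a plain chroot working DNS, then upload a journal excerpt.
# MNT is an assumption; adjust it to the actual mount point.
MNT="${MNT:-/mnt}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

fix_dns_and_upload() {
    # -L follows a symlink, since resolv.conf is often one on live systems.
    run cp -L /etc/resolv.conf "$MNT/etc/resolv.conf"
    # sh -c keeps the whole pipe inside the chroot; curl reads the log from stdin.
    run chroot "$MNT" sh -c "journalctl -b -1 --no-pager | curl -F 'file=@-' https://0x0.st"
}

fix_dns_and_upload
```

Using arch-chroot instead of plain chroot should make the copy step unnecessary, since it sets up resolv.conf for the chroot itself.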
Last edited by Singdra (2022-10-04 16:02:52)
Offline
This section, starting at 07:20:47, seems like it could be relevant:
Oct 04 07:20:47 cuvienen systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer.
Oct 04 07:20:47 cuvienen kernel: kswapd0 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Oct 04 07:20:47 cuvienen kernel: CPU: 9 PID: 133 Comm: kswapd0 Tainted: P OE 5.19.11-zen1-1-zen #1 9258f582030ae8d7ff0630d1be4a4611cc8b3838
Oct 04 07:20:47 cuvienen kernel: Hardware name: Dell Inc. Dell G15 5515/00VT1V, BIOS 1.2.0 06/08/2021
Oct 04 07:20:47 cuvienen kernel: Call Trace:
Oct 04 07:20:47 cuvienen kernel: <TASK>
Oct 04 07:20:47 cuvienen kernel: dump_stack_lvl+0x48/0x60
Oct 04 07:20:47 cuvienen kernel: dump_header+0x4a/0x1ff
Oct 04 07:20:47 cuvienen kernel: oom_kill_process.cold+0xb/0x10
Oct 04 07:20:47 cuvienen kernel: out_of_memory+0x27e/0x5e0
Oct 04 07:20:47 cuvienen kernel: balance_pgdat+0x9b2/0xd80
Oct 04 07:20:47 cuvienen kernel: kswapd+0x1fa/0x3c0
Oct 04 07:20:47 cuvienen kernel: ? sched_energy_aware_handler+0xb0/0xb0
Oct 04 07:20:47 cuvienen kernel: ? balance_pgdat+0xd80/0xd80
Oct 04 07:20:47 cuvienen kernel: kthread+0x13f/0x160
Oct 04 07:20:47 cuvienen kernel: ? kthread_complete_and_exit+0x20/0x20
Oct 04 07:20:47 cuvienen kernel: ret_from_fork+0x22/0x30
Oct 04 07:20:47 cuvienen kernel: </TASK>
Oct 04 07:20:47 cuvienen kernel: Mem-Info:
Oct 04 07:20:47 cuvienen kernel: active_anon:210516 inactive_anon:1355257 isolated_anon:0
active_file:34406 inactive_file:25859 isolated_file:0
unevictable:128 dirty:0 writeback:0
slab_reclaimable:43669 slab_unreclaimable:32823
mapped:66270 shmem:961249 pagetables:8025 bounce:0
kernel_misc_reclaimable:0
free:28684 free_pcp:3196 free_cma:0
Oct 04 07:20:47 cuvienen kernel: Node 0 active_anon:842064kB inactive_anon:5421028kB active_file:137624kB inactive_file:103436kB unevictable:512kB isolated(anon):0kB isolated(file):0kB mapped:265080kB dirty:0kB writeback:0kB shmem:3844996kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 253952kB writeback_tmp:0kB kernel_stack:23392kB pagetables:32100kB all_unreclaimable? no
Oct 04 07:20:47 cuvienen kernel: Node 0 DMA free:14336kB boost:0kB min:136kB low:168kB high:200kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct 04 07:20:47 cuvienen kernel: lowmem_reserve[]: 0 3116 7258 7258 7258
Oct 04 07:20:47 cuvienen kernel: Node 0 DMA32 free:52332kB boost:0kB min:28956kB low:36192kB high:43428kB reserved_highatomic:0KB active_anon:122268kB inactive_anon:2784500kB active_file:15732kB inactive_file:84736kB unevictable:444kB writepending:0kB present:3266188kB managed:3200152kB mlocked:444kB bounce:0kB free_pcp:3976kB local_pcp:420kB free_cma:0kB
Oct 04 07:20:47 cuvienen kernel: lowmem_reserve[]: 0 0 4142 4142 4142
Oct 04 07:20:47 cuvienen kernel: Node 0 Normal free:48068kB boost:0kB min:38488kB low:48108kB high:57728kB reserved_highatomic:0KB active_anon:719512kB inactive_anon:2636420kB active_file:121668kB inactive_file:18084kB unevictable:68kB writepending:16kB present:4426752kB managed:4249008kB mlocked:68kB bounce:0kB free_pcp:8808kB local_pcp:440kB free_cma:0kB
Oct 04 07:20:47 cuvienen kernel: lowmem_reserve[]: 0 0 0 0 0
Oct 04 07:20:47 cuvienen kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
Oct 04 07:20:47 cuvienen kernel: Node 0 DMA32: 76*4kB (UME) 434*8kB (UME) 343*16kB (UME) 105*32kB (UE) 121*64kB (UME) 70*128kB (UME) 44*256kB (UME) 23*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 52368kB
Oct 04 07:20:47 cuvienen kernel: Node 0 Normal: 462*4kB (ME) 815*8kB (UME) 892*16kB (UME) 255*32kB (UME) 87*64kB (UME) 44*128kB (UME) 23*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 47888kB
Oct 04 07:20:47 cuvienen kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 04 07:20:47 cuvienen kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 04 07:20:47 cuvienen kernel: 1021774 total pagecache pages
Oct 04 07:20:47 cuvienen kernel: 0 pages in swap cache
Oct 04 07:20:47 cuvienen kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 04 07:20:47 cuvienen kernel: Free swap = 0kB
Oct 04 07:20:47 cuvienen kernel: Total swap = 0kB
Oct 04 07:20:47 cuvienen kernel: 1927233 pages RAM
Oct 04 07:20:47 cuvienen kernel: 0 pages HighMem/MovableOnly
Oct 04 07:20:47 cuvienen kernel: 61103 pages reserved
Oct 04 07:20:47 cuvienen kernel: 0 pages cma reserved
Oct 04 07:20:47 cuvienen kernel: 0 pages hwpoisoned
Oct 04 07:20:47 cuvienen kernel: Tasks state (memory values in pages):
I would welcome any advice in terms of what steps to take next, either to make my system usable again, or to get the data in my home directory off my computer. I'm not quite sure how to proceed.
Offline
A whole bunch of AHCI/drive/disk access errors, maybe even buggy RAM (... maybe "simply" a side effect from running OOM that one time, but we need to verify that first). In the best case your cable came loose and you can just fix the connector. In the worst case your disk is basically toast, and you should minimize writes to it and back up/read the data to a known good drive ASAP. What's the output of
smartctl -a /dev/$drivename$
If you have any form of a disk you can back up to, use something like ddrescue to get as much data off as possible.
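A rough sketch of the ddrescue approach, assuming GNU ddrescue. SRC, IMG and MAP are hypothetical paths: the image and map file must live on a different, known good disk. The two-pass pattern (fast copy first, retries afterwards) is a common convention rather than something prescribed in this thread. The script only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: image a possibly failing drive with GNU ddrescue.
# SRC, IMG and MAP are assumptions; IMG and MAP belong on a healthy disk.
SRC="${SRC:-/dev/nvme0n1}"
IMG="${IMG:-/mnt/backup/disk.img}"
MAP="${MAP:-/mnt/backup/disk.map}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

rescue_disk() {
    # Pass 1: copy everything that reads cleanly, skipping the slow scraping phase.
    run ddrescue -n "$SRC" "$IMG" "$MAP"
    # Pass 2: retry the bad areas up to three times using direct disc access.
    run ddrescue -d -r3 "$SRC" "$IMG" "$MAP"
}

rescue_disk
```

The map file is what lets a second run resume and revisit only the unread areas, which is the main advantage over plain dd on a drive that throws read errors.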
Last edited by V1del (2022-10-05 00:56:15)
Offline
I'm already in the process of backing everything up to an external drive using dd. Hopefully that doesn't damage it further, but I suppose there's no way to tell now that I'm several hundred gigs in. It's slow going, so I can post the smartctl output when it's done copying.
Which cable are you referring to?
Offline
smartctl -a /dev/$drivename$ output:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.19.12-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: PC SN530 NVMe WDC 512GB
Serial Number: 21296Q461027
Firmware Version: 21113012
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 4a4981dec5
Local Time is: Wed Oct 5 02:47:41 2022 UTC
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 83 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.50W 2.10W - 0 0 0 0 0 0
1 + 2.40W 1.60W - 0 0 0 0 0 0
2 + 1.90W 1.50W - 0 0 0 0 0 0
3 - 0.0250W - - 3 3 3 3 3900 11000
4 - 0.0050W - - 4 4 4 4 5000 39000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 33 Celsius
Available Spare: 100%
Available Spare Threshold: 50%
Percentage Used: 2%
Data Units Read: 10,845,904 [5.55 TB]
Data Units Written: 7,770,404 [3.97 TB]
Host Read Commands: 109,832,542
Host Write Commands: 75,265,676
Controller Busy Time: 380
Power Cycles: 3,211
Power On Hours: 3,484
Unsafe Shutdowns: 188
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Offline
That looks OK. Is that the only drive? SATA/AHCI errors usually come from normal disks that show up under /dev/sdX rather than from an NVMe controller. Where's your swap partition located? If this is a one-off issue due to insufficient RAM at the time, you might be able to simply reinstall the botched packages, potentially with a
pacman -Syu linux
from the chroot with all stuff mounted properly.
Can you post
pacman -Qkk 2&>1 > /dev/null
to check what pacman knows about any corruption
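The redirection in that command looks mistyped (2&>1 rather than 2>&1). A minimal sketch of what was probably intended, assuming pacman prints its per-file warnings on stderr and the per-package summaries on stdout: send stderr to the terminal and discard stdout. The fake_check function and its two output lines are made up purely to illustrate the redirection; they are not real pacman output.

```shell
#!/bin/sh
# Sketch: keep only stderr of a command, discard its stdout.
# fake_check is a hypothetical stand-in for `pacman -Qkk`.
fake_check() {
    echo "examplepkg: 123 total files, 0 altered files"
    echo "warning: examplepkg: /etc/example.conf (Modification time mismatch)" >&2
}

# Order matters: first point stderr at the current stdout,
# then send stdout itself to /dev/null.
only_warnings() {
    "$@" 2>&1 >/dev/null
}

only_warnings fake_check
```

On the real system the equivalent call would be pacman -Qkk 2>&1 >/dev/null, which leaves just the warnings to read or paste.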
Offline
Results of the second command. Here's what pacman knows: http://0x0.st/oJt3.csv
This is the only drive, aside from a loopback one. I inexplicably don't have a SWAP set up.
I haven't run the first command yet.
Offline
Darn that totally didn't generate the result I wanted, oh well. But from that output everything seems fine.
So your disk and your data are generally OK. With the partitions properly mounted, run that first command (pacman -Syu linux) from the chroot, after which you should be able to boot again.
So the scary-looking messages were likely indeed from running OOM and not having any swap to evade to. You probably really want to consider setting up some swap; it will help you in memory spike/starvation situations. Maybe also investigate how you got there in the first place: how much RAM do you have (around 8 GB, if my count based on the dmesg isn't off), and which programs were open during the crashy situation? I see Discord and likely some browser, which can well take up a good 6-7 GB on their own.
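The RAM estimate can be checked against the OOM report, which gives its totals in pages. A small sketch of the arithmetic, assuming the usual 4 kB page size on x86_64; the page counts are the "pages RAM" and "shmem" values from the log quoted earlier in the thread.

```shell
#!/bin/sh
# Sketch: convert page counts from the OOM report into kB and GiB.
# Assumes 4 kB pages (the usual x86_64 page size).
PAGE_KB=4

pages_to_kb() {
    echo $(( $1 * PAGE_KB ))
}

kb_to_gib() {
    # awk does the floating-point division; LC_ALL=C keeps "." as the decimal mark.
    LC_ALL=C awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1024 / 1024 }'
}

ram_kb=$(pages_to_kb 1927233)    # "1927233 pages RAM"
shmem_kb=$(pages_to_kb 961249)   # "shmem:961249" in Mem-Info

echo "RAM:   ${ram_kb} kB = $(kb_to_gib "$ram_kb") GiB"
echo "shmem: ${shmem_kb} kB = $(kb_to_gib "$shmem_kb") GiB"
```

This prints 7708932 kB (7.35 GiB) for RAM, consistent with an 8 GB machine, and 3844996 kB (3.67 GiB) for shmem, which matches the shmem:3844996kB figure in the Node 0 line. Since shmem covers tmpfs and shared memory, roughly half of the RAM was held there at the time of the OOM kill.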
Offline
Sounds good, and thank you! I'll post the results after I get off of work and am able to take a swing at it. Possibly a bad question, but is there a reason to use pacman -Syu linux, as opposed to pacman -Syu linux-zen linux-lts or somesuch? Linux-zen is my usual kernel, and the one I was using when everything went downhill.
Great idea on the swap. I could have sworn I set up swap when I was doing my install, but I poked around and there's clearly no swap partition or swap file, and swap isn't on, so I clearly must have been mistaken. I'll probably set up swap before rebooting, just so I can avoid running into another OOM situation.
As for what caused this situation, I'll have to look into it. I've got 8 gigs of ram, which is normally fine for my purposes. I'm suspicious that discord is the culprit, given how many instances of it appear to be running simultaneously, and the fact that I was having trouble with it just before all this happened. Ideally I wouldn't have to use discord at all, but such is life. Assuming that I can boot, I'll try reinstalling discord and see if that fixes anything.
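Setting up the swap mentioned above could look roughly like this. It is a sketch under assumptions: SWAPFILE and SIZE_MIB are placeholder values, the root filesystem is assumed to be ext4 (Btrfs needs extra steps not shown here), and the commands are meant to be run as root on the installed system or inside the chroot. The script only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: create and enable a swap file.
# SWAPFILE and SIZE_MIB are assumptions; pick values that suit the machine.
SWAPFILE="${SWAPFILE:-/swapfile}"
SIZE_MIB="${SIZE_MIB:-4096}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

setup_swap() {
    # Allocate the file with real blocks (no holes), then lock down permissions.
    run dd if=/dev/zero of="$SWAPFILE" bs=1M count="$SIZE_MIB" status=progress
    run chmod 600 "$SWAPFILE"
    # Write the swap signature and enable the file for the running kernel.
    run mkswap "$SWAPFILE"
    run swapon "$SWAPFILE"
    # To keep it across reboots, a line like this goes into /etc/fstab:
    echo "# fstab: $SWAPFILE none swap defaults 0 0"
}

setup_swap
```

Afterwards, swapon --show or free -h should list the new swap space.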
Offline
No luck. Ran pacman -Syu, but attempting to reboot into linux-zen yields the same hanging dell screen as before.
I did notice an absolute flood of error messages when rebooting from my arch recovery usb, as well as more AHCI errors when starting up my recovery usb.
Offline
Given the lack of ideas, does it make sense at this point just to do a full reinstall and see if that resolves anything? I've got everything on my hard drive backed up, and I've pulled a list of my installed packages, so I'm effectively safe in terms of data.
I don't relish the thought of getting a new laptop, but the number of AHCI and ACPI errors that keep cropping up makes me worried that this is some sort of hardware issue.
Offline
At the risk of quadruple-posting, I'm going to offer a potential explanation. As a last ditch attempt, I tried a full reinstall of arch. I was in a hurry, so I used archinstall, which failed due to https://github.com/archlinux/archinstall/issues/1495. Fine, whatever. I restarted so that I could do it manually, and I noticed that I couldn't boot into my live USB. No problem, things were re-partitioned, so I probably destroyed my bootloader, and had a funky boot order set up. I tried to get into the boot menu.
Nothing.
Ok, UEFI menu?
Even more nothing.
I ended up on the phone with Dell customer service for over an hour, and we ended up doing everything up to and including a full NVRAM reset and restoring the BIOS. Every time, nothing. The machine was so unresponsive we couldn't even get error codes out of it. The technician I spoke to implied that not one, but multiple pieces of hardware might need to be replaced. I think @V1del's initial instinct was correct: at the very least, the motherboard is probably shot.
Marking this solved, although in this case, it's less "solved" and more "the computer done broke." Thank you for the help, and sorry for posing a problem without a real solution.
Offline