I'm getting system freezes, on average every few hours. It can happen in 20 minutes or it can take 12 hours. I have not gotten the system to run stably beyond a 12-hour runtime or so. It appears to me to be a GPU issue. This exact same set of hardware ran Arch flawlessly 5-6 months back. The issue makes the system almost unusable.
I am using multiple screens. Four to be exact.
Not all freezes happen in the same fashion. Nor does any application I commonly use (e.g. chromium, gimp, vlc, telegram) need to be running for the freeze to occur.
The first type of freeze causes all displays to freeze instantly. Usually the GPU fans will then run at really high RPM when this happens. The only way to get them to stop is to power down using the power button.
Here is a journalctl -b -1 afterwards:
https://0x0.st/XrI9.txt
The second type of freeze happens more slowly. Here my non-primary screens (i.e. the screens set in the KDE Plasma display config as auxiliary to the main screen / lower priority) become unresponsive. I can still move my cursor around on the main screen without issue, but any application I try to interact with is totally unresponsive. If I try to click a few times, the cursor also freezes on the main screen. Sometimes I can change tty and reboot the system at this point, but usually I cannot and I must use the power button to force a reset.
Here is a journalctl -b -1 afterwards:
https://0x0.st/XrIe.txt
Chromium also crashes very frequently. In my system monitor I appear to have 60+ processes called "chromium --type=renderer --crashpad-handler-pid=5162 --enable-crash-reporter=,Arch Linux --change-stack-guard-on-fork=enable --lang=en-US....." running at any given time. This seems unusual to me and I don't recall this being the case in the past with this setup.
Please let me know what other info I can provide to clarify.
Any suggestions are greatly appreciated.
Last edited by qherring (2024-03-22 03:14:03)
Offline
power down using power button.
Means the journal is likely useless, but https://0x0.st/XrI9.txt is 142MB!
https://0x0.st/XrIe.txt is 6MB worth of mostly chromium crashes.
As immediate tests, revert to the LTS kernel and nvidia-{dkms,utils} 545xx or 535xx from the ALA (they won't build against 6.8 w/o further patches). There are known issues w/ 550xx, but the backtraces look different from anything in your journal, though zram might be a factor here.
If that doesn't stabilize the system, run an extensive memtest86+ check (16+h, an entire weekend - or until you hit errors ;-)
Offline
power down using power button.
Means the journal is likely useless,
I have this journal (created a day or so prior to the others). In this case I was able to reboot; I believe this was one rare instance where I could do so by switching tty.
I will try to load LTS and revert the suggested nvidia pkgs.
Offline
revert to the LTS kernel
I'm actually already on LTS
Offline
Right, so you can just downgrade the nvidia drivers.
There's nothing overly suspicious in the last journal, it's only 9 minutes - the only thing that *might* hint at trouble is that you're shutting down the system about 2 minutes after baloo (kde file indexer) starts.
Offline
Downgrading the nvidia pkgs didn't seem to make any difference.
Yesterday my keyboard and mouse became unresponsive; the screens did not freeze.
The only way I could interact at this point was to short press (not hold) the power button to trigger the shutdown countdown prompt. That should correspond to this in the following journal:
Mar 23 22:06:16 mundofundo systemd-logind[2024]: Power key pressed short.
and this shortly before:
Mar 23 22:04:32 mundofundo kernel: DMAR: DRHD: handling fault status reg 3
Mar 23 22:04:32 mundofundo kernel: DMAR: [DMA Read NO_PASID] Request device [00:14.0] fault addr 0x404c41c4000 [fault reason 0x04] Access beyond MGAW
I will try the memtest and see what happens
Offline
Mar 23 22:04:32 mundofundo kernel: DMAR: DRHD: handling fault status reg 3
Mar 23 22:04:32 mundofundo kernel: DMAR: [DMA Read NO_PASID] Request device [00:14.0] fault addr 0x404c41c4000 [fault reason 0x04] Access beyond MGAW
Try to cause this w/
intel_iommu=off
resp.
iommu=soft
Offline
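To see which device the DMAR faults point at, and whether the kernel parameter really took effect on the next boot, something along these lines might help. This is only a sketch: dmar_fault_devices is a made-up helper name, and it assumes the fault lines look like the two quoted above.

```shell
#!/bin/sh
# Hypothetical helper: list the PCI addresses named in DMAR fault lines read from stdin.
# Assumes the "Request device [bb:dd.f]" format seen in the journal excerpt above.
dmar_fault_devices() {
    sed -n 's/.*DMAR: .*Request device \[\([0-9a-fA-F:.]*\)\].*/\1/p' | sort -u
}

# Usage (kernel messages of the previous boot):
#   journalctl -k -b -1 | dmar_fault_devices
#   lspci -s 00:14.0      # shows what sits at the reported address
#   cat /proc/cmdline     # confirm intel_iommu=off or iommu=soft was really applied
```

If the function prints nothing for a boot with the parameter set, the faults did not recur in that session, which by itself does not prove the parameter fixed anything.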
If that doesn't stabilize the system, run an extensive memtest86+ check (16+h, an entire weekend - or until you hit errors ;-)
Well, I'm finally getting around to running the memtest, and there are quite a few errors. About 16 hours have elapsed and it's about 80% completed; so far 630 errors...
My btrfs subvols are now mounting read-only. From what I understand, RAM issues frequently corrupt btrfs, so it's pretty likely the RAM is to blame.
Offline
Memory issues cause problems all over the place.
Try to downclock it and/or use more conservative timings (both in the BIOS/UEFI) and make sure it's not a temperature issue.
Offline
Thanks. It shouldn't be temp related, I have tons of fans and a mesh case.
I have XMP disabled. When I tested all 128GB, memtest reported 4800MHz (also what the BIOS shows). Now I am testing only 64GB (only 2 DIMMs filled) and memtest shows 4000MHz. I didn't change anything and the BIOS still reports 4800, so I'm not really sure why it's downclocking on its own for the test. No errors so far on the 3rd pass, which is already orders of magnitude better.
The 128GB set is actually two 2x32GB kits, and potentially there could be compatibility issues running both 2x32 kits at once. The manufacturer's most recent QVL now specs a 4x48GB kit, so I may try that if I cannot get the current 128GB to work after downclocking.
Offline
Also pay attention to how the DIMMs are distributed, esp. if they're not from the same vendor/brand/batch - you want the equal ones in the same bank (often color coded on the board) - but the board's manual is authoritative in this regard.
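To compare what the running system reports per slot (slot name, size, configured speed, part number) against the board manual, the `dmidecode --type memory` output can be condensed roughly like this. A sketch only: dimm_summary is a hypothetical helper name, and the "Configured Memory Speed" field name is an assumption about the installed dmidecode version (older releases label it differently).

```shell
#!/bin/sh
# Hypothetical helper: print one line per populated DIMM from
# `dmidecode --type memory` output read on stdin.
dimm_summary() {
    awk '
        # Print the collected DIMM unless the slot is empty, then reset.
        function emit() {
            if (loc != "" && size != "" && size !~ /^No Module/)
                printf "%s | %s | %s | %s\n", loc, size, speed, part
            loc = size = speed = part = ""
        }
        # Strip everything up to and including the first "key:" prefix.
        function value(line) { sub(/^[^:]*:[ \t]*/, "", line); return line }
        /^Memory Device/                  { emit() }
        /^[ \t]+Locator:/                 { loc = value($0) }
        /^[ \t]+Size:/                    { size = value($0) }
        /^[ \t]+Configured Memory Speed:/ { speed = value($0) }
        /^[ \t]+Part Number:/             { part = value($0) }
        END { emit() }
    '
}

# Usage (dmidecode needs root):
#   sudo dmidecode --type memory | dimm_summary
```

Mismatched part numbers or speeds across slots would be worth checking against the manual's recommended population order.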
Offline
Updating the MB BIOS seemed to fix everything hardware-wise. Very strange.
I just ran a memtest again at 4800MHz with all four DIMMs and got no errors. They are all the identical make and model of RAM.
My btrfs root partition would not mount on boot due to a "cannot read superblock" error.
On another system I ran
btrfs rescue super-recover
on the btrfs partition.
It tells me "All supers are valid, no need to recover",
so I ran
btrfs rescue zero-log
My drive now mounts sort of fine, however now it's always read-only... I can pull any data off it I want and nothing appears to be corrupted or missing.
dmesg has plenty of errors about missing chunk maps et al.
https://pastebin.com/bxDtUVuL
I just ran
btrfs check
and things look bad, tons of mapping errors and mismatches:
https://pastebin.com/SBRSBABZ
Any way to save this filesystem, or should I just jump ship and reinstall Arch?
It seems I should run a btrfs scrub first before trying btrfs check --repair (seeing as it doesn't have a good success rate).
There is no data on the drive that isn't backed up, so I'm not worried about further corruption.
Last edited by qherring (2024-04-17 02:48:38)
Offline
Unfortunately scrub fails to run
dmesg:
[ 4428.503615] BTRFS warning (device dm-1): checksum error at logical 32630964224 on dev /dev/mapper/root, physical 28344385536, root 1051, inode 1255207, offset 0, length 4096, links 1 (path: usr/lib/python3.11/site-packages/pydantic/v1/networks.py)
[ 4431.201507] BTRFS critical (device dm-1): unable to find chunk map for logical 141013395488768 length 4096
[ 4431.201515] BTRFS critical (device dm-1): unable to find chunk map for logical 141013395488768 length 16384
[ 4431.201519] BTRFS critical (device dm-1): unable to find chunk map for logical 141013395488768 length 16384
[ 4431.203075] BTRFS info (device dm-1): scrub: not finished on devid 1 with status: -5
smartctl -a on the device tells me all is healthy
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Last edited by qherring (2024-04-17 03:41:18)
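The checksum warning in the dmesg excerpt above names the affected file, so the list of damaged files can be pulled out of the log directly. A rough sketch, assuming the warning lines keep the "(path: ...)" suffix shown above; csum_error_paths is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: extract the file paths from BTRFS checksum-error
# warnings read on stdin, one unique path per line.
csum_error_paths() {
    sed -n 's/.*BTRFS warning.*checksum error at logical.*(path: \(.*\))$/\1/p' | sort -u
}

# Usage:
#   dmesg | csum_error_paths
# On a booted system, the owning package of such a file can then be looked up
# with pacman -Qo /usr/lib/... and reinstalled.
```

Paths are relative to the subvolume root, which is why the example in the log has no leading slash.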
Offline
smartctl -a on the device tells me all is healthy
That line and an empty bag are worth about the bag, please post the full output.
"btrfs check" defaults to a read-only check, it won't make any errors go away.
You'd first BACKUP (you have, I'm just pointing that out for future readers), then MAKE SURE YOU'VE A COMPLETE BACKUP
then repair and then scrub as final step.
However:
ERROR: failed to repair root items: Input/output error
=> revisit the smartctl output and also system journal/dmesg about the nature of the IO error.
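For future readers, the order described above (backup first, then repair, then scrub as the final step) could be sketched like this. The function name is hypothetical, /dev/mapper/root and /mnt are taken from the logs in this thread, and by default it only prints the commands; it executes them only when RUN=1 is set. The backup itself is deliberately not automated.

```shell
#!/bin/sh
# Hypothetical helper: print (default) or run (RUN=1) the repair-then-scrub order.
# The filesystem must be unmounted for the repair step, and a complete backup
# must already exist before running this with RUN=1.
btrfs_repair_sequence() {
    dev=$1 mnt=$2
    for cmd in \
        "btrfs check --repair $dev" \
        "mount $dev $mnt" \
        "btrfs scrub start -B $mnt"
    do
        if [ "${RUN:-0}" = 1 ]; then
            # Stop at the first failing step, e.g. an Input/output error during repair.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}

# Usage (dry run, prints the three commands in order):
#   btrfs_repair_sequence /dev/mapper/root /mnt
```

The commands are stored as plain strings, so this only works for device and mount paths without spaces.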
Offline
smartctl -a
https://pastebin.com/b7aK7Ltm
btrfs scrub start -B /mnt
Starting scrub on devid 1
ERROR: scrubbing /mnt failed for device id 1: ret=-1, errno=5 (Input/output error)
scrub canceled for 9dac4ee8-4d90-488c-adc8-7f4056899071
Scrub started: Wed Apr 17 11:42:16 2024
Status: aborted
Duration: 0:00:17
Total to scrub: 34.10GiB
Rate: 2.00GiB/s
Error summary: csum=1
Corrected: 0
Uncorrectable: 1
Unverified: 0
dmesg post scrub:
https://pastebin.com/edqMEy1s
journalctl since mounting the disk:
https://pastebin.com/XhSzrfN5
Offline
The IO errors are from the bus, probably not harmful but in general see https://wiki.archlinux.org/title/Solid_ … leshooting and also try pcie_aspm=off
The disk looks ok, but the bus errors *might* at some point have caused the corruption (the ones logged were corrected, though)
There's also no HW IO error logged during the scrub, it probably just refers to the pending FS error.
We still lack an IO error record from the repair attempt, though?
Offline