I've been experiencing an error that typically starts with a "PCIe link lost" error or an "illegal qc_active transition" error.
Originally these errors caused hangs that led to a reboot, and on reboot I would often see hardware errors until I did a cold start. After a significant amount of troubleshooting I was able to stabilize my system so that it no longer reboots or freezes after a PCIe link lost error. However, the link lost errors are still fairly disruptive: they cause my NIC to drop out, and it needs manual intervention to recover. So I'm looking to eliminate them while minimizing any performance loss (e.g. from downclocking RAM).
You will see in the logs below that the drive my system is installed on (ata6) also fails, but is able to re-establish the connection on its own.
I have the following cmdline applied:
BOOT_IMAGE=/vmlinuz-linux root=UUID=bc19042c-984a-411a-ab5b-9784d244bb0c rw loglevel=3 pcie_port_pm=off pcie_aspm.policy=performance processor.max_cstate=1 amd_iommu=off
amd_iommu=off was added as an attempt to make recovery of the NIC cleaner however it still required I run modprobe igb to reconnect the NIC.
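Since parameters get added and removed a lot during this kind of testing, it can help to confirm what the running kernel actually booted with. Below is a minimal sketch, not part of the original post: the helper names missing_params and check_running_kernel are made up for illustration, and the EXPECTED list is copied from the cmdline above. It only compares whole whitespace-separated tokens against /proc/cmdline.

```python
# Hypothetical helper: report which expected kernel parameters are absent
# from a kernel command line. Names here are illustrative, not a real tool.

# Parameters taken from the cmdline quoted in the post.
EXPECTED = [
    "pcie_port_pm=off",
    "pcie_aspm.policy=performance",
    "processor.max_cstate=1",
    "amd_iommu=off",
]


def missing_params(cmdline: str, expected: list[str]) -> list[str]:
    """Return the expected parameters that are not whole tokens in cmdline."""
    present = set(cmdline.split())
    return [p for p in expected if p not in present]


def check_running_kernel(path: str = "/proc/cmdline") -> list[str]:
    """Check the currently booted kernel (Linux only)."""
    with open(path) as f:
        return missing_params(f.read(), EXPECTED)
```

Calling check_running_kernel() after each reboot and noting the result next to the test outcome makes it harder to mix up which boot had which parameters. An empty list means all four were applied.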
And the following BIOS settings applied:
Global C States off
PBO Off - enabling PBO alone results in reboots again
DF C states off
XMP Disabled with manually set values for:
- MCLK = 3600
- FCLK = 1800
- VDDG_IOD = 1.05 VSOC
- CCD = 900 mV
- VSOC = 1.10 V
- tCAS-tRCD-tRP-tRAS = 18-22-22-44
- All other subtimings still on auto
The longest session without a PCIe link lost was with XMP disabled. However it was 6 hours, and I've had them as infrequent as 4 hours in, so it was possibly just luck that an error did not occur.
I am dual booting, with my Windows installation on an entirely separate physical drive from Arch. Arch lives on a 500GB partition on my 4TB SATA SSD. This drive is only accessed by Windows for some games, and it's ~50-60% full. Windows has been entirely stable for me and I have not seen any WHEA errors even after an overnight CoreCycler run. I confirmed powercfg /H is set to disabled in Windows using powercfg /a.
I've had my computer on but in standby/completely idle for up to two days without any crashes or errors. The screen was off on the lock screen. Last night I disabled the logout and screen shutoff after X minutes, ran a script to apply a semi-randomized load and that resulted in the same error after about 3-4 hours, and I was able to reset the NIC using "sudo modprobe -r igb && sleep 1 && sudo modprobe igb"
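The manual recovery above (unload igb, wait a second, load it again) can be wrapped so it is one call instead of a typed command. This is only a sketch under assumptions: reload_commands and reload_module are hypothetical names, the script has to run as root because it calls modprobe directly rather than through sudo, and the runner argument exists only so the sequence can be exercised without touching a real module.

```python
import subprocess
import time


def reload_commands(module: str) -> list[list[str]]:
    """Build the unload/load command pair for a kernel module."""
    return [["modprobe", "-r", module], ["modprobe", module]]


def reload_module(module: str = "igb", pause: float = 1.0,
                  runner=subprocess.run) -> None:
    """Unload the module, wait, then load it again (needs root).

    Mirrors: modprobe -r igb && sleep 1 && modprobe igb
    check=True makes a failed unload raise, so the load step is skipped,
    which matches the && chaining of the shell version.
    """
    unload, load = reload_commands(module)
    runner(unload, check=True)
    time.sleep(pause)
    runner(load, check=True)
```

This does nothing about the cause of the dropouts, it only shortens the recovery step while testing continues.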
I've reviewed and attempted fixes from these posts:
pcie_port_pm=off and pcie_aspm.policy=performance kernel params applied
Disabling C-states and "power Supply Idle Control"
similar issue for Proxmox user - unresolved
MC5_Status errors - suggestion to add vcore offset
Here are some other fixes I've tried:
Updated BIOS - on latest version F40
Load optimized defaults in BIOS
re-seating and reapplying thermal paste to CPU
- temps have never been out of the ordinary
- all cores under load for 15 minutes I barely hit 70 deg C
Cool&Quiet on and off
amd_pstate=active, with global c states and cool&quiet on
My first 3 added kernel params (see above) with global c states on
No kernel params or BIOS settings changed except for PBO off
Global C states off but no kernel params
+0.012 and +0.024 Voltage offsets on CPU
- Vcore voltage peaking at 1.48 V single core; 1.37 V all cores
Static or offset voltages on V SoC
Static voltages applied to DRAM
- auto seems to supply enough voltage and changing this alone did not prevent reboots
Removed USB devices or adjusted which USB slots were used
- Some previous crashes would cause USB ports to fail alongside the NIC
Set PCIe to be Gen3 in BIOS (instead of Auto)
Here are my system specs:
[10:18:54] [adrian@red-october-arch troubleshooting]$ sudo inxi --full
System:
Host: red-october-arch Kernel: 6.18.2-arch2-1 arch: x86_64 bits: 64
Desktop: KDE Plasma v: 6.5.4 Distro: Arch Linux
Machine:
Type: Desktop System: Gigabyte product: X570 AORUS ELITE v: -CF serial: N/A
Mobo: Gigabyte model: X570 AORUS ELITE serial: N/A Firmware: UEFI
vendor: American Megatrends LLC. v: F40 date: 10/28/2025
CPU:
Info: 12-core model: AMD Ryzen 9 3900X bits: 64 type: MT MCP cache:
L2: 6 MiB
Speed (MHz): avg: 1746 min/max: 563/4674 cores: 1: 1746 2: 1746 3: 1746
4: 1746 5: 1746 6: 1746 7: 1746 8: 1746 9: 1746 10: 1746 11: 1746 12: 1746
13: 1746 14: 1746 15: 1746 16: 1746 17: 1746 18: 1746 19: 1746 20: 1746
21: 1746 22: 1746 23: 1746 24: 1746
Graphics:
Device-1: NVIDIA TU106 [GeForce RTX 2060 Rev. A] driver: nvidia v: 590.48.01
Device-2: Logitech C922 Pro Stream Webcam driver: snd-usb-audio,uvcvideo
type: USB
Display: unspecified server: X.Org v: 24.1.9 with: Xwayland v: 24.1.9
driver: X: loaded: nvidia unloaded: modesetting
gpu: nv_platform,nvidia,nvidia-nvswitch resolution: 1: 2560x1440~165Hz
2: 2560x1440~155Hz
API: EGL v: 1.5 drivers: nvidia platforms: gbm
API: OpenGL v: 4.6.0 vendor: nvidia v: 590.48.01 renderer: NVIDIA GeForce
RTX 2060/PCIe/SSE2
API: Vulkan v: 1.4.335 drivers: nvidia surfaces: N/A
Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo
de: kscreen-console,kscreen-doctor gpu: nvidia-settings,nvidia-smi
wl: wayland-info x11: xdpyinfo, xprop, xrandr
Audio:
Device-1: NVIDIA TU106 High Definition Audio driver: snd_hda_intel
Device-2: Advanced Micro Devices [AMD] Starship/Matisse HD Audio
driver: snd_hda_intel
Device-3: Logitech C922 Pro Stream Webcam driver: snd-usb-audio,uvcvideo
type: USB
Device-4: Astro Gaming A50 driver: hid-generic,snd-usb-audio,usbhid
type: USB
API: ALSA v: k6.18.2-arch2-1 status: kernel-api
Network:
Device-1: Intel I211 Gigabit Network driver: igb
IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: b4:2e:99:f8:62:6b
Device-2: Intel 82575EB Gigabit Network driver: N/A
Device-3: Intel 82575EB Gigabit Network driver: N/A
Device-4: Realtek RTL8812AE 802.11ac PCIe Wireless Network Adapter
driver: rtl8821ae
IF: wlan0 state: down mac: a2:42:ae:cb:82:db
Drives:
Local Storage: total: 4.8 TiB used: 116.01 GiB (2.4%)
ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 1TB
size: 931.51 GiB
ID-2: /dev/sda vendor: Crucial model: CT275MX300SSD4 size: 256.17 GiB
ID-3: /dev/sdb vendor: Crucial model: CT4000MX500SSD1 size: 3.64 TiB
Partition:
ID-1: / size: 48.91 GiB used: 19.85 GiB (40.6%) fs: ext4 dev: /dev/sdb5
ID-2: /boot size: 1022 MiB used: 42.6 MiB (4.2%) fs: vfat dev: /dev/sdb3
ID-3: /home size: 210.56 GiB used: 96.12 GiB (45.6%) fs: ext4
dev: /dev/sdb6
Swap:
ID-1: swap-1 type: partition size: 34 GiB used: 0 KiB (0.0%) dev: /dev/sdb4
Sensors:
System Temperatures: cpu: 30.0 C mobo: 27.0 C
Fan Speeds (rpm): cpu: 754 fan-1: 0 fan-3: 751 fan-4: 0 fan-5: 1019
Info:
Memory: total: 32 GiB available: 31.26 GiB used: 5.63 GiB (18.0%)
Processes: 519 Uptime: 6h 47m Shell: Sudo inxi: 3.3.40
Here are a few different relevant boots:
First where XMP was just entirely disabled - No PCIe failure:
http://0x0.st/Pozm.txt
Second where memory speed and primary timings were manually set - all other timings auto:
https://0x0.st/PozZ.txt
Current boot - left overnight with "random load" script to attempt to trigger PCIe failure.
For this boot, in an attempt to find more stable memory settings I also have tRRD_S-tRRD_L-tFAW set to 4-7-16 - I started with 4-6-16 which booted but resulted in a PCIe error into a lockup and reboot. At tRRD_L = 7 PCIe failed but no reboot occurred and I was able to recover the NIC connection. There is also no lag experienced if it's recoverable.
https://0x0.st/Poz8.txt
Last edited by dogbr3ath (2026-01-23 19:45:05)
Is your RAM on your motherboard's QVL list?
https://www.xda-developers.com/motherbo … andy-tool/
Is your RAM on your motherboard's QVL list?
https://www.xda-developers.com/motherbo … andy-tool/
Yes. Just not as the 32GB set. But BIOS manufacturers don’t check every single kit, and G.Skill has my board listed on their product page as compatible with the sticks I have.
i doubt ram has any play here
anyway: is the system stable with uefi default settings without any tuning at all?
as this is AM4: have you checked for bent pins when you reseated the cpu?
a ryzen 3000 in a 500 board smells like either a really strange setup or the board was upgraded at one time - or it's just thrown together pieces from what was available at times - could actually be the cpu struggling to support the board, which is meant for a 5000 cpu (yes, even a ryzen 1000 should work - but fun fact: there are a lot of 500 boards not supporting ryzen 1000 for whatever reason)
is the system stable with uefi default settings without any tuning at all?
I have had reboots or PCIe failures with the optimized defaults loaded. I am stripping down the PC (removing expansion cards, usb devices) and retesting with a better record of outcomes.
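For keeping a better record across boots, one option is to count the two error signatures named at the top of the thread in each saved kernel log. The following is a rough sketch, not something from the original posts: tally_errors is a hypothetical helper, and it assumes the messages appear in the log with exactly the wording quoted in the first post ("PCIe link lost" and "illegal qc_active transition"), matched case-insensitively.

```python
from collections import Counter

# Error strings as quoted in the first post of this thread.
SIGNATURES = ("PCIe link lost", "illegal qc_active transition")


def tally_errors(lines) -> Counter:
    """Count how many log lines contain each known error signature."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for signature in SIGNATURES:
            if signature.lower() in lowered:
                counts[signature] += 1
    return counts
```

It accepts any iterable of lines, so it could be fed an open file for one of the saved boot logs, or sys.stdin when piping the output of journalctl -k -b into a small wrapper script. A boot with an empty Counter is a clean run for the settings under test.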
as this is AM4: have you checked for bent pins when you reseated the cpu?
I did yes
a ryzen 3000 in a 500 board smells like either a really strange setup or the board was upgraded at one time
I bought the 3900X on sale in I think Feb or so 2020 -- 5000 series chips were not out yet. The idea was to upgrade eventually. However, with 5000 series all hitting end of life I may have missed my window for an X3D model. I did quite a bit of research at the time of the purchase so there should not have been any compatibility issues. Reviews would have only been done with 3000 series chips at the time as well.
First where XMP was just entirely disabled - No PCIe failure:
Did you *ever* run into this w/ XMP disabled and/or can you cause it while XMP is disabled?
Did you *ever* run into this w/ XMP disabled and/or can you cause it while XMP is disabled?
I think I have in the past but was not documenting my testing well. So that's what I'm trying to do now. I have no consistent way of triggering it yet.
I ran my simulated load script for about an hour with optimized UEFI defaults loaded and no errors.
Stopped it and connected the internet and ran that for about 8.5 hours.
Added a monitor and let it run overnight + a bit longer (~12 hours).
I find I get mixed results from boot to boot, so I'll restart my system now and see how it behaves with the same settings for ~2-4 hours.
Update: After rebooting and a 5.5 hr session I'm still not seeing any PCIe errors. I'll try adding USB devices 1 at a time, then add my expansion NIC/wifi cards back one at a time as well.
05/01/26: Ran pacman -Syu, rebooted, and had a full day's worth of work without errors. Relaxed the processor.max_cstate kernel param from 1 to 5 for a slightly more energy efficient setup. Ran the y-cruncher stressor alongside OCCT, per the r/overclocking official DDR4 RAM overclocking guide, to stress the infinity fabric. No PCIe errors during the run (there are some GPU related errors, but those are a separate issue and nothing crashed). As it's been stable for 3 days, I'm feeling confident I have a stable baseline. So I'll be adjusting both PBO and the RAM (setting it below XMP speeds) and testing from there. I am thinking I can possibly get my RAM to 3200MHz which should be plenty for my 3900X.
06/01/26: I've set MCLK to 3200 MHz and FCLK to 1600 MHz to match. I've also reconnected my NIC and numpad (was having USB/NIC dropouts). After a full 2 days of work/usage I haven't had any PCIe dropout errors, freezes or crashes. Next I'll remove some of my kernel params, and confirm stability before marking this as resolved.
23/01/26: Still no issues. Marking this as resolved and assuming the solution was downclocking my RAM to something officially supported by the CPU.
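On the "FCLK to match" part of the 06/01/26 update: the BIOS memory figure is the DDR data rate, and DDR transfers twice per clock, so the matching 1:1 fabric clock is half of it. That covers both pairs used in this thread (3600/1800 and 3200/1600). A small sketch of the arithmetic, with hypothetical function names:

```python
def fclk_for_1to1(ddr_rate: int) -> float:
    """Fabric clock (MHz) that runs 1:1 with a given DDR data rate.

    DDR memory transfers data twice per clock cycle, so the real memory
    clock is half the advertised rate, and a 1:1 FCLK equals that clock.
    """
    return ddr_rate / 2


def is_synced(ddr_rate: int, fclk_mhz: float) -> bool:
    """True when the memory rate and FCLK form a 1:1 pair."""
    return fclk_for_1to1(ddr_rate) == fclk_mhz
```

For example, is_synced(3200, 1600) and is_synced(3600, 1800) both hold, while pairing 3600 with an FCLK of 1600 would not be a 1:1 configuration.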
Last edited by dogbr3ath (2026-01-23 19:44:38)