#1 2023-06-30 09:57:47

sjdh
Member
Registered: 2013-09-19
Posts: 22

AER errors with NVIDIA GPUs

Hi

When I install NVIDIA GPUs in my machine, AER errors keep appearing, like this:

nvidia 0000:01:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000

I tried:
• BIOS versions 1.1 and 3.5 (latest)
• Several PCIe slots
• Several riser cables
• Several GPUs (Inno3D RTX 4090, MSI RTX 3090, Gigabyte RTX 3090)

What could be the cause, and what else can I try?

My system

Linux 6.3.9-arch1-1 x86_64
ASRock Rack ROMED8-2T
AMD EPYC 7402P
Samsung 980 PRO NVMe

> lspci -tv | head
-+-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
 |           +-01.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-01.1-[01]--+-00.0  NVIDIA Corporation AD102 [GeForce RTX 4090]
 |           |            \-00.1  NVIDIA Corporation AD102 High Definition Audio Controller
 |           +-02.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-03.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-03.5-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 |           +-04.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-05.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
 |           +-07.0  Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge


Error message

> dmesg
[ 1170.658532] {83}[Hardware Error]:   bridge: secondary_status: 0xc000, control: 0x0000
[ 1170.658533] {83}[Hardware Error]:  Error 1, type: corrected
[ 1170.658534] {83}[Hardware Error]:   section_type: PCIe error
[ 1170.658534] {83}[Hardware Error]:   port_type: 0, PCIe end point
[ 1170.658535] {83}[Hardware Error]:   version: 0.2
[ 1170.658536] {83}[Hardware Error]:   command: 0x0006, status: 0x0010
[ 1170.658537] {83}[Hardware Error]:   device_id: 0000:01:00.1
[ 1170.658537] {83}[Hardware Error]:   slot: 0
[ 1170.658538] {83}[Hardware Error]:   secondary_bus: 0x00
[ 1170.658539] {83}[Hardware Error]:   vendor_id: 0x10de, device_id: 0x22ba
[ 1170.658539] {83}[Hardware Error]:   class_code: 040300
[ 1170.658540] {83}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
[ 1170.658541] {83}[Hardware Error]:  Error 2, type: corrected
[ 1170.658541] {83}[Hardware Error]:   section_type: PCIe error
[ 1170.658542] {83}[Hardware Error]:   port_type: 1, legacy PCI end point
[ 1170.658543] {83}[Hardware Error]:   version: 0.2
[ 1170.658543] {83}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 1170.658544] {83}[Hardware Error]:   device_id: 0000:01:00.0
[ 1170.658545] {83}[Hardware Error]:   slot: 0
[ 1170.658546] {83}[Hardware Error]:   secondary_bus: 0x00
[ 1170.658547] {83}[Hardware Error]:   vendor_id: 0x10de, device_id: 0x2684
[ 1170.658548] {83}[Hardware Error]:   class_code: 030000
[ 1170.658549] {83}[Hardware Error]:   bridge: secondary_status: 0xc000, control: 0x0000
[ 1170.658550] {83}[Hardware Error]:  Error 3, type: corrected
[ 1170.658550] {83}[Hardware Error]:   section_type: PCIe error
[ 1170.658551] {83}[Hardware Error]:   port_type: 0, PCIe end point
[ 1170.658552] {83}[Hardware Error]:   version: 0.2
[ 1170.658553] {83}[Hardware Error]:   command: 0x0006, status: 0x0010
[ 1170.658553] {83}[Hardware Error]:   device_id: 0000:01:00.1
[ 1170.658554] {83}[Hardware Error]:   slot: 0
[ 1170.658555] {83}[Hardware Error]:   secondary_bus: 0x00
[ 1170.658556] {83}[Hardware Error]:   vendor_id: 0x10de, device_id: 0x22ba
[ 1170.658556] {83}[Hardware Error]:   class_code: 040300
[ 1170.658557] {83}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000
[ 1170.658695] nvidia 0000:01:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 1170.658736] nvidia 0000:01:00.0:    [ 0] RxErr                  (First)
[ 1170.658738] nvidia 0000:01:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 1170.658853] snd_hda_intel 0000:01:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 1170.658877] snd_hda_intel 0000:01:00.1:    [ 0] RxErr                  (First)
[ 1170.658879] snd_hda_intel 0000:01:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 1170.658947] nvidia 0000:01:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 1170.658964] nvidia 0000:01:00.0:    [ 0] RxErr                  (First)
[ 1170.658965] nvidia 0000:01:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 1170.659081] snd_hda_intel 0000:01:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 1170.659097] snd_hda_intel 0000:01:00.1:    [ 0] RxErr                  (First)
[ 1170.659098] snd_hda_intel 0000:01:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
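
To make the log above easier to read, here is a minimal Python sketch that decodes the aer_status bitmask and tallies errors per device. It assumes the errors are correctable ones (the log says "type: corrected"), so it uses the bit layout of the PCIe AER Correctable Error Status register with the short names the Linux kernel prints. The names `decode_correctable`, `tally` and `SAMPLE` are made up for this illustration and are not part of any tool.

```python
import re
from collections import Counter

# Bits of the PCIe AER Correctable Error Status register, labelled with
# the short names the Linux kernel uses in dmesg (e.g. "RxErr").
# Assumption: the status value belongs to a *corrected* error.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error (physical layer)
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

# Matches lines such as:
#   nvidia 0000:01:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
AER_LINE = re.compile(
    r"(\S+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AER: "
    r"aer_status: (0x[0-9a-fA-F]+), aer_mask: (0x[0-9a-fA-F]+)"
)


def decode_correctable(status):
    """Return the names of all correctable-error bits set in status."""
    return [
        name
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


def tally(lines):
    """Count decoded error names per PCI device address."""
    counts = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match is None:
            continue  # not an aer_status line
        device = match.group(2)
        status = int(match.group(3), 16)
        for name in decode_correctable(status):
            counts[(device, name)] += 1
    return counts


# Three aer_status lines copied from the dmesg output above.
SAMPLE = [
    "[ 1170.658695] nvidia 0000:01:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000",
    "[ 1170.658853] snd_hda_intel 0000:01:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000",
    "[ 1170.658947] nvidia 0000:01:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000",
]

if __name__ == "__main__":
    for (device, name), count in tally(SAMPLE).items():
        print(device, name, count)
```

Every status value in the log is 0x00000001, so only bit 0 (RxErr) is set. This agrees with the "RxErr" and "Physical Layer" lines the kernel prints for both functions of the card (01:00.0 and 01:00.1). To run it over a full log, the lines of saved dmesg output could be passed to `tally` instead of `SAMPLE`.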

Hardware details

> inxi -F
System:
  Host: *** Kernel: 6.3.9-arch1-1 arch: x86_64 bits: 64 Console: pty pts/5 (vt 1)
    Distro: Arch Linux
Machine:
  Type: Server Mobo: ASRockRack model: ROMED8-2T serial: <superuser required>
    UEFI: American Megatrends v: P3.50 date: 07/19/2022
CPU:
  Info: 24-core model: AMD EPYC 7402P bits: 64 type: MT MCP cache: L2: 12 MiB
  Speed (MHz): avg: 1572 min/max: 1500/2800 cores: 1: 2800 2: 1500 3: 1500 4: 1500 5: 1500
    6: 1500 7: 1500 8: 1500 9: 1500 10: 1500 11: 1500 12: 1500 13: 1500 14: 2400 15: 1500 16: 2800
    17: 1500 18: 1500 19: 1500 20: 1500 21: 1500 22: 1500 23: 1500 24: 1500 25: 1500 26: 1500
    27: 1500 28: 1500 29: 1500 30: 1500 31: 1500 32: 1500 33: 1500 34: 1500 35: 1500 36: 1500
    37: 1500 38: 1500 39: 1500 40: 1500 41: 1500 42: 1500 43: 1500 44: 1500 45: 1500 46: 1500
    47: 1500 48: 1500
Graphics:
  Device-1: NVIDIA AD102 [GeForce RTX 4090] driver: nvidia v: 535.54.03
  Device-2: ASPEED Graphics Family driver: ast v: kernel
  Display: server: X.org v: 1.21.1.8 driver: gpu: ast tty: 105x47 resolution: 1920x1080
  API: OpenGL Message: GL data unavailable in console and glxinfo missing.
Audio:
  Device-1: NVIDIA AD102 High Definition Audio driver: snd_hda_intel
  API: ALSA v: k6.3.9-arch1-1 status: kernel-api
Network:
  Device-1: Intel Ethernet X550 driver: ixgbe
  IF: eno1 state: up speed: 1000 Mbps duplex: full mac: d0:50:99:dc:94:69
  Device-2: Intel Ethernet X550 driver: ixgbe
  IF: eno2 state: up speed: 1000 Mbps duplex: full mac: d0:50:99:dc:94:6a
  IF-ID-1: br-8aa902e35839 state: down mac: 02:42:22:e3:f9:3e
  IF-ID-2: br-ac3821849b62 state: up speed: 10000 Mbps duplex: unknown mac: 02:42:9e:66:08:10
  IF-ID-3: docker0 state: down mac: 02:42:c5:ea:bd:4a
  IF-ID-4: vethcc9692c state: up speed: 10000 Mbps duplex: full mac: 0a:95:91:71:05:d1
  IF-ID-5: vethfc68e9e state: up speed: 10000 Mbps duplex: full mac: 42:ef:6c:8b:06:f9
Drives:
  Local Storage: total: 945.92 GiB used: 200.09 GiB (21.2%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 980 PRO 1TB size: 931.51 GiB
  ID-2: /dev/sda vendor: Kingston model: DataTraveler 3.0 size: 14.41 GiB type: USB
Partition:
  ID-1: / size: 884.32 GiB used: 200 GiB (22.6%) fs: ext4 dev: /dev/nvme0n1p3
  ID-2: /boot size: 341.3 MiB used: 89.2 MiB (26.1%) fs: vfat dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: partition size: 31.65 GiB used: 0 KiB (0.0%) dev: /dev/nvme0n1p2
Sensors:
  Src: ipmi Permissions: Unable to run ipmi sensors. Root privileges required.
  Src: lm-sensors System Temperatures: cpu: 32.5 C mobo: 37.0 C
  Fan Speeds (RPM): fan-1: 0 fan-2: 0 fan-3: 0 fan-4: 0 fan-5: 0
Info:
  Processes: 565 Uptime: 58m Memory: available: 314.57 GiB used: 4.02 GiB (1.3%) Init: systemd
  Shell: fish inxi: 3.3.27

#2 2023-07-01 12:00:09

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 15,038

Re: AER errors with NVIDIA GPUs

Your motherboard comes with an ASPEED AST2500 graphics controller.

According to the manual, the firmware has an option to select between the ASPEED controller and an external VGA card.
In the firmware setup, under Advanced, you should see OnBrd/Ext VGA Select.

What is it set to, and does changing it make a difference?


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.

clean chroot building not flexible enough ?
Try clean chroot manager by graysky

#3 2023-07-01 19:12:07

sjdh
Member
Registered: 2013-09-19
Posts: 22

Re: AER errors with NVIDIA GPUs

Thanks for checking.

I am using the OnBrd setting. Switching to Ext VGA did not make a difference.

#4 2023-07-02 09:48:29

Lone_Wolf
Administrator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 15,038

Re: AER errors with NVIDIA GPUs

Please post the full dmesg output as well as your /etc/mkinitcpio.conf file.


