Hi,
I have some issues with my recently acquired Dell Precision 3510 running vanilla Arch Linux. The system is equipped with an XPG GAMMIX S5 M.2 NVMe SSD.
The major issue is a large number of PCIe errors in dmesg, which seem to be corrected and thus benign, like so:
[ 367.133164] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
[ 367.133174] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 367.133179] pcieport 0000:00:1d.0: AER: device [8086:a118] error status/mask=00000001/00002000
[ 367.133183] pcieport 0000:00:1d.0: AER: [ 0] RxErr (First)
[ 398.055258] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[ 398.166636] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 398.166638] pcieport 0000:00:1d.0: AER: device [8086:a118] error status/mask=00002001/00002000
[ 398.166639] pcieport 0000:00:1d.0: AER: [ 0] RxErr (First)
[ 398.166640] pcieport 0000:00:1d.0: AER: Error of this Agent is reported first
[ 398.166661] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 398.166662] nvme 0000:04:00.0: AER: device [10ec:5762] error status/mask=00000001/00006000
[ 398.166664] nvme 0000:04:00.0: AER: [ 0] RxErr (First)
[ 423.441203] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[ 423.552328] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 423.552330] pcieport 0000:00:1d.0: AER: device [8086:a118] error status/mask=00002001/00002000
[ 423.552331] pcieport 0000:00:1d.0: AER: [ 0] RxErr (First)
[ 423.552346] pcieport 0000:00:1d.0: AER: Error of this Agent is reported first
[ 423.552349] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 423.552351] nvme 0000:04:00.0: AER: device [10ec:5762] error status/mask=00000001/00006000
[ 423.552352] nvme 0000:04:00.0: AER: [ 0] RxErr (First)
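For anyone trying to read the status/mask values in the AER lines above, here is a minimal bash sketch that decodes a correctable error status word into flag names. The function name decode_cor_status is made up for illustration, and the bit positions follow the PCIe AER Correctable Error Status register layout as I understand it, so treat this as a reading aid rather than a reference.

```shell
#!/usr/bin/env bash
# Decode a PCIe AER correctable error status (or mask) value as printed in
# dmesg, e.g. "error status/mask=00002001/00002000".
# decode_cor_status is a hypothetical helper name, not an existing tool.
decode_cor_status() {
    local value=$((16#$1))   # interpret the argument as hexadecimal
    local -a names=()
    (( value & 0x0001 )) && names+=("RxErr")           # bit 0: receiver error
    (( value & 0x0040 )) && names+=("BadTLP")          # bit 6
    (( value & 0x0080 )) && names+=("BadDLLP")         # bit 7
    (( value & 0x0100 )) && names+=("Rollover")        # bit 8: REPLAY_NUM rollover
    (( value & 0x1000 )) && names+=("Timeout")         # bit 12: replay timer timeout
    (( value & 0x2000 )) && names+=("AdvNonFatalErr")  # bit 13: advisory non-fatal
    (( value & 0x4000 )) && names+=("CorrIntErr")      # bit 14: corrected internal
    (( value & 0x8000 )) && names+=("HeaderOF")        # bit 15: header log overflow
    echo "${names[*]}"
}

decode_cor_status 00002001   # status reported by the root port
decode_cor_status 00002000   # mask reported by the root port
```

With the values from the log, the status 00002001 decodes to RxErr plus AdvNonFatalErr, while the mask 00002000 covers only AdvNonFatalErr, so the receiver errors themselves are unmasked and get reported.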
Every now and then, however, the PC freezes for several seconds. When it comes back to life, this shows up in dmesg:
[ 2686.436788] nvme nvme0: I/O 768 QID 1 timeout, aborting
[ 2686.568963] nvme nvme0: Abort status: 0x0
[ 3104.569357] nvme nvme0: I/O 786 QID 1 timeout, aborting
[ 3104.701442] nvme nvme0: Abort status: 0x0
[ 3196.729120] nvme nvme0: I/O 768 QID 1 timeout, aborting
[ 3196.861227] nvme nvme0: Abort status: 0x0
[ 3275.235571] nvme nvme0: I/O 535 QID 2 timeout, aborting
[ 3275.367650] nvme nvme0: Abort status: 0x0
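To see whether the timeouts cluster on one queue, a small sketch like the following can tally them per queue ID. The helper name count_nvme_timeouts is hypothetical, and it assumes the message format shown in the log above ("I/O <n> QID <n> timeout, aborting").

```shell
#!/usr/bin/env bash
# Count "I/O ... timeout, aborting" messages per NVMe queue ID from kernel
# log text supplied on stdin. count_nvme_timeouts is a hypothetical helper.
count_nvme_timeouts() {
    awk '
        /nvme.*I\/O [0-9]+ QID [0-9]+ timeout/ {
            # The queue ID is the field right after the literal "QID".
            for (i = 1; i < NF; i++)
                if ($i == "QID") count[$(i + 1)]++
        }
        END {
            for (q in count) printf "QID %s: %d\n", q, count[q]
        }
    ' | sort -k2,2n
}

# Typical use on a live system (may require root, depending on settings):
#   dmesg | count_nvme_timeouts
```

Fed the four timeout lines above, this reports three timeouts on QID 1 and one on QID 2.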
EDIT: Sometimes these timeouts appear in dmesg without freezing the PC.
I tried pci=nomsi, as suggested in various places online. However, with that parameter the PC runs significantly hotter (idle CPU temperature: 55 °C with pci=nomsi, 38 °C without). Since this is my main productivity machine (music production), I need MSIs (together with RTIRQ) to achieve low latency, and that part actually works fine.
I also tried pci=noaer, which suppresses the error reports, but the freezes (I/O aborted due to timeout) still occur, although somewhat less frequently.
The issue shows up as early as during the Arch Linux installation; however, no data is ever lost and the filesystem stays consistent. BTW, RTIRQ and threaded IRQs don't seem to influence this issue at all.
Moreover, I got two hard freezes today (blinking Caps Lock LED, i.e. a kernel panic); unfortunately, I was unable to determine the cause.
There are no freezes whatsoever when running Win10 on this machine. The NVMe SSD drive is new, with several hours runtime reported by smartctl. I also updated the BIOS firmware to the newest one from Dell (1.24.4) for this machine. There are no updates to the SSD firmware.
I kindly ask for your help. I've been using Arch on much older Dell machines for ages without any problems. What can I do to solve this issue?
How can I determine the cause of the panics when no dump is displayed (running a graphical session with dmesg -w on the console)? The journal just stops at the time of the freeze, with no info there.
Thanks for any help in advance.
Here is some info about the PC:
[cristos@momentvm ~] $ uname -a
Linux momentvm 5.2.13-arch1-1-ARCH #1 SMP PREEMPT Fri Sep 6 17:52:33 UTC 2019 x86_64 GNU/Linux
[cristos@momentvm ~] $ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-linux root=UUID=685aeb9c-5d62-4163-8bf1-e140320e4666 rw radeon.si_support=1 amdgpu.si_support=0 pci=noaer threadirqs loglevel=3
Note: radeon and amdgpu parameters were my attempt to solve another issue with this machine, posted on this bbs as well
Interrupts:
[cristos@momentvm ~] $ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 9 0 0 0 0 0 0 0 IR-IO-APIC 2-edge timer
1: 11189 0 0 2957 0 0 0 0 IR-IO-APIC 1-edge i8042
8: 0 0 0 0 0 0 1 0 IR-IO-APIC 8-edge rtc0
9: 490 470 0 0 0 0 0 0 IR-IO-APIC 9-fasteoi acpi
12: 233599 0 109348 0 0 0 0 0 IR-IO-APIC 12-edge i8042
16: 0 0 0 0 0 0 0 0 IR-IO-APIC 16-fasteoi i801_smbus
31: 0 0 0 0 0 0 0 0 IR-IO-APIC 31-edge smo8800
120: 0 0 0 0 0 0 0 0 DMAR-MSI 0-edge dmar0
121: 0 0 0 0 0 0 0 0 DMAR-MSI 1-edge dmar1
122: 0 0 0 0 0 0 0 0 IR-PCI-MSI 16384-edge PCIe PME
123: 0 0 0 0 0 0 0 0 IR-PCI-MSI 458752-edge PCIe PME
124: 0 0 0 0 0 0 0 0 IR-PCI-MSI 462848-edge PCIe PME
125: 0 0 0 0 0 0 0 0 IR-PCI-MSI 475136-edge PCIe PME
126: 0 43 0 0 0 0 0 0 IR-PCI-MSI 360448-edge mei_me
127: 0 0 4894 0 0 0 0 0 IR-PCI-MSI 327680-edge xhci_hcd
128: 0 0 0 23 0 0 0 0 IR-PCI-MSI 2097152-edge nvme0q0
129: 0 0 0 0 149 0 0 0 IR-PCI-MSI 514048-edge snd_hda_intel:card1
130: 0 0 0 0 0 1110 0 0 IR-PCI-MSI 520192-edge enp0s31f6
131: 0 0 0 0 1175 0 0 0 IR-PCI-MSI 2097153-edge nvme0q1
132: 0 304 0 0 0 0 0 0 IR-PCI-MSI 2097154-edge nvme0q2
133: 0 0 388 0 0 0 0 0 IR-PCI-MSI 2097155-edge nvme0q3
134: 38 42 0 0 0 0 0 0 IR-PCI-MSI 524288-edge radeon
135: 13 0 0 0 22 0 0 0 IR-PCI-MSI 1572864-edge rtsx_pci
136: 0 0 0 427 0 0 0 0 IR-PCI-MSI 2097156-edge nvme0q4
137: 0 0 0 0 0 0 0 0 IR-PCI-MSI 376832-edge ahci[0000:00:17.0]
138: 0 0 0 0 0 288 0 0 IR-PCI-MSI 2097157-edge nvme0q5
139: 0 0 0 0 0 0 202 0 IR-PCI-MSI 2097158-edge nvme0q6
140: 4288 0 0 0 3169 0 0 0 IR-PCI-MSI 1048576-edge iwlwifi
141: 107122 0 0 0 0 52512 0 0 IR-PCI-MSI 32768-edge i915
142: 0 0 0 0 0 0 0 309 IR-PCI-MSI 2097159-edge nvme0q7
NMI: 1 3 3 2 2 2 2 2 Non-maskable interrupts
LOC: 198058 146347 153115 130486 135367 137948 135278 134512 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 1 3 3 2 2 2 2 2 Performance monitoring interrupts
IWI: 526 6 8 10 7 129 9 5 IRQ work interrupts
RTR: 6 0 0 0 0 0 0 0 APIC ICR read retries
RES: 29969 19686 11233 7545 6791 6101 5586 5743 Rescheduling interrupts
CAL: 10124 10589 10611 11430 9431 10948 11996 10447 Function call interrupts
TLB: 12604 13067 13250 13621 10154 13521 14177 13371 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 12 13 13 13 13 13 13 13 Machine check polls
HYP: 0 0 0 0 0 0 0 0 Hypervisor callback interrupts
HRE: 0 0 0 0 0 0 0 0 Hyper-V reenlightenment interrupts
HVS: 0 0 0 0 0 0 0 0 Hyper-V stimer0 interrupts
ERR: 2
MIS: 0
PIN: 0 0 0 0 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 0 0 0 0 Posted-interrupt wakeup event
Smartctl output:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.2.13-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: XPG GAMMIX S5
Serial Number: 2J2020044509
Firmware Version: V9001b31
PCI Vendor/Subsystem ID: 0x10ec
IEEE OUI Identifier: 0x00e04c
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 525433 334e323035
Local Time is: Thu Sep 12 18:45:28 2019 CEST
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 118 Celsius
Critical Comp. Temp. Threshold: 150 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 50.00W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 32%
Percentage Used: 0%
Data Units Read: 1,344,890 [688 GB]
Data Units Written: 2,102,462 [1.07 TB]
Host Read Commands: 9,151,608
Host Write Commands: 10,553,286
Controller Busy Time: 0
Power Cycles: 35
Power On Hours: 16
Unsafe Shutdowns: 19
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, max 8 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1 0 0x0000 0x0000 0x000 0 0 -
6 1219368206019409729 0 0x0000 0x0000 0x000 0 0 -
Modules:
[cristos@momentvm ~] $ lsmod
Module Size Used by
ccm 20480 3
joydev 28672 0
mousedev 24576 0
snd_hda_codec_hdmi 69632 1
arc4 16384 2
intel_rapl 28672 0
x86_pkg_temp_thermal 20480 0
intel_powerclamp 20480 0
coretemp 20480 0
kvm_intel 311296 0
uvcvideo 114688 0
snd_hda_codec_realtek 126976 1
videobuf2_vmalloc 20480 1 uvcvideo
iwlmvm 466944 0
videobuf2_memops 20480 1 videobuf2_vmalloc
snd_hda_codec_generic 94208 1 snd_hda_codec_realtek
kvm 770048 1 kvm_intel
videobuf2_v4l2 28672 1 uvcvideo
irqbypass 16384 1 kvm
iTCO_wdt 16384 0
videobuf2_common 57344 2 videobuf2_v4l2,uvcvideo
nls_iso8859_1 16384 1
nls_cp437 20480 1
snd_hda_intel 53248 3
crct10dif_pclmul 16384 1
dell_wmi 20480 0
crc32_pclmul 16384 0
mac80211 999424 1 iwlmvm
mei_hdcp 24576 0
mei_wdt 16384 0
iTCO_vendor_support 16384 1 iTCO_wdt
intel_wmi_thunderbolt 20480 0
wmi_bmof 16384 0
vfat 24576 1
dell_laptop 28672 0
sparse_keymap 16384 1 dell_wmi
ghash_clmulni_intel 16384 0
fat 86016 1 vfat
ledtrig_audio 16384 3 snd_hda_codec_generic,snd_hda_codec_realtek,dell_laptop
snd_usb_audio 278528 0
videodev 237568 3 videobuf2_v4l2,uvcvideo,videobuf2_common
ppdev 24576 0
snd_hda_codec 159744 4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
aesni_intel 372736 2
snd_usbmidi_lib 40960 1 snd_usb_audio
iwlwifi 389120 1 iwlmvm
dell_smbios 32768 2 dell_wmi,dell_laptop
snd_rawmidi 45056 1 snd_usbmidi_lib
dell_wmi_descriptor 20480 2 dell_wmi,dell_smbios
btusb 57344 0
snd_hda_core 102400 5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
aes_x86_64 20480 1 aesni_intel
i915 2265088 32
crypto_simd 16384 1 aesni_intel
btrtl 20480 1 btusb
cryptd 24576 2 crypto_simd,ghash_clmulni_intel
dcdbas 20480 1 dell_smbios
snd_seq_device 16384 1 snd_rawmidi
glue_helper 16384 1 aesni_intel
dell_smm_hwmon 20480 0
intel_cstate 16384 0
btbcm 16384 1 btusb
media 61440 5 videodev,snd_usb_audio,videobuf2_v4l2,uvcvideo,videobuf2_common
btintel 28672 1 btusb
intel_uncore 139264 0
snd_hwdep 20480 2 snd_usb_audio,snd_hda_codec
intel_rapl_perf 16384 0
psmouse 180224 0
bluetooth 675840 5 btrtl,btintel,btbcm,btusb
pcspkr 16384 0
input_leds 16384 0
e1000e 286720 0
cfg80211 856064 3 iwlmvm,iwlwifi,mac80211
snd_pcm 135168 5 snd_hda_codec_hdmi,snd_hda_intel,snd_usb_audio,snd_hda_codec,snd_hda_core
rtsx_pci_ms 24576 0
i2c_i801 36864 0
snd_timer 40960 1 snd_pcm
memstick 20480 1 rtsx_pci_ms
mei_me 45056 2
ecdh_generic 16384 1 bluetooth
snd 110592 18 snd_hda_codec_generic,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_usb_audio,snd_usbmidi_lib,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm,snd_rawmidi
tpm_crb 20480 0
soundcore 16384 1 snd
ecc 32768 1 ecdh_generic
pcc_cpufreq 20480 0
mei 126976 5 mei_wdt,mei_hdcp,mei_me
processor_thermal_device 20480 0
intel_gtt 24576 1 i915
intel_soc_dts_iosf 20480 1 processor_thermal_device
intel_pch_thermal 16384 0
wmi 36864 5 intel_wmi_thunderbolt,dell_wmi,wmi_bmof,dell_smbios,dell_wmi_descriptor
parport_pc 53248 0
parport 61440 2 parport_pc,ppdev
evdev 24576 16
tpm_tis 16384 0
mac_hid 16384 0
tpm_tis_core 24576 1 tpm_tis
dell_smo8800 20480 0
tpm 73728 3 tpm_tis,tpm_crb,tpm_tis_core
int3400_thermal 20480 0
dell_rbtn 20480 0
int3403_thermal 16384 0
battery 24576 0
rfkill 28672 9 bluetooth,dell_laptop,dell_rbtn,cfg80211
int340x_thermal_zone 16384 2 int3403_thermal,processor_thermal_device
acpi_thermal_rel 16384 1 int3400_thermal
rng_core 16384 1 tpm
ac 16384 0
sg 40960 0
crypto_user 16384 0
ip_tables 36864 0
x_tables 49152 1 ip_tables
ext4 770048 2
crc32c_generic 16384 0
crc16 16384 2 bluetooth,ext4
mbcache 16384 1 ext4
jbd2 135168 1 ext4
rtsx_pci_sdmmc 32768 0
serio_raw 20480 0
mmc_core 184320 1 rtsx_pci_sdmmc
atkbd 36864 0
libps2 20480 2 atkbd,psmouse
ahci 40960 0
libahci 40960 1 ahci
libata 282624 2 libahci,ahci
xhci_pci 20480 0
crc32c_intel 24576 3
xhci_hcd 278528 1 xhci_pci
scsi_mod 249856 2 libata,sg
rtsx_pci 81920 2 rtsx_pci_sdmmc,rtsx_pci_ms
i8042 32768 1 dell_laptop
serio 28672 6 serio_raw,atkbd,psmouse,i8042
radeon 1646592 1
amdgpu 3973120 0
gpu_sched 36864 1 amdgpu
i2c_algo_bit 16384 3 amdgpu,radeon,i915
drm_kms_helper 225280 3 amdgpu,radeon,i915
syscopyarea 16384 1 drm_kms_helper
sysfillrect 16384 1 drm_kms_helper
sysimgblt 16384 1 drm_kms_helper
fb_sys_fops 16384 1 drm_kms_helper
ttm 118784 2 amdgpu,radeon
drm 503808 17 gpu_sched,drm_kms_helper,amdgpu,radeon,i915,ttm
agpgart 53248 3 intel_gtt,ttm,drm
Lspci for the drive:
04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev ec) (prog-if 10)
Subsystem: Realtek Semiconductor Co., Ltd. Device 5762
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV+ VGASnoop+ ParErr+ Stepping+ SERR+ FastB2B+ DisINTx+
Status: Cap+ 66MHz+ UDF+ FastB2B+ ParErr+ DEVSEL=?? >TAbort+ <TAbort+ <MAbort+ >SERR+ <PERR+ INTx+
Latency: 0
Interrupt: pin A routed to IRQ 16
NUMA node: 0
Region 0: Memory at ef000000 (64-bit, non-prefetchable) [size=16K]
Region 5: Memory at ef004000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Via message/WAKE#
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number 00-00-00-01-00-4c-e0-00
Capabilities: [158 v1] Secondary PCI Express <?>
Capabilities: [178 v1] Latency Tolerance Reporting
Max snoop latency: 3145728ns
Max no snoop latency: 3145728ns
Capabilities: [180 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=60us PortTPowerOnTime=60us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: nvme
Last edited by cristos666 (2019-09-24 05:50:31)
Please post the full lspci -k and journalctl -b output.
For the journal output you should use a pastebin client.
Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.
(A works at time B) && (time C > time B ) ≠ (A works at time C)
Hi,
I can now freeze my PC to a kernel panic on demand.
It happens when I start 10 file transfers over FTP with FileZilla, literally 200k small files (around 1 MB each or less). Then, after a while (about 2-3 minutes), this happens:
https://drive.google.com/open?id=13MLaq … q_EPOum-EN
(sorry for posting a link to a photo, but this was the only way to capture the panic)
After a reboot the drive is not detected by the system; another power cycle is needed for the system to see it again. Is the drive failing?
Here's my lspci -k output
[cristos@momentvm ~] $ lspci -k
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers
Kernel driver in use: skl_uncore
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
Kernel driver in use: pcieport
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
DeviceName: Onboard IGD
Subsystem: Dell HD Graphics 530
Kernel driver in use: i915
Kernel modules: i915
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
Kernel driver in use: proc_thermal
Kernel modules: processor_thermal_device
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
Subsystem: Dell 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
Subsystem: Dell 100 Series/C230 Series Chipset Family Thermal Subsystem
Kernel driver in use: intel_pch_thermal
Kernel modules: intel_pch_thermal
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
Subsystem: Dell 100 Series/C230 Series Chipset Family MEI Controller
Kernel driver in use: mei_me
Kernel modules: mei_me
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
Subsystem: Dell Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode]
Kernel driver in use: ahci
Kernel modules: ahci
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
Kernel driver in use: pcieport
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
Kernel driver in use: pcieport
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation CM236 Chipset LPC/eSPI Controller (rev 31)
Subsystem: Dell CM236 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
Subsystem: Dell 100 Series/C230 Series Chipset Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
Subsystem: Dell 100 Series/C230 Series Chipset Family HD Audio Controller
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
Subsystem: Dell 100 Series/C230 Series Chipset Family SMBus
Kernel driver in use: i801_smbus
Kernel modules: i2c_i801
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
Subsystem: Dell Ethernet Connection (2) I219-LM
Kernel driver in use: e1000e
Kernel modules: e1000e
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X] (rev 87)
Subsystem: Dell Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X]
Kernel driver in use: amdgpu
Kernel modules: radeon, amdgpu
02:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
Subsystem: Intel Corporation Wireless 8260
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
Subsystem: Dell RTS525A PCI Express Card Reader
Kernel driver in use: rtsx_pci
Kernel modules: rtsx_pci
04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev 01)
Subsystem: Realtek Semiconductor Co., Ltd. Device 5762
Kernel driver in use: nvme
Here's my journalctl -b output:
nvme_core.default_ps_max_latency_us=5500
Please add this to your kernel parameters and check whether it changes anything.
It is recommended in multiple places as a workaround for buggy low-power states on NVMe drives.
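After rebooting with the parameter, it is worth confirming that it actually reached the kernel. Here is a minimal sketch, assuming a bash shell; cmdline_param is a hypothetical helper name, and the sysfs path in the comment is where I would expect the nvme_core module parameter to show up when the module is loaded.

```shell
#!/usr/bin/env bash
# Print the value of a kernel command-line parameter, given the parameter
# name and the command-line text (normally the contents of /proc/cmdline).
# Prints nothing and returns 1 if the parameter is absent.
# cmdline_param is a hypothetical helper, not an existing tool.
cmdline_param() {
    local key=$1 cmdline=$2 word
    # Word splitting on whitespace is intentional here.
    for word in $cmdline; do
        if [[ $word == "$key="* ]]; then
            echo "${word#"$key="}"
            return 0
        fi
    done
    return 1
}

# On a live system:
#   cmdline_param nvme_core.default_ps_max_latency_us "$(cat /proc/cmdline)"
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

If the first command prints 5500 and the sysfs file shows the same number, the setting is in effect for newly probed controllers.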
I also saw several reports on that; however, the value of 5500 us was for certain Samsung drives.
Still, following https://wiki.archlinux.org/index.php/So … drive/NVMe I looked into my drive using nvme-cli.
[cristos@momentvm ~] $ sudo nvme get-feature -f 0x0c /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
So APST is enabled, yet there are no non-zero values in the table. According to the wiki:
If APST is enabled but no non-zero states appear in the table, the latencies might be too high for any states to be enabled by default. The output of # nvme id-ctrl /dev/nvme[0-9] should show the available non-operational power states of the NVME controller. If the total latency of any state (enlat + xlat) is greater than 25000 (25ms) you must pass a value at least that high as parameter default_ps_max_latency_us for the nvme_core kernel module. This should enable APST and make the table in # nvme get-feature show the entries.
Here's the output of the command:
[cristos@momentvm ~] $ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid : 0x10ec
ssvid : 0x10ec
sn : 2J2020044509
mn : XPG GAMMIX S5
fr : V9001b31
rab : 0
ieee : 00e04c
cmic : 0
mdts : 5
cntlid : 0x1
ver : 0x10300
rtd3r : 0xf4240
rtd3e : 0x1e8480
oaes : 0
ctratt : 0
rrls : 0
crdt1 : 0
crdt2 : 0
crdt3 : 0
oacs : 0x6
acl : 0
aerl : 3
frmw : 0x2
lpa : 0x3
elpe : 7
npss : 0
avscc : 0x1
apsta : 0x1
wctemp : 391
cctemp : 423
mtfa : 20
hmpre : 16384
hmmin : 16384
tnvmcap : 0
unvmcap : 0
rpmbs : 0
edstt : 0
dsto : 0
fwug : 1
kas : 0
hctma : 0
mntmt : 0
mxtmt : 0
sanicap : 0
hmminds : 0
hmmaxd : 0
nsetidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x14
fuses : 0
fna : 0
vwc : 0x1
awun : 0
awupf : 0
nvscc : 1
nwpc : 0
acwu : 0
sgls : 0
mnan : 0
subnqn : nqn.2018-05.com.example:nvme:nvm-subsystem-OUI00E04C
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:50.00W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
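The wiki's "enlat + exlat" check can be sketched as a small filter over the id-ctrl text. This is only an illustration: ps_total_latency is a hypothetical helper name, and it assumes non-operational states are printed in the same one-line "ps N : ... non-operational enlat:X exlat:Y" form as the operational state above.

```shell
#!/usr/bin/env bash
# Read `nvme id-ctrl` text on stdin and, for every non-operational power
# state, print its total transition latency (enlat + exlat) in microseconds.
# ps_total_latency is a hypothetical helper, not part of nvme-cli.
ps_total_latency() {
    awk '
        /^ps +[0-9]+ *:/ && /non-operational/ {
            enlat = exlat = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^enlat:/) enlat = substr($i, 7)   # strip "enlat:"
                if ($i ~ /^exlat:/) exlat = substr($i, 7)   # strip "exlat:"
            }
            printf "ps %s: %d us\n", $2, enlat + exlat
        }
    '
}

# Typical use on a live system:
#   sudo nvme id-ctrl /dev/nvme0 | ps_total_latency
```

For the output above it prints nothing, since ps 0 is the only state and it is operational. That matches npss being 0 and explains why the APST table stays empty regardless of the latency limit.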
Interestingly, the controller defines no non-operational power states (npss is 0 and only ps 0 is listed). I tried disabling APST altogether by adding nvme_core.default_ps_max_latency_us=0 to the kernel command line, used nvme get-feature to confirm that APST is disabled, and ran the test again.
I got a kernel panic after 3-4 minutes of downloading small files, so nothing changed.
I will also try an extremely high latency of 500000 (the default value is 100000) to check whether the states table gets populated and whether that solves the problem (as a long shot).
Anything else I can try to make it work?
Thanks!
As I expected, nothing changed with the 500000 us latency setting. I managed to catch the missing part of the dump: it was a general protection fault. Links to the images are below (sorry for the poor quality, these are single frames extracted from a 60 fps mobile phone video).
https://drive.google.com/open?id=1qlBEF … AARdlEUYYW
https://drive.google.com/open?id=1fWUG- … EYxlV7kdhv
I also noticed an interesting behavior: after each crash, power cycling alone won't bring the drive back. Only running the onboard diagnostics drive test (which fails because no drive is detected) and then rebooting does. WTH?
Is it feasible to file an upstream bug with all the data here?
Thanks!
Last edited by cristos666 (2019-09-13 17:08:39)
It is not only Samsung; it has popped up in other places as well, e.g.: https://community.wd.com/t/linux-suppor … 8/225446/9
Anyway you did test it without success. Thanks, it would have killed me not knowing! :-)
I have seen the same behavior on some GPUs (mostly AMD) that suffer from the reset bug.
You have to shut down the machine and power it back up rather than just reboot, because some internal state is not flushed out. If you could boot, I suspect you would actually receive some PCIe header errors.
Last edited by Swiggles (2019-09-13 18:30:08)
Ok, I got some clarifications on the topic.
I returned the PC since I suspect a defective motherboard and/or NVMe drive.
The errors appeared on Fedora and Windows as well, albeit only after some uptime, not as immediately as on Arch.
The symptoms were exactly the same: freezes for a few seconds, with either "corrected PCIe errors" (when Intel RST was disabled) or "I/O errors/timeouts/retries" (when RST was enabled) in the logs.
I guess this topic can be closed now. Thanks for help!
Please mark it solved or workarounded or some such by editing your initial post's subject line.