#1 2019-09-12 16:52:57

cristos666
Member
Registered: 2019-09-12
Posts: 8

[CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

Hi,

I have some issues with my recently acquired Dell Precision 3510 running vanilla Arch Linux. This system is equipped with an XPG GAMMIX S5 M.2 NVMe SSD.

The main issue is a large number of PCIe errors in dmesg, which appear to be corrected and thus benign, like so:

[  367.133164] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
[  367.133174] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  367.133179] pcieport 0000:00:1d.0: AER:   device [8086:a118] error status/mask=00000001/00002000
[  367.133183] pcieport 0000:00:1d.0: AER:    [ 0] RxErr                  (First)
[  398.055258] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[  398.166636] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  398.166638] pcieport 0000:00:1d.0: AER:   device [8086:a118] error status/mask=00002001/00002000
[  398.166639] pcieport 0000:00:1d.0: AER:    [ 0] RxErr                  (First)
[  398.166640] pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
[  398.166661] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  398.166662] nvme 0000:04:00.0: AER:   device [10ec:5762] error status/mask=00000001/00006000
[  398.166664] nvme 0000:04:00.0: AER:    [ 0] RxErr                  (First)
[  423.441203] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:00:1d.0
[  423.552328] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  423.552330] pcieport 0000:00:1d.0: AER:   device [8086:a118] error status/mask=00002001/00002000
[  423.552331] pcieport 0000:00:1d.0: AER:    [ 0] RxErr                  (First)
[  423.552346] pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
[  423.552349] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[  423.552351] nvme 0000:04:00.0: AER:   device [10ec:5762] error status/mask=00000001/00006000
[  423.552352] nvme 0000:04:00.0: AER:    [ 0] RxErr                  (First)
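
To make the status/mask fields in these messages easier to read, here is a minimal Python sketch that decodes them. The bit positions follow the Correctable Error Status register of the PCIe AER capability; the function names and the descriptive labels are made up for illustration and are not what the kernel prints.

```python
# Correctable Error Status bit positions from the PCIe AER capability.
# The labels are descriptive names chosen for this sketch.
CORRECTABLE_BITS = {
    0: "Receiver Error (RxErr)",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def parse_status_mask(line: str) -> tuple:
    """Extract (status, mask) from an 'error status/mask=XXXXXXXX/YYYYYYYY' line."""
    field = line.split("error status/mask=")[1].strip()
    status_hex, mask_hex = field.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


def decode_correctable(status: int, mask: int) -> dict:
    """Split the set status bits into reported and masked error names."""
    reported, masked = [], []
    for bit, name in CORRECTABLE_BITS.items():
        if status & (1 << bit):
            (masked if mask & (1 << bit) else reported).append(name)
    return {"reported": reported, "masked": masked}


line = ("pcieport 0000:00:1d.0: AER:   device [8086:a118] "
        "error status/mask=00002001/00002000")
status, mask = parse_status_mask(line)
print(decode_correctable(status, mask))
```

For the root port line above, bit 0 (the receiver error) is the only unmasked bit, and bit 13 is set but masked, which matches the single RxErr entry the kernel lists.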

But every now and then my PC freezes for several seconds. After that it comes back to life, and this shows up in dmesg:

[ 2686.436788] nvme nvme0: I/O 768 QID 1 timeout, aborting
[ 2686.568963] nvme nvme0: Abort status: 0x0
[ 3104.569357] nvme nvme0: I/O 786 QID 1 timeout, aborting
[ 3104.701442] nvme nvme0: Abort status: 0x0
[ 3196.729120] nvme nvme0: I/O 768 QID 1 timeout, aborting
[ 3196.861227] nvme nvme0: Abort status: 0x0
[ 3275.235571] nvme nvme0: I/O 535 QID 2 timeout, aborting
[ 3275.367650] nvme nvme0: Abort status: 0x0
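
To see how often the timeouts occur and which queues they hit, the lines above can be parsed with a short Python sketch. The regular expression is written against the message format shown here and is an assumption about other kernels; the helper names are invented for the example.

```python
import re

# Matches lines such as:
# [ 2686.436788] nvme nvme0: I/O 768 QID 1 timeout, aborting
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<ctrl>nvme\d+): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, aborting"
)


def parse_timeouts(log: str) -> list:
    """Return (timestamp_seconds, queue_id, command_tag) for each timeout line."""
    events = []
    for line in log.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append((float(m["ts"]), int(m["qid"]), int(m["tag"])))
    return events


def gaps(events: list) -> list:
    """Seconds elapsed between consecutive timeout events."""
    times = [ts for ts, _, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


SAMPLE = """\
[ 2686.436788] nvme nvme0: I/O 768 QID 1 timeout, aborting
[ 2686.568963] nvme nvme0: Abort status: 0x0
[ 3104.569357] nvme nvme0: I/O 786 QID 1 timeout, aborting
[ 3104.701442] nvme nvme0: Abort status: 0x0
[ 3196.729120] nvme nvme0: I/O 768 QID 1 timeout, aborting
[ 3196.861227] nvme nvme0: Abort status: 0x0
[ 3275.235571] nvme nvme0: I/O 535 QID 2 timeout, aborting
[ 3275.367650] nvme nvme0: Abort status: 0x0
"""

events = parse_timeouts(SAMPLE)
print(events)
print(gaps(events))
```

On this sample the timeouts are roughly 78 to 418 seconds apart and affect queues 1 and 2.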

EDIT: Sometimes these timeouts appear in dmesg without freezing the PC.

I tried pci=nomsi, as suggested in various places online. However, with that option the PC runs significantly hotter (idle CPU temperature: 55°C with pci=nomsi, 38°C without). Since this is my main productivity machine (music production), I need MSIs (together with RTIRQ) to achieve low latency, and that part actually works fine.

I also used pci=noaer, which suppresses the error reports. The freezing (I/O aborted due to timeout) still occurs, although somewhat less frequently.

This issue shows up as early as during the Arch Linux installation; however, no data is ever lost and the filesystem stays consistent. BTW, RTIRQ and threaded IRQs don't seem to influence this issue at all.
Moreover, I got two hard freezes today (blinking Caps Lock LED, i.e. a kernel panic); unfortunately, I was unable to determine the cause.

There are no freezes whatsoever when running Win10 on this machine. The NVMe SSD is new, with several hours of runtime reported by smartctl. I also updated the BIOS to the newest version from Dell for this machine (1.24.4). There are no updates to the SSD firmware.

I kindly ask for your help. I've been using Arch on much older Dell machines for ages without any problems. What can I do to solve this issue?
How can I determine the cause of the panics when no dump is displayed (I was running a graphical session with dmesg -w on the console)? The journal just stops at the time of the freeze, with no info there.
Thanks in advance for any help.


Here is some info about the PC:

[cristos@momentvm ~] $ uname -a
Linux momentvm 5.2.13-arch1-1-ARCH #1 SMP PREEMPT Fri Sep 6 17:52:33 UTC 2019 x86_64 GNU/Linux
[cristos@momentvm ~] $ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-linux root=UUID=685aeb9c-5d62-4163-8bf1-e140320e4666 rw radeon.si_support=1 amdgpu.si_support=0 pci=noaer threadirqs loglevel=3

Note: the radeon and amdgpu parameters were my attempt to solve another issue with this machine, posted on this BBS as well.

Interrupts:

[cristos@momentvm ~] $ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
   0:          9          0          0          0          0          0          0          0  IR-IO-APIC    2-edge      timer
   1:      11189          0          0       2957          0          0          0          0  IR-IO-APIC    1-edge      i8042
   8:          0          0          0          0          0          0          1          0  IR-IO-APIC    8-edge      rtc0
   9:        490        470          0          0          0          0          0          0  IR-IO-APIC    9-fasteoi   acpi
  12:     233599          0     109348          0          0          0          0          0  IR-IO-APIC   12-edge      i8042
  16:          0          0          0          0          0          0          0          0  IR-IO-APIC   16-fasteoi   i801_smbus
  31:          0          0          0          0          0          0          0          0  IR-IO-APIC   31-edge      smo8800
 120:          0          0          0          0          0          0          0          0  DMAR-MSI    0-edge      dmar0
 121:          0          0          0          0          0          0          0          0  DMAR-MSI    1-edge      dmar1
 122:          0          0          0          0          0          0          0          0  IR-PCI-MSI 16384-edge      PCIe PME
 123:          0          0          0          0          0          0          0          0  IR-PCI-MSI 458752-edge      PCIe PME
 124:          0          0          0          0          0          0          0          0  IR-PCI-MSI 462848-edge      PCIe PME
 125:          0          0          0          0          0          0          0          0  IR-PCI-MSI 475136-edge      PCIe PME
 126:          0         43          0          0          0          0          0          0  IR-PCI-MSI 360448-edge      mei_me
 127:          0          0       4894          0          0          0          0          0  IR-PCI-MSI 327680-edge      xhci_hcd
 128:          0          0          0         23          0          0          0          0  IR-PCI-MSI 2097152-edge      nvme0q0
 129:          0          0          0          0        149          0          0          0  IR-PCI-MSI 514048-edge      snd_hda_intel:card1
 130:          0          0          0          0          0       1110          0          0  IR-PCI-MSI 520192-edge      enp0s31f6
 131:          0          0          0          0       1175          0          0          0  IR-PCI-MSI 2097153-edge      nvme0q1
 132:          0        304          0          0          0          0          0          0  IR-PCI-MSI 2097154-edge      nvme0q2
 133:          0          0        388          0          0          0          0          0  IR-PCI-MSI 2097155-edge      nvme0q3
 134:         38         42          0          0          0          0          0          0  IR-PCI-MSI 524288-edge      radeon
 135:         13          0          0          0         22          0          0          0  IR-PCI-MSI 1572864-edge      rtsx_pci
 136:          0          0          0        427          0          0          0          0  IR-PCI-MSI 2097156-edge      nvme0q4
 137:          0          0          0          0          0          0          0          0  IR-PCI-MSI 376832-edge      ahci[0000:00:17.0]
 138:          0          0          0          0          0        288          0          0  IR-PCI-MSI 2097157-edge      nvme0q5
 139:          0          0          0          0          0          0        202          0  IR-PCI-MSI 2097158-edge      nvme0q6
 140:       4288          0          0          0       3169          0          0          0  IR-PCI-MSI 1048576-edge      iwlwifi
 141:     107122          0          0          0          0      52512          0          0  IR-PCI-MSI 32768-edge      i915
 142:          0          0          0          0          0          0          0        309  IR-PCI-MSI 2097159-edge      nvme0q7
 NMI:          1          3          3          2          2          2          2          2   Non-maskable interrupts
 LOC:     198058     146347     153115     130486     135367     137948     135278     134512   Local timer interrupts
 SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
 PMI:          1          3          3          2          2          2          2          2   Performance monitoring interrupts
 IWI:        526          6          8         10          7        129          9          5   IRQ work interrupts
 RTR:          6          0          0          0          0          0          0          0   APIC ICR read retries
 RES:      29969      19686      11233       7545       6791       6101       5586       5743   Rescheduling interrupts
 CAL:      10124      10589      10611      11430       9431      10948      11996      10447   Function call interrupts
 TLB:      12604      13067      13250      13621      10154      13521      14177      13371   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:         12         13         13         13         13         13         13         13   Machine check polls
 HYP:          0          0          0          0          0          0          0          0   Hypervisor callback interrupts
 HRE:          0          0          0          0          0          0          0          0   Hyper-V reenlightenment interrupts
 HVS:          0          0          0          0          0          0          0          0   Hyper-V stimer0 interrupts
 ERR:          2
 MIS:          0
 PIN:          0          0          0          0          0          0          0          0   Posted-interrupt notification event
 NPI:          0          0          0          0          0          0          0          0   Nested posted-interrupt event
 PIW:          0          0          0          0          0          0          0          0   Posted-interrupt wakeup event
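
To check which CPU services each NVMe queue in a table like the one above, a small Python sketch can pick out the nvme rows. It assumes the layout shown here (a header row of CPU names, per-CPU counts, and the handler name in the last column); the function name is invented for the example.

```python
def nvme_queue_cpus(text: str) -> dict:
    """Map each nvme IRQ handler to {cpu_index: interrupt_count} for non-zero counts."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        fields = line.split()
        # Rows of interest end with the handler name, e.g. nvme0q1.
        if not fields or not fields[-1].startswith("nvme"):
            continue
        counts = [int(c) for c in fields[1:1 + ncpu]]
        result[fields[-1]] = {cpu: n for cpu, n in enumerate(counts) if n}
    return result


SAMPLE = """\
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 128:          0          0          0         23          0          0          0          0  IR-PCI-MSI 2097152-edge      nvme0q0
 131:          0          0          0          0       1175          0          0          0  IR-PCI-MSI 2097153-edge      nvme0q1
 132:          0        304          0          0          0          0          0          0  IR-PCI-MSI 2097154-edge      nvme0q2
 141:     107122          0          0          0          0      52512          0          0  IR-PCI-MSI 32768-edge      i915
 ERR:          2
"""

print(nvme_queue_cpus(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/interrupts read as text.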

Smartctl output:

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.2.13-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       XPG GAMMIX S5
Serial Number:                      2J2020044509
Firmware Version:                   V9001b31
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            525433 334e323035
Local Time is:                      Thu Sep 12 18:45:28 2019 CEST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     118 Celsius
Critical Comp. Temp. Threshold:     150 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    50.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    1,344,890 [688 GB]
Data Units Written:                 2,102,462 [1.07 TB]
Host Read Commands:                 9,151,608
Host Write Commands:                10,553,286
Controller Busy Time:               0
Power Cycles:                       35
Power On Hours:                     16
Unsafe Shutdowns:                   19
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 8 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          1     0  0x0000  0x0000  0x000            0     0     -
  6 1219368206019409729     0  0x0000  0x0000  0x000            0     0     -
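
As a sanity check on the counters above, the bracketed sizes can be reproduced from the raw data-unit counts. This Python sketch relies on the NVMe SMART log convention that one data unit is 1000 units of 512 bytes; the helper name is made up.

```python
# NVMe SMART/Health log: one "data unit" is 1000 units of 512 bytes.
DATA_UNIT_BYTES = 1000 * 512


def data_units_to_bytes(units: int) -> int:
    """Convert a Data Units Read/Written counter to bytes."""
    return units * DATA_UNIT_BYTES


read_bytes = data_units_to_bytes(1_344_890)
written_bytes = data_units_to_bytes(2_102_462)

print(f"read:    {read_bytes / 1e9:.1f} GB")
print(f"written: {written_bytes / 1e12:.3f} TB")
```

Both results agree with the [688 GB] and [1.07 TB] figures that smartctl prints next to the counters.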

Modules:

[cristos@momentvm ~] $ lsmod
Module                  Size  Used by
ccm                    20480  3
joydev                 28672  0
mousedev               24576  0
snd_hda_codec_hdmi     69632  1
arc4                   16384  2
intel_rapl             28672  0
x86_pkg_temp_thermal    20480  0
intel_powerclamp       20480  0
coretemp               20480  0
kvm_intel             311296  0
uvcvideo              114688  0
snd_hda_codec_realtek   126976  1
videobuf2_vmalloc      20480  1 uvcvideo
iwlmvm                466944  0
videobuf2_memops       20480  1 videobuf2_vmalloc
snd_hda_codec_generic    94208  1 snd_hda_codec_realtek
kvm                   770048  1 kvm_intel
videobuf2_v4l2         28672  1 uvcvideo
irqbypass              16384  1 kvm
iTCO_wdt               16384  0
videobuf2_common       57344  2 videobuf2_v4l2,uvcvideo
nls_iso8859_1          16384  1
nls_cp437              20480  1
snd_hda_intel          53248  3
crct10dif_pclmul       16384  1
dell_wmi               20480  0
crc32_pclmul           16384  0
mac80211              999424  1 iwlmvm
mei_hdcp               24576  0
mei_wdt                16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
intel_wmi_thunderbolt    20480  0
wmi_bmof               16384  0
vfat                   24576  1
dell_laptop            28672  0
sparse_keymap          16384  1 dell_wmi
ghash_clmulni_intel    16384  0
fat                    86016  1 vfat
ledtrig_audio          16384  3 snd_hda_codec_generic,snd_hda_codec_realtek,dell_laptop
snd_usb_audio         278528  0
videodev              237568  3 videobuf2_v4l2,uvcvideo,videobuf2_common
ppdev                  24576  0
snd_hda_codec         159744  4 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec_realtek
aesni_intel           372736  2
snd_usbmidi_lib        40960  1 snd_usb_audio
iwlwifi               389120  1 iwlmvm
dell_smbios            32768  2 dell_wmi,dell_laptop
snd_rawmidi            45056  1 snd_usbmidi_lib
dell_wmi_descriptor    20480  2 dell_wmi,dell_smbios
btusb                  57344  0
snd_hda_core          102400  5 snd_hda_codec_generic,snd_hda_codec_hdmi,snd_hda_intel,snd_hda_codec,snd_hda_codec_realtek
aes_x86_64             20480  1 aesni_intel
i915                 2265088  32
crypto_simd            16384  1 aesni_intel
btrtl                  20480  1 btusb
cryptd                 24576  2 crypto_simd,ghash_clmulni_intel
dcdbas                 20480  1 dell_smbios
snd_seq_device         16384  1 snd_rawmidi
glue_helper            16384  1 aesni_intel
dell_smm_hwmon         20480  0
intel_cstate           16384  0
btbcm                  16384  1 btusb
media                  61440  5 videodev,snd_usb_audio,videobuf2_v4l2,uvcvideo,videobuf2_common
btintel                28672  1 btusb
intel_uncore          139264  0
snd_hwdep              20480  2 snd_usb_audio,snd_hda_codec
intel_rapl_perf        16384  0
psmouse               180224  0
bluetooth             675840  5 btrtl,btintel,btbcm,btusb
pcspkr                 16384  0
input_leds             16384  0
e1000e                286720  0
cfg80211              856064  3 iwlmvm,iwlwifi,mac80211
snd_pcm               135168  5 snd_hda_codec_hdmi,snd_hda_intel,snd_usb_audio,snd_hda_codec,snd_hda_core
rtsx_pci_ms            24576  0
i2c_i801               36864  0
snd_timer              40960  1 snd_pcm
memstick               20480  1 rtsx_pci_ms
mei_me                 45056  2
ecdh_generic           16384  1 bluetooth
snd                   110592  18 snd_hda_codec_generic,snd_seq_device,snd_hda_codec_hdmi,snd_hwdep,snd_hda_intel,snd_usb_audio,snd_usbmidi_lib,snd_hda_codec,snd_hda_codec_realtek,snd_timer,snd_pcm,snd_rawmidi
tpm_crb                20480  0
soundcore              16384  1 snd
ecc                    32768  1 ecdh_generic
pcc_cpufreq            20480  0
mei                   126976  5 mei_wdt,mei_hdcp,mei_me
processor_thermal_device    20480  0
intel_gtt              24576  1 i915
intel_soc_dts_iosf     20480  1 processor_thermal_device
intel_pch_thermal      16384  0
wmi                    36864  5 intel_wmi_thunderbolt,dell_wmi,wmi_bmof,dell_smbios,dell_wmi_descriptor
parport_pc             53248  0
parport                61440  2 parport_pc,ppdev
evdev                  24576  16
tpm_tis                16384  0
mac_hid                16384  0
tpm_tis_core           24576  1 tpm_tis
dell_smo8800           20480  0
tpm                    73728  3 tpm_tis,tpm_crb,tpm_tis_core
int3400_thermal        20480  0
dell_rbtn              20480  0
int3403_thermal        16384  0
battery                24576  0
rfkill                 28672  9 bluetooth,dell_laptop,dell_rbtn,cfg80211
int340x_thermal_zone    16384  2 int3403_thermal,processor_thermal_device
acpi_thermal_rel       16384  1 int3400_thermal
rng_core               16384  1 tpm
ac                     16384  0
sg                     40960  0
crypto_user            16384  0
ip_tables              36864  0
x_tables               49152  1 ip_tables
ext4                  770048  2
crc32c_generic         16384  0
crc16                  16384  2 bluetooth,ext4
mbcache                16384  1 ext4
jbd2                  135168  1 ext4
rtsx_pci_sdmmc         32768  0
serio_raw              20480  0
mmc_core              184320  1 rtsx_pci_sdmmc
atkbd                  36864  0
libps2                 20480  2 atkbd,psmouse
ahci                   40960  0
libahci                40960  1 ahci
libata                282624  2 libahci,ahci
xhci_pci               20480  0
crc32c_intel           24576  3
xhci_hcd              278528  1 xhci_pci
scsi_mod              249856  2 libata,sg
rtsx_pci               81920  2 rtsx_pci_sdmmc,rtsx_pci_ms
i8042                  32768  1 dell_laptop
serio                  28672  6 serio_raw,atkbd,psmouse,i8042
radeon               1646592  1
amdgpu               3973120  0
gpu_sched              36864  1 amdgpu
i2c_algo_bit           16384  3 amdgpu,radeon,i915
drm_kms_helper        225280  3 amdgpu,radeon,i915
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
ttm                   118784  2 amdgpu,radeon
drm                   503808  17 gpu_sched,drm_kms_helper,amdgpu,radeon,i915,ttm
agpgart                53248  3 intel_gtt,ttm,drm

Lspci for the drive:

04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev ec) (prog-if 10)
        Subsystem: Realtek Semiconductor Co., Ltd. Device 5762
        Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV+ VGASnoop+ ParErr+ Stepping+ SERR+ FastB2B+ DisINTx+
        Status: Cap+ 66MHz+ UDF+ FastB2B+ ParErr+ DEVSEL=?? >TAbort+ <TAbort+ <MAbort+ >SERR+ <PERR+ INTx+
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        Region 0: Memory at ef000000 (64-bit, non-prefetchable) [size=16K]
        Region 5: Memory at ef004000 (32-bit, non-prefetchable) [size=8K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Via message/WAKE#
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [148 v1] Device Serial Number 00-00-00-01-00-4c-e0-00
        Capabilities: [158 v1] Secondary PCI Express <?>
        Capabilities: [178 v1] Latency Tolerance Reporting
                Max snoop latency: 3145728ns
                Max no snoop latency: 3145728ns
        Capabilities: [180 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=60us PortTPowerOnTime=60us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme

Last edited by cristos666 (2019-09-24 05:50:31)


#2 2019-09-13 10:02:47

Lone_Wolf
Forum Moderator
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,925

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

Please post the full lspci -k and journalctl -b output.
For the journal output you should use a pastebin client.


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)


#3 2019-09-13 12:06:20

cristos666
Member
Registered: 2019-09-12
Posts: 8

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

Hi,

I can now trigger a kernel panic on demand.
It happens when I start 10 simultaneous FTP transfers with FileZilla, moving literally 200k small files (around 1M or less each). After about 2-3 minutes this happens:
https://drive.google.com/open?id=13MLaq … q_EPOum-EN
(Sorry for posting a link to a photo, but this was the only way of catching the panic.)

After a reboot the drive is not detected by the system; another power cycle is needed for the system to see it again. Is the drive failing?

Here's my lspci -k output:

[cristos@momentvm ~] $ lspci -k
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
        Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers
        Kernel driver in use: skl_uncore
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
        Kernel driver in use: pcieport
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
        DeviceName:  Onboard IGD
        Subsystem: Dell HD Graphics 530
        Kernel driver in use: i915
        Kernel modules: i915
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 07)
        Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
        Kernel driver in use: proc_thermal
        Kernel modules: processor_thermal_device
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
        Subsystem: Dell 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
        Subsystem: Dell 100 Series/C230 Series Chipset Family Thermal Subsystem
        Kernel driver in use: intel_pch_thermal
        Kernel modules: intel_pch_thermal
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
        Subsystem: Dell 100 Series/C230 Series Chipset Family MEI Controller
        Kernel driver in use: mei_me
        Kernel modules: mei_me
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
        Subsystem: Dell Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode]
        Kernel driver in use: ahci
        Kernel modules: ahci
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #2 (rev f1)
        Kernel driver in use: pcieport
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
        Kernel driver in use: pcieport
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
        Kernel driver in use: pcieport
00:1f.0 ISA bridge: Intel Corporation CM236 Chipset LPC/eSPI Controller (rev 31)
        Subsystem: Dell CM236 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
        Subsystem: Dell 100 Series/C230 Series Chipset Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
        Subsystem: Dell 100 Series/C230 Series Chipset Family HD Audio Controller
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
        Subsystem: Dell 100 Series/C230 Series Chipset Family SMBus
        Kernel driver in use: i801_smbus
        Kernel modules: i2c_i801
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
        Subsystem: Dell Ethernet Connection (2) I219-LM
        Kernel driver in use: e1000e
        Kernel modules: e1000e
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X] (rev 87)
        Subsystem: Dell Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X]
        Kernel driver in use: amdgpu
        Kernel modules: radeon, amdgpu
02:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
        Subsystem: Intel Corporation Wireless 8260
        Kernel driver in use: iwlwifi
        Kernel modules: iwlwifi
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
        Subsystem: Dell RTS525A PCI Express Card Reader
        Kernel driver in use: rtsx_pci
        Kernel modules: rtsx_pci
04:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. Device 5762 (rev 01)
        Subsystem: Realtek Semiconductor Co., Ltd. Device 5762
        Kernel driver in use: nvme

Here's my journalctl -b output:

https://hastebin.com/ebazejubel.coffeescript


#4 2019-09-13 13:43:46

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

nvme_core.default_ps_max_latency_us=5500

Please add this to your kernel parameters and check whether it changes anything for you.
It is mentioned in multiple places as a workaround for buggy low-power states on NVMe drives.
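
To confirm whether the parameter took effect after a reboot, the value the running kernel uses can be read back from sysfs. The Python sketch below assumes the nvme_core module exposes the parameter under /sys/module/nvme_core/parameters/; the helper names are invented for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the nvme_core module parameter.
PARAM = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def current_max_latency_us(path: Path = PARAM):
    """Return the value the running kernel uses, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def cmdline_option(latency_us: int) -> str:
    """Build the kernel command line option for a given latency limit."""
    return f"nvme_core.default_ps_max_latency_us={latency_us}"


print("suggested option:", cmdline_option(5500))
print("currently active:", current_max_latency_us())
```

If the second line prints 5500 after rebooting, the option from the kernel command line was applied.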


#5 2019-09-13 16:24:43

cristos666
Member
Registered: 2019-09-12
Posts: 8

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

I also saw several reports on that; however, the value of 5500 us was for certain Samsung drives.

Still, following https://wiki.archlinux.org/index.php/So … drive/NVMe I looked into my drive using nvme-cli.

[cristos@momentvm ~] $ sudo nvme get-feature -f 0x0c /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
       0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
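
For readers who want to interpret such a dump: a minimal sketch of decoding the 256-byte table, assuming the layout from the NVMe specification's APST data structure (32 little-endian 64-bit entries, target power state in bits 7:3, idle time in milliseconds in bits 31:8). The function name decode_apst is made up for illustration.

```python
import struct
from typing import List, Tuple

APST_ENTRIES = 32  # 32 entries of 8 bytes each = 256 bytes


def decode_apst(table: bytes) -> List[Tuple[int, int, int]]:
    """Return (from_state, target_state, idle_ms) for every non-zero entry."""
    if len(table) != APST_ENTRIES * 8:
        raise ValueError("expected a 256-byte APST table")
    result = []
    for state, (entry,) in enumerate(struct.iter_unpack("<Q", table)):
        if entry == 0:
            continue  # no autonomous transition configured from this state
        target = (entry >> 3) & 0x1F        # bits 7:3, Idle Transition Power State
        idle_ms = (entry >> 8) & 0xFFFFFF   # bits 31:8, Idle Time Prior to Transition
        result.append((state, target, idle_ms))
    return result


# An all-zero table, like the dump above, contains no transitions.
print(decode_apst(bytes(256)))  # []
```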

So APST is enabled, yet there are no non-zero entries in the table. According to the wiki:

If APST is enabled but no non-zero states appear in the table, the latencies might be too high for any states to be enabled by default. The output of # nvme id-ctrl /dev/nvme[0-9] should show the available non-operational power states of the NVME controller. If the total latency of any state (enlat + xlat) is greater than 25000 (25ms) you must pass a value at least that high as parameter default_ps_max_latency_us for the nvme_core kernel module. This should enable APST and make the table in # nvme get-feature show the entries.

Here's the output of the command:

[cristos@momentvm ~] $ sudo nvme id-ctrl /dev/nvme0n1            
NVME Identify Controller:
vid       : 0x10ec
ssvid     : 0x10ec
sn        : 2J2020044509        
mn        : XPG GAMMIX S5                           
fr        : V9001b31
rab       : 0
ieee      : 00e04c
cmic      : 0
mdts      : 5
cntlid    : 0x1
ver       : 0x10300
rtd3r     : 0xf4240
rtd3e     : 0x1e8480
oaes      : 0
ctratt    : 0
rrls      : 0
crdt1     : 0
crdt2     : 0
crdt3     : 0
oacs      : 0x6
acl       : 0
aerl      : 3
frmw      : 0x2
lpa       : 0x3
elpe      : 7
npss      : 0
avscc     : 0x1
apsta     : 0x1
wctemp    : 391
cctemp    : 423
mtfa      : 20
hmpre     : 16384
hmmin     : 16384
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 1
kas       : 0
hctma     : 0
mntmt     : 0
mxtmt     : 0
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x14
fuses     : 0
fna       : 0
vwc       : 0x1
awun      : 0
awupf     : 0
nvscc     : 1
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    : nqn.2018-05.com.example:nvme:nvm-subsystem-OUI00E04C
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:50.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

Interestingly, the controller defines no power states other than the operational ps 0 (npss is 0). I tried disabling APST altogether by adding nvme_core.default_ps_max_latency_us=0 to the kernel command line, used nvme-cli get-feature to confirm that APST was disabled, and started the test.

I got a kernel panic after 3-4 minutes of downloading small files, so nothing changed.

I will also try an extremely high latency of 500000 (the default value is 100000) to check whether the states table gets populated and whether that solves the problem (as a long shot).
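
The rule quoted from the wiki (compare enlat + exlat of each non-operational state with the latency limit) can be sketched as follows. This is a simplified illustration, not the kernel's exact selection logic; the function name usable_idle_states and the regular expression are assumptions based on the "ps" line format shown in the output above.

```python
import re
from typing import List, Tuple

# Matches lines such as:
# ps    0 : mp:50.00W operational enlat:0 exlat:0 rrt:0 rrl:0
PS_LINE = re.compile(
    r"^ps\s+(\d+)\s*:\s*mp:\S+\s+(operational|non-operational)\s+"
    r"enlat:(\d+)\s+exlat:(\d+)",
    re.MULTILINE,
)


def usable_idle_states(id_ctrl_text: str, max_latency_us: int) -> List[Tuple[int, int]]:
    """Return (state, enlat + exlat) for non-operational states within the limit."""
    states = []
    for num, kind, enlat, exlat in PS_LINE.findall(id_ctrl_text):
        total = int(enlat) + int(exlat)
        if kind == "non-operational" and total <= max_latency_us:
            states.append((int(num), total))
    return states


# The only power state line this drive reports.
sample = "ps    0 : mp:50.00W operational enlat:0 exlat:0 rrt:0 rrl:0\n"
print(usable_idle_states(sample, 500000))  # []
```

With only the operational ps 0 reported, this sketch finds no candidate state at any limit, so under this simplified rule a larger value would not be expected to populate the table.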

Anything else I can try to make it work?

Thanks!


#6 2019-09-13 17:05:45

cristos666
Member
Registered: 2019-09-12
Posts: 8

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

As I expected, nothing changed with the 500000 us latency limit. I managed to capture the missing part of the dump: it was a General Protection Fault. The links to the images are below. (Sorry for the poor quality; these are single frames extracted from a 60 fps mobile phone video.)

https://drive.google.com/open?id=1qlBEF … AARdlEUYYW
https://drive.google.com/open?id=1fWUG- … EYxlV7kdhv

I also noticed an interesting behavior: after each crash, power cycling alone won't bring it back. Only running the onboard Diagnostics drive test (which fails because no drive is detected) followed by a reboot does. WTH?

Is it feasible to file an upstream bug with all the data here?

Thanks!

Last edited by cristos666 (2019-09-13 17:08:39)


#7 2019-09-13 18:29:37

Swiggles
Member
Registered: 2014-08-02
Posts: 266

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

It is not only Samsung; the issue has popped up in other places as well, e.g.: https://community.wd.com/t/linux-suppor … 8/225446/9
Anyway, you tested it without success. Thanks, it would have killed me not knowing! :-)

I have seen the same behavior on some GPUs (mostly AMD) that suffer from the reset bug.
You have to shut down the machine and power it back up rather than just reboot, because some internal state is not flushed out. If you could boot, I suspect you would actually get some PCIe header errors.

Last edited by Swiggles (2019-09-13 18:30:08)


#8 2019-09-19 06:44:30

cristos666
Member
Registered: 2019-09-12
Posts: 8

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

Ok, I got some clarifications on the topic.

I returned the PC since I suspect a defective motherboard and/or NVMe drive.

The errors also appeared on Fedora and Windows, albeit after some uptime rather than almost immediately as on Arch.
The symptoms were exactly the same: freezes for a few seconds, with either "corrected PCIe errors" (when Intel RST was disabled) or "I/O errors/timeouts/retries" (when RST was enabled) in the logs.

I guess this topic can be closed now. Thanks for the help!


#9 2019-09-19 08:37:23

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,743

Re: [CLOSED] PCIe issues, NVMe storage I/O timeout on vanilla Arch kernel

Please mark it as solved, worked around, or similar by editing the subject line of your initial post.

