
#1 2020-12-19 15:12:21

rep_movsd
Member
Registered: 2013-08-24
Posts: 142

NVME timeout on smartctl

A few weeks back I started getting a freeze about 60 seconds after booting up. Some GUI things would work, but anything that accessed the disk would hang for about 60 seconds.

I have 2 NVMe M.2 SSDs in my laptop.
I see entries like this in dmesg:

[    7.115608] wlan0: associated
[   69.987602] nvme nvme0: I/O 3 QID 0 timeout, reset controller
[  131.479542] nvme nvme0: 8/0/0 default/read/poll queues

As you can see, about 1 minute after boot something triggers an I/O timeout and a controller reset on nvme0. About 1 minute later the reset completes and the system goes back to normal.
I never experience any freezes after that, until I reboot.
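
For anyone who wants to measure the stall from their own log, here is a minimal sketch in Python. It assumes the dmesg lines look like the ones quoted above; the function name stall_seconds and the matched substrings are just illustrative choices, not part of any tool.

```python
import re

# Matches kernel log lines such as "[   69.987602] nvme nvme0: ..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def stall_seconds(dmesg_lines):
    """Seconds between the first 'timeout, reset controller' line and the
    next 'default/read/poll queues' line, or None if the pair is not found."""
    start = None
    for line in dmesg_lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        if start is None and "timeout, reset controller" in msg:
            start = stamp  # controller reset begins here
        elif start is not None and "default/read/poll queues" in msg:
            return stamp - start  # queues are back, the stall is over
    return None


sample = [
    "[    7.115608] wlan0: associated",
    "[   69.987602] nvme nvme0: I/O 3 QID 0 timeout, reset controller",
    "[  131.479542] nvme nvme0: 8/0/0 default/read/poll queues",
]
print(f"{stall_seconds(sample):.2f}")  # prints 61.49
```

On the lines above this gives a stall of roughly 61.5 seconds.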

After a lot of poking around, I realized that smartd.service seems to run smartctl on bootup and that it hangs. For example, when I run it manually:

rep ~  $ sudo smartctl -a /dev/nvme0

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.14-arch1-1] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       PC300 NVMe SK hynix 256GB
Serial Number:                      FS6AN51781090AJ0F
Firmware Version:                   20004A00
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            ace42e 619001c89a
Local Time is:                      Sat Dec 19 20:28:40 2020 IST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x001e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     88 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.87W       -        -    0  0  0  0        5       5
 1 +     2.40W       -        -    1  1  1  1       30      30
 2 +     1.90W       -        -    2  2  2  2      100     100
 3 -   0.1000W       -        -    3  3  3  3     1000    1000
 4 -   0.0060W       -        -    3  3  3  3     1000    5000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    6%
Data Units Read:                    52,282,923 [26.7 TB]
Data Units Written:                 58,979,235 [30.1 TB]
Host Read Commands:                 838,637,005
Host Write Commands:                1,006,823,611
Controller Busy Time:               6,781
Power Cycles:                       5,517
Power On Hours:                     1,745
Unsafe Shutdowns:                   1,487
Media and Data Integrity Errors:    6
Error Information Log Entries:      12,246
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius

Read Error Information Log failed: NVME_IOCTL_ADMIN_CMD: Interrupted system call

It gets to "Temperature Sensor 1:" and then hangs for a minute before exiting with the error.
This disk has a few media errors, but no SMART warning yet.

The exact same thing happens on nvme1, which has 0 media errors:

rep ~  $ s smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.14-arch1-1] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       PC300 NVMe SK hynix 256GB
Serial Number:                      FS6AN51781090AJ05
Firmware Version:                   20004A00
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            ace42e 619001c890
Local Time is:                      Sat Dec 19 20:30:58 2020 IST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0016):   Format Frmw_DL Self_Test
Optional NVM Commands (0x001e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     88 Celsius
Critical Comp. Temp. Threshold:     90 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     5.87W       -        -    0  0  0  0        5       5
 1 +     2.40W       -        -    1  1  1  1       30      30
 2 +     1.90W       -        -    2  2  2  2      100     100
 3 -   0.1000W       -        -    3  3  3  3     1000    1000
 4 -   0.0060W       -        -    3  3  3  3     1000    5000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    8,137,210 [4.16 TB]
Data Units Written:                 14,645,994 [7.49 TB]
Host Read Commands:                 149,112,947
Host Write Commands:                371,566,822
Controller Busy Time:               8,883
Power Cycles:                       5,517
Power On Hours:                     642
Unsafe Shutdowns:                   1,540
Media and Data Integrity Errors:    0
Error Information Log Entries:      589,944
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius

Read Error Information Log failed: NVME_IOCTL_ADMIN_CMD: Interrupted system call

Both drives have a very large count for "Error Information Log Entries", but the output of nvme error-log contains nothing but empty entries like:

Entry[62]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................
 Entry[63]
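
To compare these counters across drives without reading the full report by eye, a small Python sketch like the following could pull single values out of saved smartctl -a text. It assumes the "Name:   1,234" layout shown in the output above; smart_field is a hypothetical helper name, not something smartmontools provides.

```python
import re


def smart_field(report, name):
    """Return the integer value of a 'Name:   1,234' line, or None if absent."""
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            m = re.match(r"\s*([\d,]+)", value)
            if m:
                # Drop the thousands separators that smartctl prints
                return int(m.group(1).replace(",", ""))
    return None


# Two lines copied from the nvme1 report above
report = """\
Media and Data Integrity Errors:    0
Error Information Log Entries:      589,944
"""
print(smart_field(report, "Media and Data Integrity Errors"))  # prints 0
print(smart_field(report, "Error Information Log Entries"))    # prints 589944
```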

I don't think the drives or controller have any problem, since this started after a pacman update.

This command works fine and exits immediately:

smartctl -H -i -c -A /dev/nvme0

Has anyone else experienced this?
Is there a way for me to tell smartd.service to use this set of flags only?

Thanks in advance


#2 2021-02-04 08:49:56

Dhalsimax
Member
Registered: 2021-02-04
Posts: 2

Re: NVME timeout on smartctl

Hello, same here, but I use Gentoo Linux, and it started after some updates to the kernel and KDE Plasma.

I have:

[   90.304975] nvme nvme0: I/O 12 QID 0 timeout, reset controller
[  152.054759] nvme nvme0: 8/0/0 default/read/poll queues

After each logon to KDE, or after issuing a smartctl command, a new timeout/controller reset occurs.

I suspect the glitch appeared after updating the kernel from 5.4.81 to 5.4.92, and that the KDE lockup at boot is due to some hidden low-level SMART command issued by KDE Plasma itself on logon.

It's a relatively fresh bug. I will post it in the Gentoo Bugzilla, and I'm confident someone else using Gentoo will confirm it and maybe find a workaround. I can confirm that the controller and disk work well in Windows, and otherwise in Linux too.


#3 2021-02-04 08:53:42

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 23,430

Re: NVME timeout on smartctl

You can disable KDE's SMART integration in

kcmshell5 kded5

that should lead to KDE not invoking a relevant SMART command in order to mitigate this.


#4 2021-02-04 09:15:51

Dhalsimax
Member
Registered: 2021-02-04
Posts: 2

Re: NVME timeout on smartctl

V1del wrote:

You can disable KDE's SMART integration in

kcmshell5 kded5

that should lead to KDE not invoking a relevant SMART command in order to mitigate this.

Nice, I was wondering what was issuing the command and when. You spared me hours of reading logs, many thanks.
