
#1 2014-10-26 16:05:47

Ghostofkendo
Member
From: France
Registered: 2008-10-14
Posts: 10

SSD NCQ errors after kernel upgrade (with intel-ucode)

Hello,

Problem: At boot, after the kernel and the initial ramdisk have loaded, I get SATA errors regarding NCQ. As a result, my laptop (Sony Vaio Pro 13") takes 2 minutes to reach the login manager instead of the usual 5 seconds.
This matches exactly the symptoms described on the wiki about NCQ errors on SSDs. I tried disabling NCQ with "libata.force=noncq" in GRUB, and it works (no more errors, 5-second boot).
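
To confirm that the workaround is really active on a running system, the kernel command line can be inspected. The following is only a minimal sketch: the function names ncq_disabled and read_cmdline are made up for illustration, and the parsing assumes the documented libata.force format (a comma-separated list of "[ID:]VAL" entries).

```python
def ncq_disabled(cmdline: str) -> bool:
    """Return True if a libata.force option on the command line contains noncq."""
    for token in cmdline.split():
        if not token.startswith("libata.force="):
            continue
        values = token.split("=", 1)[1].split(",")
        # Each value may carry a port/device prefix, e.g. "1.00:noncq".
        if any(value.split(":")[-1] == "noncq" for value in values):
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

On the affected machine, ncq_disabled(read_cmdline()) would be expected to return True once the GRUB entry carries the parameter. This only checks that the option was passed, not that the drive honours it.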

What intrigues me is that these errors appeared right after I upgraded to Linux kernel 3.17.1 and started using intel-ucode (v20140913), as recommended by the latest news item.
So I tried all the different combinations of:
- downgrading to linux 3.16.4
- loading (or not) intel-ucode.img before initramfs-linux.img
- downgrading to intel-ucode 20140624
But the NCQ errors still occur.
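
For reference, the combinations listed above can be enumerated as a small test matrix. This is a sketch under one assumption: "not loading intel-ucode.img" is treated as a third microcode state next to the two package versions, which gives six distinct boot configurations. The names KERNELS, MICROCODE and boot_configurations are illustrative only.

```python
from itertools import product

KERNELS = ["3.17.1", "3.16.4"]
# None stands for booting without intel-ucode.img (assumption, see above).
MICROCODE = [None, "20140913", "20140624"]


def boot_configurations():
    """Return every (kernel, microcode) pair as a list of tuples."""
    return list(product(KERNELS, MICROCODE))


for kernel, ucode in boot_configurations():
    label = "not loaded" if ucode is None else ucode
    print(f"linux {kernel}, intel-ucode {label}")
```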

I checked whether my SSD was starting to fail (by chance, at the same time I did the upgrade), but even an extended SMART test did not report issues:

smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.17.1-1-ARCH] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZHPU256HCGL-00000
Serial Number:    S1ADNYADB07524
LU WWN Device Id: 5 002538 6000286f8
Firmware Version: UXM6401Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 26 16:55:17 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x5f) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Abort Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1334
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       782
177 Wear_Leveling_Count     0x0013   097   097   017    Pre-fail  Always       -       205
178 Used_Rsvd_Blk_Cnt_Chip  0x0013   091   091   010    Pre-fail  Always       -       236
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   091   091   010    Pre-fail  Always       -       462
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   091   091   010    Pre-fail  Always       -       4786
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         0         -
# 2  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
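
The attribute table above can also be checked mechanically. The sketch below assumes the ten-column layout shown in this report and the usual SMART convention that an attribute counts as failed once its normalized VALUE drops to or below a non-zero THRESH. The names SmartAttr, parse_attributes and at_or_below_threshold are hypothetical helpers, not part of smartmontools.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


def parse_attributes(report: str) -> List[SmartAttr]:
    """Extract the rows that follow the 'ID# ATTRIBUTE_NAME' header line."""
    attrs = []
    in_table = False
    for line in report.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True
            continue
        if in_table:
            # The table ends at the first line that is not an attribute row.
            if len(fields) < 10 or not fields[0].isdigit():
                break
            attrs.append(SmartAttr(int(fields[0]), fields[1], int(fields[3]),
                                   int(fields[4]), int(fields[5]),
                                   " ".join(fields[9:])))
    return attrs


def at_or_below_threshold(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Return attributes whose normalized value has reached a non-zero threshold."""
    return [a for a in attrs if a.thresh > 0 and a.value <= a.thresh]


# Three rows copied from the report above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1334
177 Wear_Leveling_Count     0x0013   097   097   017    Pre-fail  Always       -       205
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
"""

print(at_or_below_threshold(parse_attributes(SAMPLE)))
```

For the rows in this report, every VALUE is well above its THRESH, which agrees with the PASSED overall-health result.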

Since the latest Intel microcode update seems to have triggered these errors, I am not sure what I should do.
I do not feel comfortable just silencing the errors by disabling NCQ.
Should I try to upgrade my SSD's firmware as suggested by the wiki article? Or report a bug?

Any insight would be greatly appreciated.
Thanks in advance

#2 2014-10-26 21:08:23

frostschutz
Member
Registered: 2013-11-15
Posts: 1,637

Re: SSD NCQ errors after kernel upgrade (with intel-ucode)

Many drives have issues with NCQ. I actually see better overall performance w/o NCQ, so I'm happy to keep it disabled globally.

Stopping errors from occurring is quite different from silencing them (letting them occur but ignoring them).

I'm using 3.17.1 and the latest microcode myself. Since I have NCQ disabled anyway, I naturally didn't notice any change, so I'm not sure whether it could be related at all. If the issue persists, it's more likely a coincidence.

Last edited by frostschutz (2014-10-26 21:08:51)
