#1 2019-05-01 19:37:10

Koatao
Member
Registered: 2018-08-30
Posts: 92

[SOLVED] HDD/ATA error: SError { CommWake }/Error { IDNF }

Hello,

I'm getting errors from my SATA controller or hard drive (if I understand correctly), and I'm having a hard time understanding what causes the issue (and why).
I have little to no knowledge about hard drives and SATA, so if someone could point me in the right direction to understand my problem, it would be much appreciated.

The error I'm facing is:

May 01 05:20:13 archlen0d1 kernel: ata1.00: exception Emask 0x0 SAct 0x100000e0 SErr 0x40000 action 0x0
May 01 05:20:13 archlen0d1 kernel: ata1.00: irq_stat 0x40000008
May 01 05:20:13 archlen0d1 kernel: ata1: SError: { CommWake }
May 01 05:20:13 archlen0d1 kernel: ata1.00: failed command: WRITE FPDMA QUEUED
May 01 05:20:13 archlen0d1 kernel: ata1.00: cmd 61/40:28:40:1e:d2/05:00:4f:00:00/40 tag 5 ncq dma 688128 out
                                            res 41/10:00:40:1e:d2/00:00:4f:00:00/00 Emask 0x481 (invalid argument) <F>
May 01 05:20:13 archlen0d1 kernel: ata1.00: status: { DRDY ERR }
May 01 05:20:13 archlen0d1 kernel: ata1.00: error: { IDNF }
May 01 05:20:13 archlen0d1 kernel: ata1: hard resetting link
May 01 05:20:13 archlen0d1 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 01 05:20:13 archlen0d1 kernel: ata1.00: configured for UDMA/133
May 01 05:20:13 archlen0d1 kernel: ata1: EH complete

Once this error appears, it repeats every 10 seconds. At some point the system more or less crashes, forcing me to reboot to get rid of it.
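
For reference, the cmd/res lines in the log above can be decoded by hand. Below is a minimal Python sketch, assuming the field order the kernel's libata uses when it prints a taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the usual NCQ convention that the sector count lives in the feature fields and the tag in the upper bits of nsect. The helper names (`parse_taskfile`, `lba48`, `flag_names`) are made up for illustration, and the bit tables are only a subset.

```python
# Subset of the ATA status and error register bits (assumed sufficient here).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flag_names(value, table):
    """Return the names of the bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def parse_taskfile(text):
    """Split 'aa/bb:cc:dd:ee:ff/gg:hh:ii:jj:kk/ll' into a dict of integers."""
    first, low, high, device = text.strip().split("/")
    names = ("feature", "nsect", "lbal", "lbam", "lbah")
    # 'first' is the command byte on a cmd line and the status byte on a res line.
    tf = {"first": int(first, 16), "device": int(device, 16)}
    for name, value in zip(names, low.split(":")):
        tf[name] = int(value, 16)
    for name, value in zip(names, high.split(":")):
        tf["hob_" + name] = int(value, 16)
    return tf


def lba48(tf):
    """Assemble the 48-bit LBA from the six LBA bytes."""
    return (tf["hob_lbah"] << 40 | tf["hob_lbam"] << 32 | tf["hob_lbal"] << 24
            | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"])


# Taskfiles copied from the log above.
cmd = parse_taskfile("61/40:28:40:1e:d2/05:00:4f:00:00/40")
res = parse_taskfile("41/10:00:40:1e:d2/00:00:4f:00:00/00")

# NCQ: sector count is split over feature/hob_feature, tag is nsect bits 7:3.
sectors = cmd["hob_feature"] << 8 | cmd["feature"]
tag = cmd["nsect"] >> 3

print("command", hex(cmd["first"]), "tag", tag)
print("LBA", lba48(cmd), "sectors", sectors, "bytes", sectors * 512)
print("status", flag_names(res["first"], STATUS_BITS))
print("error", flag_names(res["feature"], ERROR_BITS))
```

With 512-byte logical sectors, the decoded length agrees with the "ncq dma 688128 out" and "tag 5" parts of the log line, which is a quick sanity check that the assumed field order is right.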

Here are some logs from the journal and information from the lshw and lspci commands:

Full journal logs: https://pastebin.com/fsKj6zhu

lspci -vv for the SATA controller:

00:17.0 SATA controller: Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] (rev 21) (prog-if 01 [AHCI 1.0])
	Subsystem: Lenovo Sunrise Point-LP SATA Controller [AHCI mode]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 123
	Region 0: Memory at a3130000 (32-bit, non-prefetchable) [size=8K]
	Region 1: Memory at a3138000 (32-bit, non-prefetchable) [size=256]
	Region 2: I/O ports at 5080 [size=8]
	Region 3: I/O ports at 5088 [size=4]
	Region 4: I/O ports at 5060 [size=32]
	Region 5: Memory at a3136000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00298  Data: 0000
	Capabilities: [70] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] SATA HBA v1.0 BAR4 Offset=00000004
	Kernel driver in use: ahci
	Kernel modules: ahci

lshw -c storage (SATA controller):

  *-storage
       description: SATA controller
       product: Sunrise Point-LP SATA Controller [AHCI mode]
       vendor: Intel Corporation
       physical id: 17
       bus info: pci@0000:00:17.0
       version: 21
       width: 32 bits
       clock: 66MHz
       capabilities: storage msi pm ahci_1.0 bus_master cap_list
       configuration: driver=ahci latency=0
       resources: irq:123 memory:a3130000-a3131fff memory:a3138000-a31380ff ioport:5080(size=8) ioport:5088(size=4) ioport:5060(size=32) memory:a3136000-a31367ff

Full lshw output: https://pastebin.com/4sVaUR2F

smartctl --info /dev/sda output:

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.10-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10SPZX-24Z10T0
Serial Number:    WD-WX31A18L8YEE
LU WWN Device Id: 5 0014ee 6b30e4177
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May  1 08:20:05 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

So I'm kinda clueless about how to troubleshoot this. I've run the SMART extended test (using smartctl), and it passed without error.
I don't even know if it is a hardware or software issue... I don't know how to reproduce the issue.
I did watch the system logs when it appeared, but I didn't record them (bad bad me ^^). It ended with some failed SIGKILLs running in a loop until nothing more happened. I will capture it the next time it occurs.
Am I missing something here?
I've read https://wiki.unraid.net/The_Analysis_of_Drive_Issues but my HDD seems fine.
If you have any documentation I can read related to this issue, I will take that too.


PS: forgive me for my poor English writing skills.

Last edited by Koatao (2019-05-21 06:24:00)

#2 2019-05-01 20:08:59

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,620

Please post the full results of the SMART test, which you can get with

smartctl -a /dev/sda

Only that way can we really determine the state of the drive. FWIW, a few people have had issues with the SATA link power management that is enabled by default in recent kernels. If the drive really does look fine otherwise, this might be something to test: set the policy to max_performance.
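
A minimal sketch of checking and changing that policy at runtime in Python, assuming the usual sysfs location /sys/class/scsi_host/host*/link_power_management_policy. The function names `read_policies` and `set_policy` are made up for illustration, writing needs root, and a value written this way does not survive a reboot.

```python
from pathlib import Path

# Assumed sysfs location of the per-host SATA link power management policy.
SYSFS_ROOT = "/sys/class/scsi_host"
POLICY_FILE = "link_power_management_policy"


def read_policies(sysfs_root=SYSFS_ROOT):
    """Return {host_name: current_policy} for every host that exposes the file."""
    result = {}
    for path in sorted(Path(sysfs_root).glob("host*/" + POLICY_FILE)):
        result[path.parent.name] = path.read_text().strip()
    return result


def set_policy(policy="max_performance", sysfs_root=SYSFS_ROOT):
    """Write the given policy to every host (root is required on a real system)."""
    for path in sorted(Path(sysfs_root).glob("host*/" + POLICY_FILE)):
        path.write_text(policy + "\n")


if __name__ == "__main__":
    # Only reads the current state; call set_policy() as root to change it.
    print(read_policies())
```

If the resets stop after switching to max_performance, that points at link power management as the trigger, though it says nothing about whether the drive itself is healthy.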

#3 2019-05-01 20:55:46

Koatao
Member
Registered: 2018-08-30
Posts: 92

 smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.10-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10SPZX-24Z10T0
Serial Number:    WD-WX31A18L8YEE
LU WWN Device Id: 5 0014ee 6b30e4177
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed May  1 14:26:13 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  360) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  71) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x303d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   195   189   021    Pre-fail  Always       -       1225
  4 Start_Stop_Count        0x0032   090   090   000    Old_age   Always       -       10526
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1088
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1134
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       697
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       131
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       39645
194 Temperature_Celsius     0x0022   116   096   000    Old_age   Always       -       27 (Min/Max 15/47)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       85
206 Flying_Height           0x0022   100   000   000    Old_age   Always       -       36
240 Head_Flying_Hours       0x0032   099   099   000    Old_age   Always       -       887

SMART Error Log Version: 1
ATA Error Count: 775 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 775 occurred at disk power-on lifetime: 1083 hours (45 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 08      00:49:30.448  FLUSH CACHE EXT

Error 774 occurred at disk power-on lifetime: 1082 hours (45 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 00      00:23:56.985  FLUSH CACHE EXT

Error 773 occurred at disk power-on lifetime: 1079 hours (44 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 08      00:18:59.954  FLUSH CACHE EXT

Error 772 occurred at disk power-on lifetime: 1078 hours (44 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 08      00:11:48.884  FLUSH CACHE EXT
  ea 00 00 00 00 00 a0 08      00:11:48.238  FLUSH CACHE EXT
  ea 00 00 00 00 00 a0 08      00:11:48.237  FLUSH CACHE EXT
  ea 00 00 00 00 00 a0 08      00:11:48.237  FLUSH CACHE EXT

Error 771 occurred at disk power-on lifetime: 1074 hours (44 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ea 00 00 00 00 00 a0 08      00:41:31.092  FLUSH CACHE EXT
  ea 00 00 00 00 00 a0 08      00:41:12.624  FLUSH CACHE EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1088         -
# 2  Conveyance offline  Completed without error       00%      1084         -
# 3  Conveyance offline  Completed without error       00%      1084         -
# 4  Short offline       Completed without error       00%      1084         -
# 5  Conveyance offline  Completed without error       00%      1046         -
# 6  Short offline       Completed without error       00%      1046         -
# 7  Conveyance offline  Completed without error       00%       975         -
# 8  Vendor (0x50)       Completed without error       00%       975         -
# 9  Short offline       Completed without error       00%       975         -
#10  Short offline       Completed without error       00%       969         -
#11  Conveyance offline  Completed without error       00%       915         -
#12  Short offline       Completed without error       00%       915         -
#13  Short offline       Completed without error       00%       719         -
#14  Vendor (0x50)       Completed without error       00%        46         -
#15  Short offline       Completed without error       00%        46         -
#16  Vendor (0x50)       Completed without error       00%        44         -
#17  Short offline       Completed without error       00%        44         -
#18  Vendor (0x50)       Completed without error       00%        43         -
#19  Short offline       Completed without error       00%        43         -
#20  Short offline       Completed without error       00%        38         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

#4 2019-05-21 06:22:52

Koatao
Member
Registered: 2018-08-30
Posts: 92

Hi,
In the end, I ran some tests: a linear read on Windows with Lenovo's tools and a non-destructive test with badblocks. I found hundreds of errors (read errors and corruption).
So, it is a dying hard drive after all.

What I fail to understand is how I could have seen this from SMART (see the output above).

Oh, and FWIW, setting max_performance for the link power management stopped the SATA hard resets.

#5 2019-05-21 09:03:07

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,620

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       697

is the biggest indicator in that listing, in addition to the reported errors mentioned earlier (though I'd expect some reallocated sectors and the like as well to really put the final nail in the coffin).
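
A rough sketch of that kind of check in Python, assuming the attribute table has been saved as plain text from smartctl output. The watch list below is only a common rule of thumb, not an official threshold, and the names `WATCHED`, `parse_attributes` and `suspicious` are made up for illustration.

```python
import re

# Attribute IDs whose raw value is usually expected to stay at 0 (an assumption).
WATCHED = {5, 187, 196, 197, 198, 199}

# A few rows copied from the smartctl output earlier in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       697
194 Temperature_Celsius     0x0022   116   096   000    Old_age   Always       -       27 (Min/Max 15/47)
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Parse attribute rows into dicts; header and non-table lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)  # the raw value may itself contain spaces
        if len(parts) == 10 and parts[0].isdigit():
            rows.append({
                "id": int(parts[0]),
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "raw": parts[9].strip(),
            })
    return rows


def suspicious(rows):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    found = []
    for row in rows:
        match = re.match(r"\d+", row["raw"])
        raw = int(match.group()) if match else 0
        if row["id"] in WATCHED and raw > 0:
            found.append((row["id"], row["name"], raw))
    return found


for attr_id, name, raw in suspicious(parse_attributes(SAMPLE)):
    print(f"{attr_id} {name}: raw value {raw}")
```

Note that the overall health line can still say PASSED in this situation, since it only reflects normalized values crossing their thresholds, so the raw counts and the error log are worth reading separately.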

#6 2019-05-27 17:30:22

Koatao
Member
Registered: 2018-08-30
Posts: 92

Ooh okay! That makes sense!!
Lenovo confirmed it was a hardware failure.
Thanks @V1del for your help!
