Hello,
I have 3 SSD drives, and over the course of about 2 weeks in the past month I have seen many COMRESET failures and other issues with the ata1 / sda drive. Only with THIS drive. The other drives reported no errors.
The smartctl output doesn't show any significant errors (see below).
Would you rather think the drive is toast, or is there a possibility of a motherboard, SATA controller or cable issue here?
I have ordered a replacement drive and will of course check next week whether the issue continues afterwards. If the disk is faulty, it's still under warranty (3 years, and SSD life left is still 99%)...
Here is an extract from journalctl:
oct. 25 19:00:54 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x30000000 SErr 0x0 action 0x6 frozen
oct. 25 19:01:23 nordi kernel: ata1.00: failed command: READ FPDMA QUEUED
oct. 25 19:01:23 nordi kernel: ata1.00: cmd 60/f8:e0:40:c2:2f/00:00:40:00:00/40 tag 28 ncq dma 126976 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 25 19:01:23 nordi kernel: ata1.00: status: { DRDY }
oct. 25 19:01:23 nordi kernel: ata1.00: failed command: READ FPDMA QUEUED
oct. 25 19:01:23 nordi kernel: ata1.00: cmd 60/00:e8:38:c3:2f/01:00:40:00:00/40 tag 29 ncq dma 131072 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 25 19:01:23 nordi kernel: ata1.00: status: { DRDY }
oct. 25 19:01:23 nordi kernel: ata1: hard resetting link
oct. 25 19:01:23 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 25 19:01:23 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 25 19:01:23 nordi kernel: ata1: hard resetting link
oct. 25 19:01:23 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 25 19:01:23 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 25 19:01:23 nordi kernel: ata1: hard resetting link
oct. 25 19:01:23 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 25 19:01:23 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
oct. 25 19:01:23 nordi kernel: ata1.00: configured for UDMA/133
oct. 25 19:01:23 nordi kernel: ata1: EH complete
oct. 30 16:05:29 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
oct. 30 16:05:42 nordi kernel: ata1.00: failed command: FLUSH CACHE EXT
oct. 30 16:05:42 nordi kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 15
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 30 16:05:42 nordi kernel: ata1.00: status: { DRDY }
oct. 30 16:05:42 nordi kernel: ata1: hard resetting link
oct. 30 16:05:42 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 30 16:05:42 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 30 16:05:42 nordi kernel: ata1: hard resetting link
oct. 30 16:05:42 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
oct. 30 16:05:42 nordi kernel: ata1.00: configured for UDMA/133
oct. 30 16:05:42 nordi kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
oct. 30 16:05:42 nordi kernel: ata1: EH complete
oct. 30 18:29:27 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
oct. 30 18:29:37 nordi kernel: ata1.00: failed command: FLUSH CACHE EXT
oct. 30 18:29:37 nordi kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 23
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 30 18:29:37 nordi kernel: ata1.00: status: { DRDY }
oct. 30 18:29:37 nordi kernel: ata1: hard resetting link
oct. 30 18:29:37 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 30 18:29:37 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 30 18:29:37 nordi kernel: ata1: hard resetting link
oct. 30 18:29:37 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
oct. 30 18:29:37 nordi kernel: ata1.00: configured for UDMA/133
oct. 30 18:29:37 nordi kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
oct. 30 18:29:37 nordi kernel: ata1: EH complete
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: hard resetting link
nov. 04 16:28:53 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: hard resetting link
nov. 04 16:28:53 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: limiting SATA link speed to 3.0 Gbps
nov. 04 16:28:53 nordi kernel: ata1: hard resetting link
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: reset failed, giving up
nov. 04 16:28:53 nordi kernel: ata1.00: disabled
nov. 04 16:28:53 nordi kernel: ata1: EH complete
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=88s
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#28 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
nov. 04 16:28:53 nordi kernel: blk_update_request: I/O error, dev sda, sector 2064 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
nov. 04 16:28:53 nordi kernel: md: super_written gets error=-5
nov. 04 16:28:53 nordi kernel: md/raid1:md127: Disk failure on sda1, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#29 CDB: Read(10) 28 00 1e 8d 94 00 00 05 80 00
nov. 04 16:28:53 nordi kernel: blk_update_request: I/O error, dev sda, sector 512594944 op 0x0:(READ) flags 0x0 phys_seg 29 prio class 0
nov. 04 16:28:53 nordi kernel: md: md127: data-check interrupted.
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#30 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#30 CDB: Read(10) 28 00 1e 8d 8a 00 00 0a 00 00
nov. 04 16:28:53 nordi kernel: blk_update_request: I/O error, dev sda, sector 512592384 op 0x0:(READ) flags 0x4000 phys_seg 138 prio class 0
nov. 05 18:39:31 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x6 frozen
nov. 05 18:39:31 nordi kernel: ata1.00: failed command: READ FPDMA QUEUED
nov. 05 18:39:31 nordi kernel: ata1.00: cmd 60/20:c0:40:7f:4a/00:00:2e:00:00/40 tag 24 ncq dma 16384 in
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
nov. 05 18:39:31 nordi kernel: ata1.00: status: { DRDY }
nov. 05 18:39:31 nordi kernel: ata1: hard resetting link
nov. 05 18:39:31 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 05 18:39:31 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 05 18:39:31 nordi kernel: ata1: hard resetting link
nov. 05 18:39:31 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 05 18:39:31 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 05 18:39:31 nordi kernel: ata1: hard resetting link
nov. 05 18:39:31 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 05 18:39:31 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
nov. 05 18:39:31 nordi kernel: ata1.00: configured for UDMA/133
nov. 05 18:39:31 nordi kernel: ata1: EH complete
The smartctl output (the disk apparently doesn't support self-tests):
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5929
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1084
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       8
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/19
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       36 (Average 10)
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   076   054   000    Old_age   Always       -       24 (Min/Max 17/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       1
231 SSD_Life_Left           0x0000   001   001   000    Old_age   Offline      -       99
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       8958
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       8592
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       9277
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       10
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       36
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       906624
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Self-test functions not supported
Offline
The leading errors (as posted) are on the link, and the SMART data looks unsuspicious.
=> Broken or loose cable (check both ends and also the power supply), or maybe the bus on the board.
Since you have more drives, I'd try to flip cables (on the drive side, so the "bad" one is connected to where a "good" one previously was) - if the error is then completely gone, switch back. If it's still gone, the cable was just loose.
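To judge whether a cable flip changed anything, the link errors can be tallied per ATA port before and after. A minimal sketch, assuming the kernel log has been saved as text in the same format as the extract above; the function name and the event labels are made up for illustration:

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... kernel: ata1: COMRESET failed (errno=-16)"
#   "... kernel: ata1.00: failed command: READ FPDMA QUEUED"
# and captures the port ("ata1") and the message text.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Label -> message prefix, taken from the messages in the posted log.
EVENT_PREFIXES = {
    "comreset_failed": "COMRESET failed",
    "hard_reset": "hard resetting link",
    "failed_command": "failed command:",
    "speed_limited": "limiting SATA link speed",
    "disabled": "disabled",
}


def tally_ata_events(lines):
    """Count link-related kernel messages per ATA port."""
    counts = {}
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for label, prefix in EVENT_PREFIXES.items():
            if message.startswith(prefix):
                counts.setdefault(port, Counter())[label] += 1
    return counts
```

Feeding it the lines of a saved log (for example `tally_ata_events(open("kernel.log"))`, file name hypothetical) gives one counter per port; if the errors follow the drive to a new port after the flip, that points at the drive or its connector rather than the cable or board.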
Offline
Thank you. I will do that. I was also suspecting the internal drive controller (not the memory), but at this point, this could be anywhere.
BTW, I had tried to disconnect/reconnect the cable, but the issue continued.
Last edited by ivanoff (2021-11-07 21:29:12)
Offline
If flipping the connection doesn't work, it's either the jack on the disk or the controller - in the latter case you also can't trust the SMART data anymore.
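For reference, the SMART counters that bear on a link or media problem can be pulled out of the attribute table mechanically. A rough sketch, assuming the table has the ten-column layout of the smartctl output posted above; which attributes to watch is my own selection from that table, not a vendor specification, and the function name is hypothetical:

```python
# Attribute names as they appear in the posted table (assumed selection).
WATCHED = {
    "SATA_Phy_Error_Count",
    "SATA_CRC_Error_Count",
    "CRC_Error_Count",
    "Reported_Uncorrect",
    "Program_Fail_Count",
    "Erase_Fail_Count",
    "Reallocated_Event_Count",
}


def nonzero_counters(smartctl_lines):
    """Return {(id, name): raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_lines:
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) != 0:
            found[(attr_id, name)] = int(raw)
    return found
```

On the table posted above, the only watched attribute with a non-zero raw value is 218 CRC_Error_Count, at 1.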
Offline
Well, I did all sorts of tests, swapping disks and cables, and the issue didn't come back. This may have been just the cable.
I am now testing the disks in turn with badblocks. Two have already proved to be 100% A-OK; I will test the third one soon (recovery and data transfer take time).
So all in all, it seems fine, except maybe for a check I ran on the array, which showed a large mismatch_cnt (300000), but I understand this can happen on a RAID1? The data doesn't seem corrupted (all my files are there), but it is always difficult to know for sure.
So this mismatch count is kind of weird. I prefer when things go wrong openly, and not silently.
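To put the number in proportion: as far as I understand the kernel md documentation, mismatch_cnt is counted in sectors, so it can be turned into a rough upper bound on the amount of data involved. A minimal sketch, assuming 512-byte sectors and the array name md127 from the log above; both helper names are made up for illustration:

```python
from pathlib import Path

SECTOR_BYTES = 512  # assumption: md reports mismatch_cnt in 512-byte sectors


def read_mismatch_cnt(md_name="md127", sysfs_root="/sys/block"):
    """Read the mismatch counter of an md array from sysfs."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def mismatch_mib(sectors):
    """Convert a mismatch count in sectors into MiB (upper bound)."""
    return sectors * SECTOR_BYTES / (1024 * 1024)
```

For a count of 300000 this comes to about 146 MiB at most; the same documentation notes that the real number of differing sectors may be smaller, because most RAID levels work in pages rather than single sectors.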
Offline
If fumbling w/ the cables "fixed" it, it was a loose cable or, worst case, a cable is broken and stutters under tension.
(Again: the error points to the link, and the SMART data was unsuspicious.)
There's no way to determine that ex post (or, realistically, ever)
Offline