SSD or MB/Controller/cable issue ?

ivanoff · 2021-11-07 12:27:41

Hello,
I have 3 SSD drives, and in the past month I have seen many COMRESET failures and other errors on the ata1 / sda drive, over the course of about 2 weeks. Only with THIS drive. The other drives reported no errors.
smartctl doesn't show any significant errors (see below).

Do you think the drive is toast, or is there a possibility of a motherboard, SATA controller or cable issue here?

I have ordered a replacement drive and will of course check over the next week whether the issue continues afterwards. If the disk is faulty, it's still under warranty (3 years, since SSD life left is still 99%)...

Here is an extract from journalctl:

oct. 25 19:00:54 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x30000000 SErr 0x0 action 0x6 frozen
oct. 25 19:01:23 nordi kernel: ata1.00: failed command: READ FPDMA QUEUED
oct. 25 19:01:23 nordi kernel: ata1.00: cmd 60/f8:e0:40:c2:2f/00:00:40:00:00/40 tag 28 ncq dma 126976 in
                                        res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 25 19:01:23 nordi kernel: ata1.00: status: { DRDY }
oct. 25 19:01:23 nordi kernel: ata1.00: failed command: READ FPDMA QUEUED
oct. 25 19:01:23 nordi kernel: ata1.00: cmd 60/00:e8:38:c3:2f/01:00:40:00:00/40 tag 29 ncq dma 131072 in
                                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 25 19:01:23 nordi kernel: ata1.00: status: { DRDY }
oct. 25 19:01:23 nordi kernel: ata1: hard resetting link
oct. 25 19:01:23 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 25 19:01:23 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 25 19:01:23 nordi kernel: ata1: hard resetting link
oct. 25 19:01:23 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 25 19:01:23 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 25 19:01:23 nordi kernel: ata1: hard resetting link
oct. 25 19:01:23 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 25 19:01:23 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
oct. 25 19:01:23 nordi kernel: ata1.00: configured for UDMA/133
oct. 25 19:01:23 nordi kernel: ata1: EH complete

oct. 30 16:05:29 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
oct. 30 16:05:42 nordi kernel: ata1.00: failed command: FLUSH CACHE EXT
oct. 30 16:05:42 nordi kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 15
                                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 30 16:05:42 nordi kernel: ata1.00: status: { DRDY }
oct. 30 16:05:42 nordi kernel: ata1: hard resetting link
oct. 30 16:05:42 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 30 16:05:42 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 30 16:05:42 nordi kernel: ata1: hard resetting link
oct. 30 16:05:42 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
oct. 30 16:05:42 nordi kernel: ata1.00: configured for UDMA/133
oct. 30 16:05:42 nordi kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
oct. 30 16:05:42 nordi kernel: ata1: EH complete
oct. 30 18:29:27 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
oct. 30 18:29:37 nordi kernel: ata1.00: failed command: FLUSH CACHE EXT
oct. 30 18:29:37 nordi kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 23
                                        res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
oct. 30 18:29:37 nordi kernel: ata1.00: status: { DRDY }
oct. 30 18:29:37 nordi kernel: ata1: hard resetting link
oct. 30 18:29:37 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
oct. 30 18:29:37 nordi kernel: ata1: COMRESET failed (errno=-16)
oct. 30 18:29:37 nordi kernel: ata1: hard resetting link
oct. 30 18:29:37 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
oct. 30 18:29:37 nordi kernel: ata1.00: configured for UDMA/133
oct. 30 18:29:37 nordi kernel: ata1.00: retrying FLUSH 0xea Emask 0x4
oct. 30 18:29:37 nordi kernel: ata1: EH complete
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: hard resetting link
nov. 04 16:28:53 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: hard resetting link
nov. 04 16:28:53 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: limiting SATA link speed to 3.0 Gbps
nov. 04 16:28:53 nordi kernel: ata1: hard resetting link
nov. 04 16:28:53 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 04 16:28:53 nordi kernel: ata1: reset failed, giving up
nov. 04 16:28:53 nordi kernel: ata1.00: disabled
nov. 04 16:28:53 nordi kernel: ata1: EH complete
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#28 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=88s
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#28 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
nov. 04 16:28:53 nordi kernel: blk_update_request: I/O error, dev sda, sector 2064 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
nov. 04 16:28:53 nordi kernel: md: super_written gets error=-5
nov. 04 16:28:53 nordi kernel: md/raid1:md127: Disk failure on sda1, disabling device.
                               md/raid1:md127: Operation continuing on 1 devices.
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#29 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#29 CDB: Read(10) 28 00 1e 8d 94 00 00 05 80 00
nov. 04 16:28:53 nordi kernel: blk_update_request: I/O error, dev sda, sector 512594944 op 0x0:(READ) flags 0x0 phys_seg 29 prio class 0
nov. 04 16:28:53 nordi kernel: md: md127: data-check interrupted.
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#30 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=90s
nov. 04 16:28:53 nordi kernel: sd 0:0:0:0: [sda] tag#30 CDB: Read(10) 28 00 1e 8d 8a 00 00 0a 00 00
nov. 04 16:28:53 nordi kernel: blk_update_request: I/O error, dev sda, sector 512592384 op 0x0:(READ) flags 0x4000 phys_seg 138 prio class 0
nov. 05 18:39:31 nordi kernel: ata1.00: exception Emask 0x0 SAct 0x1000000 SErr 0x0 action 0x6 frozen
nov. 05 18:39:31 nordi kernel: ata1.00: failed command: READ FPDMA QUEUED
nov. 05 18:39:31 nordi kernel: ata1.00: cmd 60/20:c0:40:7f:4a/00:00:2e:00:00/40 tag 24 ncq dma 16384 in
                                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
nov. 05 18:39:31 nordi kernel: ata1.00: status: { DRDY }
nov. 05 18:39:31 nordi kernel: ata1: hard resetting link
nov. 05 18:39:31 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 05 18:39:31 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 05 18:39:31 nordi kernel: ata1: hard resetting link
nov. 05 18:39:31 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 05 18:39:31 nordi kernel: ata1: COMRESET failed (errno=-16)
nov. 05 18:39:31 nordi kernel: ata1: hard resetting link
nov. 05 18:39:31 nordi kernel: ata1: link is slow to respond, please be patient (ready=0)
nov. 05 18:39:31 nordi kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
nov. 05 18:39:31 nordi kernel: ata1.00: configured for UDMA/133
nov. 05 18:39:31 nordi kernel: ata1: EH complete

The smartctl output (the disk apparently doesn't support self-tests):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5929
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1084
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Write_Protect_Mode      0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
169 Bad_Block_Rate          0x0000   100   100   000    Old_age   Offline      -       8
170 Bad_Blk_Ct_Erl/Lat      0x0000   100   100   010    Old_age   Offline      -       0/19
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct          0x0000   100   100   000    Old_age   Offline      -       36 (Average 10)
181 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       27
194 Temperature_Celsius     0x0022   076   054   000    Old_age   Always       -       24 (Min/Max 17/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 SATA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count         0x0032   100   100   000    Old_age   Always       -       1
231 SSD_Life_Left           0x0000   001   001   000    Old_age   Offline      -       99
233 Flash_Writes_GiB        0x0032   100   100   000    Old_age   Always       -       8958
241 Lifetime_Writes_GiB     0x0032   100   100   000    Old_age   Always       -       8592
242 Lifetime_Reads_GiB      0x0032   100   100   000    Old_age   Always       -       9277
244 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       10
245 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       36
246 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       906624

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Self-test functions not supported

seth · 2021-11-07 18:59:16

The leading errors (as posted) are on the link, and the SMART data looks unsuspicious.
=> Broken or loose cable (check both ends and also the power supply) or maybe the bus on the board.
Since you have more drives, I'd try swapping cables (at the drive end, so the "bad" drive is connected to where a "good" one previously was). If the error is then completely gone, switch back. If it's still gone, the cable was just loose.
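
To compare the situation before and after swapping cables, it can help to count link errors per ATA port in the kernel log. The following is only a sketch: the function name and the list of "link trouble" substrings are assumptions chosen to match the messages quoted in this thread.

```python
import re
from collections import Counter

# Matches kernel lines such as "... kernel: ata1: COMRESET failed (errno=-16)"
# or "... kernel: ata1.00: failed command: READ FPDMA QUEUED".
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Substrings treated as link trouble (an assumption; extend as needed).
LINK_ERROR_MARKERS = ("COMRESET failed", "failed command", "hard resetting link")


def count_link_errors(lines):
    """Return a Counter mapping ATA port name to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line.rstrip())
        if match and any(m in match.group(2) for m in LINK_ERROR_MARKERS):
            counts[match.group(1)] += 1
    return counts
```

The input could be the lines of `journalctl -k --no-pager` saved to a file. If the errors follow the drive to the new port, the drive or its connector is the suspect; if they stay on ata1, the port or cable is.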

ivanoff · 2021-11-07 21:27:41

Thank you. I will do that. I was also suspecting the internal drive controller (not the memory), but at this point, this could be anywhere.
BTW, I had tried to disconnect/reconnect the cable, but the issue continued.

Last edited by ivanoff (2021-11-07 21:29:12)

seth · 2021-11-08 06:51:10

If flipping the connection doesn't work, it's either the jack on the disk or the controller - in the latter case you also can't trust the SMART data anymore.

ivanoff · 2021-11-13 20:53:23

Well, I did all sorts of tests, swapping disks and cables, but the issue didn't come back. This may have been just the cable.
I am now testing the disks in turn with badblocks. Two have already proved to be 100% A-OK, and I will test the third one soon (recovery and data transfer take time).

So all in all, it seems fine, except maybe for a check I ran on the array, which showed a high mismatch_cnt (300000), but I understand this can happen on a RAID1? The data doesn't seem corrupted (all my files are there), but it is always difficult to know for sure.

So this mismatch count is kind of weird. I prefer when things go wrong openly, and not silently.
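
For reference, a minimal sketch of reading that counter and putting the number in perspective. It assumes the array is md127 (as in the log above) and that the count is in 512-byte sectors, which is how the kernel md documentation describes mismatch_cnt; the helper names are made up for illustration.

```python
from pathlib import Path

# Assumption: mismatch_cnt is reported in 512-byte sectors.
SECTOR_BYTES = 512


def read_mismatch_cnt(array="md127", sysfs_root="/sys/block"):
    """Read the mismatch counter left by the last check/repair run."""
    path = Path(sysfs_root, array, "md", "mismatch_cnt")
    return int(path.read_text().strip())


def mismatch_mib(sectors):
    """Convert a sector count into MiB."""
    return sectors * SECTOR_BYTES / (1024 * 1024)


# 300000 sectors is roughly 146.5 MiB at most.
print(f"{mismatch_mib(300000):.1f} MiB")
```

The counter says how much differed between the mirrors, not which copy was the correct one, so it cannot confirm on its own whether any file is damaged.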

seth · 2021-11-13 21:26:28

If fumbling w/ the cables "fixed" it, it was a loose cable or, worst case, a cable is broken and stutters under tension.
(Again: the errors point to the link and the SMART data was unsuspicious)

There's no way to determine that ex post (or, realistically, ever)
