
#1 2023-03-04 23:49:01

Bumble
Member
Registered: 2016-12-04
Posts: 18

[SOLVED] Frequent disk errors and SMART output not indicating any issues

I'm trying to determine whether my disk is dying. It's 7 years old, so I wouldn't be surprised if it's got some flaws, but the results are confusing.

smartctl shows nothing wrong, but once or twice a day, at random times, I get disk hangs where my iowait spikes. During a spike everything hangs: for example, if I open a new terminal, the cursor sits at the top blinking, the prompt (user and machine name) never shows up, and I can't get any commands through. It can last 10-20 seconds. The journal shows what look like disk errors, but I'm not sure how to interpret them. My SMART output doesn't seem to have counted any errors, and the available self-tests pass.

Is the SMART result unreliable here? I assume the disk might be failing in ways the SMART tests don't cover.
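To put numbers on the iowait spikes described above, here is a minimal Python sketch that reads the aggregate `cpu` line of /proc/stat twice and reports what share of the elapsed CPU time was iowait. The function names are made up for this illustration, and iowait itself is only a rough signal (proc(5) notes that the value is not reliable), so treat this as a way to timestamp hangs for comparison with the journal, not as a diagnosis.

```python
# Sketch: estimate how much CPU time went to iowait between two /proc/stat snapshots.
import time

def cpu_times(stat_text: str) -> list:
    """Return the aggregate 'cpu' counters: user, nice, system, idle, iowait, irq, softirq, steal."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # guest and guest_nice (fields 9-10) are already included in user/nice,
            # so only the first eight counters are kept.
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")

def iowait_fraction(before: str, after: str) -> float:
    """Share of the elapsed CPU time spent in iowait (index 4) between two snapshots."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0

def sample_iowait(interval: float = 1.0) -> float:
    """Take two /proc/stat readings `interval` seconds apart and return the iowait share."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    return iowait_fraction(before, after)
```

Calling `print(f"{sample_iowait():.1%}")` in a loop gives a once-per-second reading; a hang like the one described should show up as a run of high values that can be lined up with the journal timestamps.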

SMART output:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.2.2-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     LITEON IT L8T-256L9G
Serial Number:    002452106954
Firmware Version: H881202
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 4a
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Mar  4 18:32:23 2023 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(   10) seconds.
Offline data collection
capabilities: 			(0x15) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Abort Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	(   1) minutes.
Extended self-test routine
recommended polling time: 	(  10) minutes.
SCT capabilities: 	      (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0003   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0002   100   100   000    Old_age   Always       -       7591
 12 Power_Cycle_Count       0x0003   100   100   000    Pre-fail  Always       -       4437
177 Wear_Leveling_Count     0x0003   100   100   000    Pre-fail  Always       -       207531
178 Used_Rsvd_Blk_Cnt_Chip  0x0003   100   100   000    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0003   100   100   000    Pre-fail  Always       -       0
182 Erase_Fail_Count_Total  0x0003   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0003   100   100   000    Pre-fail  Always       -       0
188 Command_Timeout         0x0003   100   100   000    Pre-fail  Always       -       202
189 Unknown_SSD_Attribute   0x0003   100   100   000    Pre-fail  Always       -       302
191 Unknown_SSD_Attribute   0x0003   100   100   000    Pre-fail  Always       -       0
192 Power-Off_Retract_Count 0x0003   100   100   000    Pre-fail  Always       -       91
196 Reallocated_Event_Count 0x0003   100   100   000    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0003   100   100   000    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0003   100   100   000    Pre-fail  Always       -       2933
232 Available_Reservd_Space 0x0003   100   100   010    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0003   100   100   000    Pre-fail  Always       -       574081
242 Total_LBAs_Read         0x0003   100   100   000    Pre-fail  Always       -       604093

SMART Error Log Version: 0
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7590         -
# 2  Extended offline    Completed without error       00%      6974         -
# 3  Extended offline    Completed without error       00%      6184         -
# 4  Short offline       Completed without error       00%      4750         -
# 5  Extended offline    Completed without error       00%      1478         -
# 6  Short offline       Completed without error       00%      1376         -
# 7  Short offline       Completed without error       00%         0         -
# 8  Short offline       Completed without error       00%         0         -
# 9  Short offline       Completed without error       00%         0         -

Selective Self-tests/Logging not supported
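The PASSED verdict is the drive's own self-assessment, which typically trips only when an attribute's normalized VALUE drops to its THRESH; every attribute above sits at 100 against a threshold of 000 or 010. The raw counters tell a different story: 188 Command_Timeout is 202 and 199 UDMA_CRC_Error_Count is 2933. UDMA CRC errors count transfers corrupted on the SATA link, and are commonly blamed on the cable or connection rather than the flash. Attribute meanings are vendor-specific and smartctl reports this drive as not in its database, so read these loosely. The sketch below pulls those raw values out of `smartctl -A` output; the helper names and the choice of attributes to watch are assumptions for illustration, and the regex assumes the column layout shown above.

```python
# Sketch: pick out non-zero raw counters for link-related attributes in `smartctl -A` output.
import re

# Which attribute IDs to treat as "interface" counters is an assumption for this sketch.
INTERFACE_ATTRS = {188: "Command_Timeout", 199: "UDMA_CRC_Error_Count"}

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, then the raw value.
ROW_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

def parse_attributes(table: str) -> dict:
    """Map attribute ID -> (name, leading integer of RAW_VALUE)."""
    attrs = {}
    for line in table.splitlines():
        m = ROW_RE.match(line)
        if m:  # header and non-attribute lines simply don't match
            attrs[int(m["id"])] = (m["name"], int(m["raw"]))
    return attrs

def nonzero_interface_counters(table: str) -> dict:
    """Raw values of the watched interface attributes that are above zero."""
    return {name: raw for attr_id, (name, raw) in parse_attributes(table).items()
            if attr_id in INTERFACE_ATTRS and raw > 0}
```

Fed the table above, `nonzero_interface_counters` returns both Command_Timeout (202) and UDMA_CRC_Error_Count (2933), while the reallocation and uncorrectable counters all stay at 0.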

The logs vary; they say similar but slightly different things every time.

Example journal entry #1:

Mar 04 17:56:27 box kernel: ata5.00: exception Emask 0x0 SAct 0xca001 SErr 0x50000 action 0x6 frozen
Mar 04 17:56:27 box kernel: ata5: SError: { PHYRdyChg CommWake }
Mar 04 17:56:27 box kernel: ata5.00: failed command: READ FPDMA QUEUED
Mar 04 17:56:27 box kernel: ata5.00: cmd 60/00:00:e0:ac:58/01:00:06:00:00/40 tag 0 ncq dma 131072 in
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 04 17:56:27 box kernel: ata5.00: status: { DRDY }
Mar 04 17:56:27 box kernel: ata5.00: failed command: READ FPDMA QUEUED
Mar 04 17:56:27 box kernel: ata5.00: cmd 60/00:68:e0:ad:58/01:00:06:00:00/40 tag 13 ncq dma 131072 in
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 04 17:56:27 box kernel: ata5.00: status: { DRDY }
Mar 04 17:56:27 box kernel: ata5.00: failed command: READ FPDMA QUEUED
Mar 04 17:56:27 box kernel: ata5.00: cmd 60/08:78:08:50:91/00:00:14:00:00/40 tag 15 ncq dma 4096 in
                                     res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Mar 04 17:56:27 box kernel: ata5.00: status: { DRDY }
Mar 04 17:56:27 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 04 17:56:27 box kernel: ata5.00: cmd 61/10:90:c0:28:e0/00:00:11:00:00/40 tag 18 ncq dma 8192 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 04 17:56:27 box kernel: ata5.00: status: { DRDY }
Mar 04 17:56:27 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Mar 04 17:56:27 box kernel: ata5.00: cmd 61/30:98:d0:28:e0/00:00:11:00:00/40 tag 19 ncq dma 24576 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Mar 04 17:56:27 box kernel: ata5.00: status: { DRDY }
Mar 04 17:56:27 box kernel: ata5: hard resetting link
Mar 04 17:56:27 box kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 04 17:56:27 box kernel: ata5.00: configured for UDMA/133
Mar 04 17:56:27 box kernel: ata5.00: device reported invalid CHS sector 0
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=44s
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current] 
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 06 58 ac e0 00 01 00 00
Mar 04 17:56:27 box kernel: I/O error, dev sda, sector 106474720 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=39s
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#13 Sense Key : Illegal Request [current] 
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#13 Add. Sense: Unaligned write command
Mar 04 17:56:27 box kernel: sd 4:0:0:0: [sda] tag#13 CDB: Read(10) 28 00 06 58 ad e0 00 01 00 00
Mar 04 17:56:27 box kernel: I/O error, dev sda, sector 106474976 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
Mar 04 17:56:27 box kernel: ata5: EH complete
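The `cmd` lines can be decoded to see which sectors were in flight. This sketch assumes libata prints the taskfile as command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, which is how I read its error printout, and that for NCQ (FPDMA QUEUED) commands the sector count sits in the feature registers and the tag in the upper bits of the count register. Decoded that way, tag 0 is a 256-sector (131072-byte) read at LBA 106474720, the same sector as in the `I/O error, dev sda, sector 106474720` line and in the Read(10) CDB, so the I/O error lines look like the aftermath of those timed-out reads rather than a failure at some new location. The function names are hypothetical.

```python
# Sketch: decode a libata "cmd .../.../..." taskfile and a SCSI Read(10)/Write(10) CDB.
import re

FPDMA = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}
SECTOR_SIZE = 512  # logical sector size reported by smartctl for this drive

# Twelve hex bytes separated by '/' and ':' right after "cmd ".
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

def decode_taskfile(line: str) -> dict:
    m = CMD_RE.search(line)
    if not m:
        raise ValueError("no libata 'cmd' taskfile in line")
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = (
        int(b, 16) for b in re.split(r"[/:]", m.group(1)))
    # 48-bit LBA: the "hob" (high order byte) registers hold bits 24-47.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    info = {"command": FPDMA.get(command, hex(command)), "lba": lba}
    if command in FPDMA:
        # NCQ commands: sector count in the feature registers (0 means 65536),
        # NCQ tag in bits 7:3 of the count register.
        sectors = (hob_feature << 8 | feature) or 65536
        info.update(sectors=sectors, bytes=sectors * SECTOR_SIZE, tag=nsect >> 3)
    return info

def decode_cdb10(cdb_hex: str) -> dict:
    """Read(10) is opcode 0x28, Write(10) is 0x2a; bytes 2-5 are the LBA, 7-8 the length."""
    cdb = bytes.fromhex(cdb_hex)
    return {"opcode": hex(cdb[0]),
            "lba": int.from_bytes(cdb[2:6], "big"),
            "sectors": int.from_bytes(cdb[7:9], "big")}
```

The same decoding gives tag 13 at LBA 106474976, matching the second I/O error line, and the `tag 18` write as 16 sectors (8192 bytes), matching the `dma 8192 out` the kernel printed.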

Example journal entry #2:

Feb 28 21:01:33 box kernel: ata5.00: exception Emask 0x0 SAct 0x7fcc0070 SErr 0x40000 action 0x6 frozen
Feb 28 21:01:33 box kernel: ata5: SError: { CommWake }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/10:20:10:c5:ef/00:00:11:00:00/40 tag 4 ncq dma 8192 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/18:28:30:c5:ef/00:00:11:00:00/40 tag 5 ncq dma 12288 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:30:20:c5:ef/00:00:11:00:00/40 tag 6 ncq dma 4096 out
                                     res 40/00:01:97:10:a9/00:00:02:00:00/40 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/10:90:f0:78:19/00:00:02:00:00/40 tag 18 ncq dma 8192 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:98:90:10:a9/00:00:02:00:00/40 tag 19 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:b0:a8:50:11/00:00:06:00:00/40 tag 22 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:b8:80:50:11/00:00:06:00:00/40 tag 23 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:c0:30:50:11/00:00:06:00:00/40 tag 24 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:c8:78:50:51/00:00:1a:00:00/40 tag 25 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:d0:38:50:91/00:00:15:00:00/40 tag 26 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:d8:78:50:51/00:00:15:00:00/40 tag 27 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:e0:c0:6f:95/00:00:13:00:00/40 tag 28 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:e8:c0:47:94/00:00:13:00:00/40 tag 29 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Feb 28 21:01:33 box kernel: ata5.00: cmd 61/08:f0:28:c5:ef/00:00:11:00:00/40 tag 30 ncq dma 4096 out
                                     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 28 21:01:33 box kernel: ata5.00: status: { DRDY }
Feb 28 21:01:33 box kernel: ata5: hard resetting link
Feb 28 21:01:33 box kernel: ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 28 21:01:33 box kernel: ata5.00: configured for UDMA/133
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5.00: device reported invalid CHS sector 0
Feb 28 21:01:33 box kernel: ata5: EH complete
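Both entries share a pattern that is easy to check mechanically: the SAct field on the `exception` line has one bit per NCQ tag the drive still had outstanding, and every set bit has a matching `failed command` line. In other words, all queued commands timed out together before libata hard-reset the link, which is at least consistent with the whole link stalling rather than one bad block. The sketch below decodes SAct into tag numbers for comparison; the helper names are hypothetical.

```python
# Sketch: compare the NCQ tags flagged in SAct with the tags of the failed commands.
import re

def outstanding_tags(exception_line: str) -> list:
    m = re.search(r"SAct (0x[0-9a-fA-F]+)", exception_line)
    if not m:
        raise ValueError("no SAct field in line")
    sact = int(m.group(1), 16)
    # Bit n of SAct set => NCQ tag n was still outstanding when the error hit.
    return [tag for tag in range(32) if sact >> tag & 1]

def failed_tags(log: str) -> list:
    # Tags from the "cmd ... tag N ncq ..." lines printed under "failed command".
    return [int(t) for t in re.findall(r"\btag (\d+) ncq", log)]
```

For entry #1, SAct 0xca001 gives tags 0, 13, 15, 18 and 19; for entry #2, 0x7fcc0070 gives 4, 5, 6, 18, 19 and 22 through 30. Both lists match the tags of the failed commands in the corresponding entry exactly.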

Last edited by Bumble (2023-03-09 02:57:21)


#2 2023-03-04 23:56:22

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,442


Looks a lot like a cable issue to me - though I'm very far from an expert on such things (ewaller may chime in).

Open it up and check the cables - or just unplug/replug the drive cables at each end.

I don't know how SMART works under the hood, but it's reasonable to suspect that it ignores link errors seen at the OS level, where commands never reach the disk, since it may focus only on how well the disk handles the data that does get to it.

Last edited by Trilby (2023-03-04 23:59:16)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman


#3 2023-03-05 01:16:23

Bumble
Member
Registered: 2016-12-04
Posts: 18


That wouldn't be so bad; thanks for your input. I'll hold off a bit to see if anyone else has more to add. I've never opened up a laptop before, but I suppose I might need to try soon.


#4 2023-03-05 01:45:59

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,442


What kind of laptop is it? If it's an IBM or Lenovo, go for it: they're basically industrial-strength Lego kits. If it's an Apple MacBook, don't even consider it; you'd sooner be able to put an egg back together after cooking the omelet. Most other brands are somewhere between these two extremes.

Last edited by Trilby (2023-03-05 01:47:09)




#5 2023-03-05 02:32:46

Bumble
Member
Registered: 2016-12-04
Posts: 18


It's an Acer. I've found videos and instructions from people taking apart this exact model, so it shouldn't be too bad.


#6 2023-03-05 07:49:20

seth
Member
Registered: 2012-09-03
Posts: 49,979


Mar 04 17:56:27 box kernel: ata5: SError: { PHYRdyChg CommWake }

https://wiki.archlinux.org/title/Solid_ … ted_errors ?
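The SError line quoted here can be decoded by hand: entry #1's SErr 0x50000 sets bits 16 and 18, which libata names PHYRdyChg (the PHY's ready state changed) and CommWake (a COMWAKE signal was seen), while entry #2's 0x40000 is CommWake alone. COMWAKE is part of the out-of-band signaling a SATA link uses during bring-up and when waking from its Partial/Slumber power states, so these flags describe the link dropping and re-establishing itself, which fits a link power management cause. The bit table below is my reading of the SError layout used by the kernel (include/linux/ata.h); treat it as an assumption to check against your kernel source.

```python
# Sketch: turn an SError value into the flag names libata prints in "SError: { ... }".
# Bit positions follow the SATA SError register; names follow libata's log output.
SERROR_FLAGS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}

def decode_serror(value: int) -> list:
    """Return flag names for each set bit, lowest bit first."""
    return [name for bit, name in sorted(SERROR_FLAGS.items()) if value >> bit & 1]
```

`decode_serror(0x50000)` reproduces the `{ PHYRdyChg CommWake }` the kernel printed for entry #1, and `decode_serror(0x40000)` the `{ CommWake }` of entry #2.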


#7 2023-03-05 21:11:17

Bumble
Member
Registered: 2016-12-04
Posts: 18


I wasn't using TLP at all, but I've now enabled max_performance, so I suppose I'll know in a day or two.
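For a quick test without TLP, the same policy can be written straight to sysfs (as root); a setting made this way lasts only until reboot, and in this thread the persistent setting came from enabling tlp.service. This is a sketch: the function name and defaults are illustrative, and `base` exists only so it can be pointed at a scratch directory.

```python
# Sketch: write a link power management policy to every SCSI host (runtime only, needs root).
from pathlib import Path

def set_lpm_policy(policy: str = "max_performance",
                   base: Path = Path("/sys/class/scsi_host")) -> list:
    """Write `policy` to each hostN/link_power_management_policy and return the paths touched."""
    touched = []
    for attr in sorted(base.glob("host*/link_power_management_policy")):
        attr.write_text(policy)  # same effect as: echo <policy> > <attr>
        touched.append(attr)
    return touched
```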


#8 2023-03-05 21:18:53

seth
Member
Registered: 2012-09-03
Posts: 49,979


Did you enable ALPM by other means (powertop, laptop-mode-tools, etc)?
What was the value of /sys/class/scsi_host/host*/link_power_management_policy ?
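The second question can be answered with a plain `cat` of those files. The sketch below does the same in Python and also lists any hosts that aren't at a given policy, which is handy for confirming that a change took effect on every host. The helper names are hypothetical.

```python
# Sketch: read link_power_management_policy for every SCSI host,
# like `cat /sys/class/scsi_host/host*/link_power_management_policy`.
from pathlib import Path

def read_lpm_policies(base: Path = Path("/sys/class/scsi_host")) -> dict:
    """Map each hostN name to its current policy string."""
    return {attr.parent.name: attr.read_text().strip()
            for attr in sorted(base.glob("host*/link_power_management_policy"))}

def hosts_not_at(policies: dict, wanted: str = "max_performance") -> list:
    """Hosts whose policy differs from `wanted`."""
    return [host for host, policy in policies.items() if policy != wanted]
```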


#9 2023-03-05 23:03:09

Bumble
Member
Registered: 2016-12-04
Posts: 18


Without TLP it is:

med_power_with_dipm
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm
med_power_with_dipm

After starting tlp.service (which is what I'm currently trying) it is:

max_performance
max_performance
max_performance
max_performance
max_performance
max_performance

Edit: I somehow missed your first question. No, I'm not aware of anything that would have enabled it; this is a device I keep plugged in, so I don't need power-saving controls.

Edit 2: I haven't run into the problem again, so I'm marking this as solved. This seems to have done the trick, thank you.

Last edited by Bumble (2023-03-09 02:56:21)



Powered by FluxBB