#1 2014-07-23 12:30:07

kozaki
Member
From: London >. < Paris
Registered: 2005-06-13
Posts: 671

SSD freezes for a few days then returns to normal and smart OK

Hello

<EDIT>
The freezes stopped as suddenly as they had started. The logs show that all SMART self-tests have passed for the last three days:
# journalctl |grep self-test

juil. 26 05:05:25 llewellyn smartd[442]: Device: /dev/sdb [SAT], previous self-test completed without error
juil. 28 02:41:32 llewellyn smartd[459]: Device: /dev/sda [SAT], previous self-test completed without error [x3]
juil. 29 02:41:32 llewellyn smartd[459]: Device: /dev/sda [SAT], previous self-test completed without error [x3]

Would you mind sharing your interpretation of this behavior? Call me dumb, but was this SSD getting tired of fasting (Ramadan ended yesterday), or does it just have a strong personality... :-o ?
In the meantime I have increased the backup frequency.
</EDIT>

My system started to freeze from time to time, with the error log below, but smartctl says the drive is fine. The same thing already happened six months ago, then the problem vanished. How can I effectively check an SSD's status?!
This workstation has two SATA hard drives and a 15-month-old SanDisk SDSSDX120GG25 SSD.
Kernel~3.15.5-2-ck x86_64 Up~6 days Mem~1533.7/3770.6MB HDD~940.2GB(51.1% used).

1) smartctl (and hdsentinel) say the drive is fine:

# smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.15.5-2-ck] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     SanDisk SDSSDX120GG25
Serial Number:    120645400154
LU WWN Device Id: 5 001b44 7229bb25a
Firmware Version: R211
User Capacity:    120 034 123 776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 23 11:23:05 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.

# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.15.5-2-ck] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   110   050    Pre-fail  Always       -       0/29297014
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   088   088   000    Old_age   Always       -       11290h+19m+12.600s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       180
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       66
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       5
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   032   035   000    Old_age   Always       -       32 (Min/Max 12/35)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/29297014
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/29297014
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/29297014
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       6048
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       11062
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       11062
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       3152

2) For the last 4 days the system has been freezing (for about 40 seconds each time) at what appear to be random moments:

$ dmesg
[511666.153499] ata1: EH complete
[511666.153512] EXT4-fs (dm-3): discard request in group:1055 block:7424 count:65 failed with -5
[511827.786342] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x50000 action 0x6 frozen
[511827.786347] ata1: SError: { PHYRdyChg CommWake }
[511827.786350] ata1.00: failed command: DATA SET MANAGEMENT
[511827.786354] ata1.00: cmd 06/01:01:00:00:00/00:00:00:00:00/a0 tag 4 dma 512 out
		 res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[511827.786357] ata1.00: status: { DRDY }
[511827.786360] ata1: hard resetting link
[511828.091280] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[511828.102562] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[511828.102568] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[511828.102571] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[511828.122516] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[511828.122521] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
[511828.122525] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
[511828.132926] ata1.00: configured for UDMA/133
[511828.133329] ata1.00: device reported invalid CHS sector 0
[511828.133341] sd 0:0:0:0: [sda]
[511828.133342] Result: hostbyte=0x00 driverbyte=0x08
[511828.133344] sd 0:0:0:0: [sda]
[511828.133345] Sense Key : 0xb [current] [descriptor]
[511828.133348] Descriptor sense data with sense descriptors (in hex):
[511828.133349]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[511828.133356]         00 00 00 00
[511828.133359] sd 0:0:0:0: [sda]
[511828.133360] ASC=0x0 ASCQ=0x0
[511828.133362] sd 0:0:0:0: [sda] CDB:
[511828.133363] cdb[0]=0x93: 93 08 00 00 00 00 04 2e 24 c2 00 00 00 40 00 00
[511828.133372] end_request: I/O error, dev sda, sector 70132930
[511828.133389] EXT4-fs (dm-3): discard request in group:1055 block:3680 count:32 failed with -5
[511828.133390] ata1: EH complete
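
The CDB logged just before the I/O error can be decoded by hand to check that the failing request really is a discard. A minimal sketch, assuming the standard SCSI WRITE SAME(16) layout (opcode 0x93, UNMAP flag in bit 3 of byte 1, 8-byte starting LBA in bytes 2-9, 4-byte block count in bytes 10-13). The function name decode_cdb16 is made up for illustration:

```shell
#!/bin/sh
# decode_cdb16: hypothetical helper. Decodes a 16-byte CDB given as
# space-separated hex bytes, as the kernel prints them after "cdb[0]=0x93:".
decode_cdb16() {
    # Intentional word splitting: one positional parameter per byte.
    set -- $1
    opcode=$1
    flags=$2
    # Bytes 2-9 (big-endian) hold the starting LBA.
    lba=$(printf '%d' "0x$3$4$5$6$7$8$9${10}")
    # Bytes 10-13 hold the number of logical blocks.
    count=$(printf '%d' "0x${11}${12}${13}${14}")
    # Bit 3 (0x08) of byte 1 is the UNMAP flag (assumed WRITE SAME(16) layout).
    if [ $(( 0x$flags & 0x08 )) -ne 0 ]; then unmap=yes; else unmap=no; fi
    echo "opcode=0x$opcode unmap=$unmap lba=$lba blocks=$count"
}

# The CDB from the dmesg excerpt above.
decode_cdb16 "93 08 00 00 00 00 04 2e 24 c2 00 00 00 40 00 00"
```

If I read the layout right, the decoded LBA is the same number as in the "end_request: I/O error, dev sda, sector 70132930" line, and the UNMAP flag is set, which fits with the failed DATA SET MANAGEMENT (TRIM) command rather than an ordinary read or write.
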
# journalctl -xb | grep sda
juil. 17 11:25:29 llewellyn smartd[451]: Device: /dev/sda [SAT], SanDisk SDSSDX120GG25, S/N:120645400154, WWN:5-001b44-7229bb25a, FW:R211, 120 GB
(...)
juil. 18 00:25:31 llewellyn smartd[451]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 109
(...)
juil. 19 02:25:31 llewellyn smartd[451]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.
juil. 19 02:55:31 llewellyn smartd[451]: Device: /dev/sda [SAT], previous self-test completed without error
juil. 19 03:26:02 llewellyn smartd[451]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
juil. 19 03:55:31 llewellyn smartd[451]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 34
juil. 19 03:55:31 llewellyn smartd[451]: Device: /dev/sda [SAT], previous self-test could not complete due to a fatal or unknown error
(...)
juil. 19 11:23:09 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 19 11:23:09 llewellyn kernel: end_request: I/O error, dev sda, sector 72829890
(...)
juil. 19 18:55:15 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 19 18:55:15 llewellyn kernel: end_request: I/O error, dev sda, sector 71426050
(...)
juil. 19 22:14:50 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 19 22:14:50 llewellyn kernel: end_request: I/O error, dev sda, sector 72564994
(...)
juil. 19 23:19:04 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 19 23:19:04 llewellyn kernel: end_request: I/O error, dev sda, sector 56920066
(...)
juil. 20 03:51:16 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 20 03:51:16 llewellyn kernel: end_request: I/O error, dev sda, sector 61002754
(...)
juil. 20 11:36:13 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 20 11:36:13 llewellyn kernel: end_request: I/O error, dev sda, sector 56936898
(...)
juil. 21 22:40:37 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 21 22:40:37 llewellyn kernel: end_request: I/O error, dev sda, sector 64332610
juil. 21 22:40:37 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 21 22:40:37 llewellyn kernel: end_request: I/O error, dev sda, sector 64332610
(...)
juil. 21 22:56:40 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 21 22:56:40 llewellyn kernel: end_request: I/O error, dev sda, sector 64332226
(...)
juil. 23 06:51:20 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 23 06:51:20 llewellyn kernel: end_request: I/O error, dev sda, sector 63981570
(...)
juil. 23 09:33:24 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 23 09:33:24 llewellyn kernel: end_request: I/O error, dev sda, sector 70140418
(...)
juil. 23 09:36:06 llewellyn kernel: sd 0:0:0:0: [sda] CDB:
juil. 23 09:36:06 llewellyn kernel: end_request: I/O error, dev sda, sector 70132930
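
To see whether the errors cluster on particular days, the journal lines can be tallied per day. A rough sketch, assuming the journal format shown above (month and day in the first two fields); io_errors_per_day is a made-up name, and the sample lines are copied from the excerpt:

```shell
#!/bin/sh
# io_errors_per_day: hypothetical helper. Reads journalctl output on stdin and
# prints "<month> <day> <count>" for each day that logged an I/O error on sda.
# Output is sorted by day number only, so it assumes a single month.
io_errors_per_day() {
    awk '/end_request: I\/O error, dev sda/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' | sort -k2,2n
}

# Real use would be something like:
#   journalctl -b | io_errors_per_day
# Here, three lines taken from the excerpt above:
io_errors_per_day <<'EOF'
juil. 19 11:23:09 llewellyn kernel: end_request: I/O error, dev sda, sector 72829890
juil. 19 18:55:15 llewellyn kernel: end_request: I/O error, dev sda, sector 71426050
juil. 20 03:51:16 llewellyn kernel: end_request: I/O error, dev sda, sector 61002754
EOF
```

The "sd 0:0:0:0: [sda] CDB:" lines are ignored on purpose, so each failed request is counted once.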

Six months ago I had the same issue. Strangely enough, it disappeared after I upgraded the firmware of one of the SATA hard drives (not the SSD's).

Please help me learn how to check the real state of an SSD. It's hard to move on when you don't know whether a drive's issue is hardware-related.

Useful references I checked, though they didn't tell me how to find what's causing the system's freezes:
* Checking SMART's statuses or conditions, by GrapefruiTgirl
* Disabling TRIM on the kernel level (unresolved). The "failed command: DATA SET MANAGEMENT" line shows that the kernel hangs while trying to issue a TRIM command.
* http://www.linuxjournal.com/magazine/mo … t?page=0,0
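
Since the failing command is a TRIM, it helps to know which mounted filesystems issue discards online. A small sketch, assuming the usual /proc/mounts layout (options in the fourth field); mounts_with_discard is a made-up name:

```shell
#!/bin/sh
# mounts_with_discard: hypothetical helper. Prints "<device> <mountpoint>" for
# every entry whose option list (4th field) contains the "discard" option.
# Reads the file given as $1, or /proc/mounts when no argument is passed.
mounts_with_discard() {
    awk '{
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++)
            if (opt[i] == "discard") { print $1, $2; break }
    }' "${1:-/proc/mounts}"
}

# Only run against the live table when it is readable.
if [ -r /proc/mounts ]; then
    mounts_with_discard
fi
```

If the filesystems on the SSD show up here, removing "discard" from their /etc/fstab entries would stop the filesystem from sending TRIM on every delete. Whether that cures the freezes is untested on my side.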

Last edited by kozaki (2014-07-29 16:47:19)


Seeded last month: Arch 50 gig, derivatives 1 gig
Desktop @3.3GHz 8 gig RAM, linux-ck
laptop #1 Atom 2 gig RAM, Arch linux stock i686 (6H w/ 6yrs old battery smile) #2: ARM Tegra K1, 4 gig RAM, ChrOS
Atom Z520 2 gig RAM, OMV (Debian 7) kernel 3.16 bpo on SDHC | PGP Key: 0xFF0157D9


#2 2014-07-26 10:20:24

kozaki
Member
From: London >. < Paris
Registered: 2005-06-13
Posts: 671

Re: SSD freezes for a few days then returns to normal and smart OK

What I find strange is that only one of the ten latest (long?) self-tests couldn't complete; all the others went fine. It's unclear to me which result to trust:

juil. 25 02:35:26 llewellyn smartd[442]: Device: /dev/sdc [SAT], previous self-test completed without error ( x6 )
juil. 26 02:35:25 llewellyn smartd[442]: Device: /dev/sda [SAT], previous self-test could not complete due to a fatal or unknown error
juil. 26 02:35:25 llewellyn smartd[442]: Device: /dev/sdb [SAT], previous self-test completed without error
juil. 26 02:35:25 llewellyn smartd[442]: Device: /dev/sdc [SAT], previous self-test completed without error
juil. 26 03:35:25 llewellyn smartd[442]: Device: /dev/sda [SAT], self-test in progress, 10% remaining
juil. 26 03:35:25 llewellyn smartd[442]: Device: /dev/sdb [SAT], self-test in progress, 70% remaining
juil. 26 03:35:25 llewellyn smartd[442]: Device: /dev/sdc [SAT], self-test in progress, 60% remaining
juil. 26 04:05:25 llewellyn smartd[442]: Device: /dev/sda [SAT], previous self-test completed without error
juil. 26 04:05:25 llewellyn smartd[442]: Device: /dev/sdb [SAT], self-test in progress, 40% remaining
juil. 26 04:05:25 llewellyn smartd[442]: Device: /dev/sdc [SAT], self-test in progress, 20% remaining
juil. 26 04:35:25 llewellyn smartd[442]: Device: /dev/sdb [SAT], self-test in progress, 10% remaining
juil. 26 04:35:26 llewellyn smartd[442]: Device: /dev/sdc [SAT], previous self-test completed without error ( x2 )
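
To compare outcomes per drive, the smartd lines can be tallied. A rough sketch, assuming the smartd message format shown above (device path in the seventh field); selftest_tally is a made-up name, and the "( x6 )" style repeat annotations I added by hand are not counted:

```shell
#!/bin/sh
# selftest_tally: hypothetical helper. Reads smartd journal lines on stdin and
# prints "<device> ok=<n> failed=<n>" for each device that reported a result.
selftest_tally() {
    awk '/previous self-test/ {
             dev = $7                        # e.g. /dev/sda
             seen[dev] = 1
             if (/completed without error/) ok[dev]++
             else if (/could not complete/) bad[dev]++
         }
         END { for (d in seen) printf "%s ok=%d failed=%d\n", d, ok[d], bad[d] }' | sort
}

# Real use would be something like:
#   journalctl | grep smartd | selftest_tally
# Here, three lines taken from the log above:
selftest_tally <<'EOF'
juil. 26 02:35:25 llewellyn smartd[442]: Device: /dev/sda [SAT], previous self-test could not complete due to a fatal or unknown error
juil. 26 02:35:25 llewellyn smartd[442]: Device: /dev/sdb [SAT], previous self-test completed without error
juil. 26 04:05:25 llewellyn smartd[442]: Device: /dev/sda [SAT], previous self-test completed without error
EOF
```

The drive also keeps its own self-test log, which "smartctl -l selftest /dev/sda" prints, including the LBA of the first error when a test fails.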
