#1 2017-04-02 23:58:38

AndreasGB
Member
Registered: 2016-01-05
Posts: 18

COMRESET of HDD fails during running System, how to proceed?

Dear Community,

Yesterday, while I was writing a document, saving failed because my whole filesystem had become read-only, which it was not before. Looking at journalctl -xe, I concluded that something had gone wrong with the hard disk.
The next step I took was to reboot the laptop (T440p). During boot, as it tried to recover the filesystem journal, I was prompted to run fsck manually, which found some errors but was able to correct them. After that, the computer booted again and I immediately ran a backup of my files.

Thinking that this might have been a one-time event, I continued using the machine, but today the same error occurred. This time I took a photo of it, which I will attach to the post. No saved version is available, though, because Chrome crashed when I tried to open pastebin and I obviously could not write to disk anymore.

After rebooting and running fsck again, I ran lsblk and cat /proc/scsi/scsi to see which drive was the failing one, and what I gather from the output is that the 1 TB HDD is the culprit.

╭─andreas@Dagger ~  
╰─➤  lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                             8:0    0 931,5G  0 disk  
├─sda1                          8:1    0   512M  0 part  
├─sda2                          8:2    0     1G  0 part  /boot
└─sda3                          8:3    0   930G  0 part  
  └─bcache0                   254:0    0   930G  0 disk  
    └─lvm                     253:0    0   930G  0 crypt 
      ├─DaggerStorage-swapvol 253:1    0     8G  0 lvm   [SWAP]
      ├─DaggerStorage-rootvol 253:2    0    30G  0 lvm   /
      └─DaggerStorage-homevol 253:3    0   892G  0 lvm   /home
sdb                             8:16   0 119,2G  0 disk  
└─bcache0                     254:0    0   930G  0 disk  
  └─lvm                       253:0    0   930G  0 crypt 
    ├─DaggerStorage-swapvol   253:1    0     8G  0 lvm   [SWAP]
    ├─DaggerStorage-rootvol   253:2    0    30G  0 lvm   /
    └─DaggerStorage-homevol   253:3    0   892G  0 lvm   /home
sr0                            11:0    1  1024M  0 rom   
╭─andreas@Dagger ~  
╰─➤  cat /proc/scsi/scsi 
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST1000LM024 HN-M Rev: 0001
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: TS128GMTS400     Rev: 6I  
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: HL-DT-ST Model: DVDRAM GU70N     Rev: LS20
  Type:   CD-ROM                           ANSI  SCSI revision: 05
╭─andreas@Dagger ~  
╰─➤

Am I right in what I assumed and did so far, or should I do something different, either now or when this happens the next time?

How should I proceed from here? I ordered a new 1 TB drive. Will I be fine just dd'ing the current HDD to the new one, or is there something I should be careful about?

Thank you very much for your input and help :)

Image of journalctl -xe today: http://i.imgur.com/qQyiO9b.jpg


#2 2017-04-03 08:50:53

R00KIE
Forum Fellow
From: Between a computer and a chair
Registered: 2008-09-14
Posts: 4,734

Re: COMRESET of HDD fails during running System, how to proceed?

Post the output of 'smartctl -a /dev/sdX' for both sda and sdb.


R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K


#3 2017-04-03 09:29:07

mich41
Member
Registered: 2012-06-22
Posts: 796

Re: COMRESET of HDD fails during running System, how to proceed?

AndreasGB wrote:

I obviously could not write to disk

Pendrives come in handy in such cases.
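
A rough sketch of that idea, assuming the shell and basic tools still run after the disk has dropped out: filter the kernel log for libata messages and write them to a USB stick. The mount point /mnt/usb and the helper name ata_messages are made up for illustration, and the pattern only matches the usual "ataN:" / "ataN.NN:" prefixes.

```shell
#!/bin/sh
# ata_messages: read kernel log lines on stdin and keep only those that
# carry a libata port prefix such as "ata1:" or "ata1.00:".
ata_messages() {
    grep -E '(^|[^[:alnum:]_])ata[0-9]+(\.[0-9]+)?:'
}

# Hypothetical usage, with a pendrive mounted at /mnt/usb (run as root):
#   dmesg | ata_messages > /mnt/usb/ata-errors.txt
```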

It looks like the disk is simply disappearing for some reason. My first thought was an SSD firmware bug, but you say it's spinning rust? Well, besides 'smartctl -a', also try 'smartctl -t short' and see what comes out of it. Check the disk's power cable.

A replacement will probably fix it (unless it's the motherboard's fault, which is unlikely IMO).

You can dd one disk onto another provided that they are the exact same size or the new one is larger; otherwise you will wait a few hours only to see "write error: out of space".
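
A minimal sketch of that size check, assuming `blockdev --getsize64` from util-linux is available. The device names /dev/sdX (old disk) and /dev/sdY (new disk) are placeholders, and the helper name fits_on_target is made up for illustration.

```shell
#!/bin/sh
# fits_on_target SRC_BYTES DST_BYTES
# Succeeds (exit status 0) when the target is at least as large as the source.
fits_on_target() {
    [ "$2" -ge "$1" ]
}

# Hypothetical usage, run as root with the real device names filled in:
#   src=$(blockdev --getsize64 /dev/sdX)   # old disk
#   dst=$(blockdev --getsize64 /dev/sdY)   # new disk
#   fits_on_target "$src" "$dst" || echo "target too small, do not start dd"
```

Comparing byte counts rather than the rounded sizes printed by lsblk avoids surprises when two "1 TB" disks differ by a few megabytes.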


#4 2017-04-03 10:05:24

AndreasGB
Member
Registered: 2016-01-05
Posts: 18

Re: COMRESET of HDD fails during running System, how to proceed?

Thanks, R00KIE and mich41, for your answers. I ran the commands you asked for. smartctl -a /dev/sda:

╭─andreas@Dagger ~  
╰─➤  sudo smartctl -a /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Samsung SpinPoint M8 (AF)
Device Model:     ST1000LM024 HN-M101MBB
Serial Number:    <removed>
LU WWN Device Id: <removed>
Firmware Version: 2BA30001
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr  3 11:48:04 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(12660) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 211) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       1
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   092   090   025    Pre-fail  Always       -       2459
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       551
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2158
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       38
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       560
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       27
192 Power-Off_Retract_Count 0x0022   100   100   000    Old_age   Always       -       31
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       21 (Min/Max 17/49)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       5553
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       38
225 Load_Cycle_Count        0x0032   078   078   000    Old_age   Always       -       230509

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Completed [00% left] (0-65535)
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a /dev/sdb:

╭─andreas@Dagger ~  
╰─➤  sudo smartctl -a /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TS128GMTS400
Serial Number:    <removed>
Firmware Version: N1126I
User Capacity:    128.035.676.160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Apr  3 11:48:34 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (   1) minutes.
Conveyance self-test routine
recommended polling time: 	 (   1) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       454
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       842
160 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
161 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       44
163 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       25
164 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       110522
165 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       165
166 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       62
167 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       108
168 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3000
169 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       97
175 Program_Fail_Count_Chip 0x0000   100   100   000    Old_age   Offline      -       0
176 Erase_Fail_Count_Chip   0x0000   100   100   000    Old_age   Offline      -       0
177 Wear_Leveling_Count     0x0000   100   100   050    Old_age   Offline      -       367
178 Used_Rsvd_Blk_Cnt_Chip  0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total  0x0000   100   100   000    Old_age   Offline      -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0000   100   100   000    Old_age   Offline      -       47
194 Temperature_Celsius     0x0000   100   100   000    Old_age   Offline      -       11
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       64449
196 Reallocated_Event_Count 0x0000   100   100   016    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   050    Old_age   Offline      -       0
232 Available_Reservd_Space 0x0000   100   100   000    Old_age   Offline      -       100
241 Total_LBAs_Written      0x0000   100   100   000    Old_age   Offline      -       56400
242 Total_LBAs_Read         0x0000   100   100   000    Old_age   Offline      -       46096
245 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       442088

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
    6        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

self-test of sda:

╭─andreas@Dagger ~  
╰─➤  sudo smartctl -t short /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Apr  3 11:54:53 2017

Use smartctl -X to abort test.
╭─andreas@Dagger ~  
╰─➤  sudo smartctl -l selftest /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2158         -

and of sdb:

╭─andreas@Dagger ~  
╰─➤  sudo smartctl -t short /dev/sdb   
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Mon Apr  3 11:56:48 2017

Use smartctl -X to abort test.
╭─andreas@Dagger ~  
╰─➤  sudo smartctl -l selftest /dev/sdb
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.10.4-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       198         -

I cannot really check its power cable, since the disk is held in a tiny frame against a combined SATA and power connector, but that was still screwed in place.

Since you mention faulty firmware, mich41: I did upgrade the firmware of the laptop, but that was on March 25th and it worked after that.
Also, it sometimes does not wake from sleep, but restarting by holding down the power button has worked.
The logs never showed any reason for the crash after a restart (which now makes me wonder whether this may be caused by the disappearing HDD, though that is probably unlikely too).

Last edited by AndreasGB (2017-04-03 15:11:53)


#5 2017-04-03 13:10:27

R00KIE
Forum Fellow
From: Between a computer and a chair
Registered: 2008-09-14
Posts: 4,734

Re: COMRESET of HDD fails during running System, how to proceed?

From your 'smartctl -a' output I'd say neither disk is going to fail right away, but you have to run a long test: short SMART tests do not scan the whole disk surface and can miss some problems.

Your sdb is an SSD, so the long test should be quite fast there; your sda is another matter. I have no experience with Transcend SSDs (which seems to be the brand of your SSD), but so far I've had only bad experiences with Samsung hard disks (even if it says Seagate on the sticker, it seems it was bought from Samsung - I've lost track of who bought whom, so I suppose Seagate may have bought Samsung's spinning rust division and is selling some leftover Samsung designs as Seagate's).

My advice is: back up all important data. Then issue a long test on your sda and, after that finishes, get all the SMART parameters with 'smartctl -a'. Then, for good measure, issue an offline test and get all SMART parameters again after the test finishes. Given your problems, I would expect to see values above zero for SMART parameters 197 and/or 198 after either or both tests. You can also do this for sdb, just for peace of mind and to be thorough with your tests.
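
A rough sketch of how attributes 197 and 198 could be pulled out of the attribute table after each test. It assumes the column layout shown in the 'smartctl -a' output earlier in this thread (ID in column 1, RAW_VALUE in column 10); pending_sectors is just an illustrative helper name. The long test itself is started with 'smartctl -t long /dev/sda'.

```shell
#!/bin/sh
# pending_sectors: read a smartctl attribute table on stdin and print the
# name and raw value of attributes 197 and 198.
# Exit status is 1 if either raw value is above zero, 0 otherwise.
pending_sectors() {
    awk '$1 == 197 || $1 == 198 {
             print $2 "=" $10
             if ($10 + 0 > 0) bad = 1
         }
         END { exit (bad ? 1 : 0) }'
}

# Hypothetical usage once the long test has finished (run as root):
#   smartctl -A /dev/sda | pending_sectors || echo "pending/uncorrectable sectors reported"
```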

Regarding firmware, make sure the firmware for all disks is up-to-date, as some newer revisions might include important fixes. Also make sure you don't have anything periodically reading the SMART parameters from the disks: at least some Samsung spinning rust disks have firmware bugs that can lead to data corruption when SMART parameters are probed while there is still data in buffers waiting to be flushed to disk.


R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K


#6 2017-04-03 15:29:13

mich41
Member
Registered: 2012-06-22
Posts: 796

Re: COMRESET of HDD fails during running System, how to proceed?

R00KIE wrote:

Regarding firmware, make sure the firmware for all disks is up-to-date

To make things clear, we mean disk firmware, the thing running on the tiny CPU inside the disk, not laptop firmware (BIOS/UEFI/whatever). You get this from the disk manufacturer's website. But I have to say I don't think it's the firmware's fault if it started only after 2000+ hours.
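
A small sketch for reading the currently installed disk firmware revision so it can be compared with what the manufacturer offers. It relies on the "Firmware Version:" line that 'smartctl -i' prints (visible in the output posted above); firmware_version is only an illustrative helper name.

```shell
#!/bin/sh
# firmware_version: read `smartctl -i` output on stdin and print only the
# value of the "Firmware Version:" line.
firmware_version() {
    sed -n 's/^Firmware Version:[[:space:]]*//p'
}

# Hypothetical usage (run as root):
#   smartctl -i /dev/sda | firmware_version
```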

Does the disk keep spinning afterwards? Any other sound from it?

