
#1 2017-08-29 22:56:47

lllars
Member
Registered: 2013-05-17
Posts: 31

Hard drive crashing -- how to get error logs?

I'm having an occasional problem where my system is running fine and then suddenly I can't access the hard drive at all.  I can't save files from already running programs.  Trying to run any program from the terminal gives something like bash: input/output error.

The system works fine upon reboot, and I don't seem to be getting any data corruption.  The smartctl -a output looks fine.  The filesystem is XFS.  This crash has happened twice in the last week, and only one other time before that, maybe 2-3 weeks ago.

The problem drive is a Samsung PM961 1 TB NVMe drive, and the system is a Lenovo P51 laptop.  There is another SSD in the laptop, part of which is partitioned for Windows 10 and part of which is unallocated space.

How can I get more info about what is happening?  I've tried running journalctl -b -1 after rebooting, but there are no errors recorded there.  Should I be looking anywhere else for logs?  Are there any programs I could leave running in the background and then check when I notice a crash?

Last edited by lllars (2017-08-29 22:59:43)


#2 2017-08-29 23:15:40

loqs
Member
Registered: 2014-03-06
Posts: 17,327

Re: Hard drive crashing -- how to get error logs?

I would suggest checking the disk using S.M.A.R.T.
You could also use `journalctl -f` to follow the journal or `dmesg -w` to follow dmesg.
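
One possible reason `journalctl -b -1` shows nothing is that the journal lives on the drive that stops responding, so the last messages may never reach disk. A minimal sketch, assuming the second SSD has a filesystem mounted at the hypothetical path /mnt/spare: follow the kernel log into a file there so the messages have a chance of surviving the crash. The function names and the mount point are illustrative, not something from this thread.

```shell
#!/bin/sh
# Sketch: follow the kernel log into a file on a different drive.
# Assumption: the target directory (e.g. /mnt/spare) is on the second SSD.

# Build a timestamped log file name inside the given directory.
kernel_log_path() {
    printf '%s/dmesg-%s.log\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# Follow the kernel ring buffer and append it to a file in the given directory.
# stdbuf -oL requests line buffering so each message is written out promptly.
follow_kernel_log() {
    stdbuf -oL dmesg --follow --ctime >> "$(kernel_log_path "$1")"
}

# Usage (run as root if dmesg access is restricted):
#   follow_kernel_log /mnt/spare &
```

Messages written just before a hard power-off may still be lost from the page cache unless the target filesystem is mounted with the sync option.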


#3 2017-08-30 01:51:09

lllars
Member
Registered: 2013-05-17
Posts: 31

Re: Hard drive crashing -- how to get error logs?

Thanks.  I've now got 'journalctl -f' and 'dmesg -w' running in the background.  I don't think I can run SMART tests because it is an NVMe drive.

I suspect it is an issue related to APST.  There is a known issue with the PM951 (note that I've got a 961, not a 951).  It's just hard to troubleshoot with it happening so infrequently.  See here for the known issue: https://github.com/damige/linux-nvme/issues/2
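
To test the APST theory, one option is to check the latency limit the kernel currently uses for APST and, as an experiment, disable APST for a while. A sketch, assuming the nvme_core module exposes its default_ps_max_latency_us parameter under /sys (kernels 4.11 and later should); the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: show the APST latency limit currently in effect.
# The sysfs path is passed as an argument so the function is easy to try out.
APST_PARAM=/sys/module/nvme_core/parameters/default_ps_max_latency_us

apst_max_latency() {
    if [ -r "$1" ]; then
        printf 'APST max latency: %s us\n' "$(cat "$1")"
    else
        printf 'APST parameter not found at %s\n' "$1"
    fi
}

# Usage:
#   apst_max_latency "$APST_PARAM"
#
# To disable APST as an experiment, add this to the kernel command line
# and reboot:
#   nvme_core.default_ps_max_latency_us=0
```

If the crashes stop with APST disabled, that would point toward the power-state transitions, at the cost of higher idle power draw.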


#4 2017-08-30 07:44:00

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,672

Re: Hard drive crashing -- how to get error logs?

You can run SMART tests on an NVMe drive; they should support it as well.


#5 2017-08-30 14:05:43

lllars
Member
Registered: 2013-05-17
Posts: 31

Re: Hard drive crashing -- how to get error logs?

I don't appear to be able to run SMART tests.  Maybe I'm missing something?

$ sudo smartctl -t short /dev/nvme0
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-2-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

NVMe device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

$ sudo smartctl -a /dev/nvme0
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-2-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLW1T0HMLH-00000
Serial Number:                      S2U3NX0J103036
Firmware Version:                   CXY7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization:            434,903,560,192 [434 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Aug 30 10:02:18 2017 EDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning  Comp. Temp. Threshold:     68 Celsius
Critical Comp. Temp. Threshold:     71 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        30 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    222,354 [113 GB]
Data Units Written:                 1,338,117 [685 GB]
Host Read Commands:                 7,515,904
Host Write Commands:                11,179,893
Controller Busy Time:               30
Power Cycles:                       483
Power On Hours:                     50
Unsafe Shutdowns:                   22
Media and Data Integrity Errors:    0
Error Information Log Entries:      172
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               30 Celsius
Temperature Sensor 2:               34 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        172     0  0x0008  0x4004      -            0     0     -
  1        171     0  0x0008  0x4004      -            0     0     -
  2        170     0  0x0008  0x4004      -            0     0     -
  3        169     0  0x0008  0x4004  0x02c            0     0     -
  4        168     0  0x0022  0x4004  0x02c            0     0     -
  5        167     0  0x0021  0x4004  0x02c            0     0     -
  6        166     0  0x0008  0x4004      -            0     0     -
  7        165     0  0x0008  0x4004      -            0     0     -
  8        164     0  0x0008  0x4004      -            0     0     -
  9        163     0  0x0008  0x4004      -            0     0     -
 10        162     0  0x0008  0x4004      -            0     0     -
 11        161     0  0x0008  0x4004      -            0     0     -
 12        160     0  0x0008  0x4004      -            0     0     -
 13        159     0  0x0008  0x4004      -            0     0     -
 14        158     0  0x0008  0x4004      -            0     0     -
 15        157     0  0x0008  0x4004      -            0     0     -
... (48 entries not shown)

$ sudo smartctl -c /dev/nvme0
[sudo] password for lars: 
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-2-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning  Comp. Temp. Threshold:     68 Celsius
Critical Comp. Temp. Threshold:     71 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.60W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     5.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
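
The `smartctl -a` output above reports 172 entries in the error information log. One way to tell whether those entries line up with the crashes is to record the counter periodically and compare it after the next incident. A minimal sketch that extracts the counter from `smartctl -a` text; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: pull the "Error Information Log Entries" counter out of
# `smartctl -a` output read from standard input.
error_log_entries() {
    # Split on the colon, then strip spaces and thousands separators
    # from the value before printing it.
    awk -F: '/^Error Information Log Entries/ { gsub(/[ ,]/, "", $2); print $2 }'
}

# Usage (needs root to query the drive):
#   sudo smartctl -a /dev/nvme0 | error_log_entries
```

If nvme-cli is installed, `nvme error-log /dev/nvme0` can show the individual entries in more detail.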

