I'm having an occasional problem where my system is running fine and then suddenly I can't access the hard drive at all. I can't save files from already-running programs. Trying to run any program from the terminal gives something like bash: input/output error.
The system works fine after a reboot, and I don't seem to be getting any data corruption. smartctl -a output looks fine. Filesystem is XFS. This crash has happened twice in the last week, and only one other time before that, maybe 2-3 weeks ago.
The problem drive is a Samsung PM961, 1TB NVMe drive and the system is a Lenovo P51 laptop. There is another SSD in the laptop, part of which is partitioned for win10 and part of which is available empty space.
How can I get more info about what is happening? I've tried running journalctl -b -1 after rebooting, but there are no errors recorded there. Should I be looking anywhere else for logs? Are there any programs I could leave running in the background and then check when I notice a crash?
Last edited by lllars (2017-08-29 22:59:43)
I would suggest checking the disk using S.M.A.R.T.
You could also use `journalctl -f` to follow the journal or `dmesg -w` to follow dmesg.
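A likely reason journalctl -b -1 shows nothing: if the journal lives on the drive that disappears, journald has nowhere persistent to write the errors. Below is a minimal sketch of a workaround, assuming a writable filesystem on a different disk (for example the free space on the second SSD) is mounted somewhere. The /mnt/otherdisk path and the stamp_to_file helper are made up for illustration.

```shell
#!/bin/bash
# Assumption: LOG_DIR is on a filesystem that is NOT on the failing NVMe drive.
# The default path below is only a placeholder.
LOG_DIR="${LOG_DIR:-/mnt/otherdisk/nvme-debug}"

# Read lines from stdin, prefix each with the current time, and append to a file.
# Uses only bash builtins (printf %(...)T needs bash 4.2+), so it does not have
# to load any binary from the root drive once that drive has stopped responding.
stamp_to_file() {
    local out="$1" line
    while IFS= read -r line; do
        printf '%(%F %T)T %s\n' -1 "$line" >> "$out"
    done
}

# Example (run as root and leave it running until the next hang):
#   mkdir -p "$LOG_DIR"
#   dmesg -w | stamp_to_file "$LOG_DIR/dmesg.log"
```

After a hang, give it a little while before forcing a power-off so the last lines have a chance to reach the other disk, then read dmesg.log from there after rebooting.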
Thanks. I've now got 'journalctl -f' and 'dmesg -w' running in the background. I don't think I can run smart tests because it is an nvme drive.
I suspect it is an issue related to APST. There is a known issue with the PM951 (note that I've got a 961, not a 951). It's just hard to troubleshoot with it happening so infrequently. See here for the known issue: https://github.com/damige/linux-nvme/issues/2
You can run SMART tests on an NVMe drive; they should support it as well.
I don't appear to be able to run SMART tests. Maybe I'm missing something?
$ sudo smartctl -t short /dev/nvme0
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-2-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
NVMe device successfully opened
Use 'smartctl -a' (or '-x') to print SMART (and more) information
$ sudo smartctl -a /dev/nvme0
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-2-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLW1T0HMLH-00000
Serial Number: S2U3NX0J103036
Firmware Version: CXY7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization: 434,903,560,192 [434 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Aug 30 10:02:18 2017 EDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning Comp. Temp. Threshold: 68 Celsius
Critical Comp. Temp. Threshold: 71 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.60W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 5.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 30 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 222,354 [113 GB]
Data Units Written: 1,338,117 [685 GB]
Host Read Commands: 7,515,904
Host Write Commands: 11,179,893
Controller Busy Time: 30
Power Cycles: 483
Power On Hours: 50
Unsafe Shutdowns: 22
Media and Data Integrity Errors: 0
Error Information Log Entries: 172
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 30 Celsius
Temperature Sensor 2: 34 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 172 0 0x0008 0x4004 - 0 0 -
1 171 0 0x0008 0x4004 - 0 0 -
2 170 0 0x0008 0x4004 - 0 0 -
3 169 0 0x0008 0x4004 0x02c 0 0 -
4 168 0 0x0022 0x4004 0x02c 0 0 -
5 167 0 0x0021 0x4004 0x02c 0 0 -
6 166 0 0x0008 0x4004 - 0 0 -
7 165 0 0x0008 0x4004 - 0 0 -
8 164 0 0x0008 0x4004 - 0 0 -
9 163 0 0x0008 0x4004 - 0 0 -
10 162 0 0x0008 0x4004 - 0 0 -
11 161 0 0x0008 0x4004 - 0 0 -
12 160 0 0x0008 0x4004 - 0 0 -
13 159 0 0x0008 0x4004 - 0 0 -
14 158 0 0x0008 0x4004 - 0 0 -
15 157 0 0x0008 0x4004 - 0 0 -
... (48 entries not shown)
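For what it's worth, the Status column in that error log can be decoded by hand. This is a rough sketch, assuming the field layout from the NVMe specification (bit 0 is the phase tag, bits 1-8 the status code, bits 9-11 the status code type, bit 14 More, bit 15 Do Not Retry). The decode_nvme_status name is made up for illustration.

```shell
#!/bin/bash
# Split an NVMe error-log Status value into its fields.
# Assumed layout: bit 0 = phase tag, bits 1-8 = status code (SC),
# bits 9-11 = status code type (SCT), bit 14 = More, bit 15 = Do Not Retry.
decode_nvme_status() {
    local s=$(( $1 ))
    printf 'SCT=%d SC=0x%02x More=%d DNR=%d\n' \
        $(( (s >> 9) & 0x7 )) $(( (s >> 1) & 0xff )) \
        $(( (s >> 14) & 1 )) $(( (s >> 15) & 1 ))
}

decode_nvme_status 0x4004
# prints: SCT=0 SC=0x02 More=1 DNR=0
```

With status code type 0 (generic), status code 0x02 is "Invalid Field in Command". All entries are on SQId 0, the admin queue, so they look like admin commands the drive rejected rather than media errors. That fits with "Media and Data Integrity Errors: 0", but it is only a reading of the table, not a diagnosis.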
$ sudo smartctl -c /dev/nvme0
[sudo] password for lars:
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.12.8-2-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning Comp. Temp. Threshold: 68 Celsius
Critical Comp. Temp. Threshold: 71 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.60W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 5.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
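Since APST is the suspect, the interesting rows in that table are the non-operational power states (Op column "-"), which are the ones the drive can drop into when idle. Here is a small sketch that pulls them out together with entry plus exit latency in microseconds. The nonop_latency name is made up, and it assumes the 11-column layout that smartctl 6.5 prints above.

```shell
#!/bin/bash
# From `smartctl -c` style output on stdin, print each non-operational power
# state (Op column is "-") followed by Ent_Lat + Ex_Lat in microseconds.
# Assumes the 11-column power state table shown in the output above.
nonop_latency() {
    awk 'NF == 11 && $1 ~ /^[0-9]+$/ && $2 == "-" { print $1, $10 + $11 }'
}

# Rows copied from the output above.
nonop_latency <<'EOF'
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.60W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 5.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
EOF
# prints:
# 3 1710
# 4 8200
```

On a live system the same function can be fed with `sudo smartctl -c /dev/nvme0 | nonop_latency`. As far as I know, kernels 4.11 and later expose the latency limit used for APST as /sys/module/nvme_core/parameters/default_ps_max_latency_us, and booting with nvme_core.default_ps_max_latency_us=0 is the commonly suggested way to turn APST off for a test. If the hangs stop with that setting, it would point at the deep power states.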