System losing Samsung NVMe SSD and crashing

Toun · 2017-12-14 16:40:41

Hello,

My system keeps crashing very badly because the root drive sometimes gets lost. The drive in question is a Samsung 960 Pro 512GB NVMe SSD.

Given that the root itself is lost, I couldn't get a proper logfile. But thanks to a "while true; do sync; echo $(date); sleep 1; done" trick, I was able to monitor the disk and as soon as it crashed, I switched to tty2 to catch this (sorry for not having a log file):

http://images.franceserv.fr/images/2017 … e9acdd.jpg
It was followed by a flood of kernel and systemd messages like these until I hard rebooted:

EXT4-fs error (device nvme0n1p2): __ext4_get_inode_loc:4559: inode #28312661: block 113246309: comm pacman: unable to read itable block
systemd-journald[241]: Failed to write entry (9 items, 261 bytes), ignoring: Read-only file system
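
For reference, the monitoring loop mentioned above can be written as a small function. This is only a sketch: the name `heartbeat` and its two arguments are made up for illustration. The idea is that the last timestamp left on screen approximates when the disk went away.

```shell
# Print one timestamp per interval after asking the kernel to flush to disk.
# $1 = number of iterations (0 = run until interrupted; hypothetical argument)
# $2 = seconds between ticks (default 1)
heartbeat() {
  count=${1:-0}
  interval=${2:-1}
  i=0
  while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
    sync                 # flush dirty pages to the disk being watched
    date                 # timestamp of the last flush attempt
    sleep "$interval"
    i=$((i + 1))
  done
}
```

Running `heartbeat` with no arguments in a terminal behaves like the one-liner; `heartbeat 10 1` stops after ten ticks.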

This happened several times yesterday, and even bricked my system while pacman was updating the kernel. Today I did a clean install, but the problem keeps happening. Installing packages with pacman seems to trigger it almost every time, but it can also happen without me doing anything special.

I've found that pre-4.11 kernels could do that with NVMe disks not supporting APST (https://bugs.launchpad.net/ubuntu/+sour … ug/1678184). As recommended in the bug report, I've added "nvme_core.default_ps_max_latency_us=0" to my kernel arguments to disable APST completely. Although it seems to have successfully disabled APST, that hasn't fixed the problem:

$ sudo nvme get-feature -f 0x0c -H /dev/nvme0n1
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
	Autonomous Power State Transition Enable (APSTE): Disabled
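
As a cross-check, the kernel side can be inspected as well. A minimal sketch, assuming the parameter was passed on the kernel command line exactly as written above; the function name `apst_disabled_in_cmdline` is made up for illustration.

```shell
# Succeed if the given kernel command line contains the APST-disabling parameter.
# $1 = command line text, e.g. the contents of /proc/cmdline
apst_disabled_in_cmdline() {
  case " $1 " in
    *" nvme_core.default_ps_max_latency_us=0 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the running system:
#   apst_disabled_in_cmdline "$(cat /proc/cmdline)" && echo "APST disabled on the command line"
# The value the driver actually uses should also be readable here
# (assuming the nvme_core module is loaded or built in):
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```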

I've also updated the SSD firmware to the latest version to no avail.

Here are more details on the drive:

$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S3EWNCAHC07421T      Samsung SSD 960 PRO 512GB                1          26.51  GB / 512.11  GB    512   B +  0 B   2B6QCXP7

This was not happening until yesterday (December 13, 2017), but I didn't pay attention to what was recently updated, so I cannot correlate it with a kernel update or anything else.

Do you think this could be a hardware fault (SMART stats are looking good to me), or a recent bug introduction?

Thanks

Last edited by jasonwryan (2017-12-14 20:03:43)

progandy · 2017-12-14 16:52:03

This was not happening until yesterday (December 13, 2017), but I didn't pay attention to what was recently updated, so I cannot correlate it with a kernel update or anything else.

Do you know the date of your last successful update? You can then use ALA to downgrade to that date. The downgrade process should also show you the changed packages.
https://wiki.archlinux.org/index.php/Ar … cific_date
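
As a rough sketch of what that looks like: the archive exposes one repository snapshot per day, so the mirrorlist can point at a dated path. The helper name `ala_server_line` and the date used below are purely illustrative, not a known-good date for this system.

```shell
# Print the mirrorlist line for an Arch Linux Archive snapshot.
# $1 = snapshot date as YYYY-MM-DD (illustrative argument)
ala_server_line() {
  path=$(printf '%s' "$1" | tr '-' '/')
  # $repo and $arch are left literal on purpose; pacman expands them itself.
  printf 'Server=https://archive.archlinux.org/repos/%s/$repo/os/$arch\n' "$path"
}

# Usage (as root, after backing up the current mirrorlist):
#   ala_server_line 2017-12-10 > /etc/pacman.d/mirrorlist
#   pacman -Syyuu
```

The double `y` forces a database refresh and the double `u` allows downgrades, which is what lets pacman move packages back to the snapshot's versions.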

Edit: You probably did not save your pacman.log from the old installation, or you could use that to see the changes.

Last edited by progandy (2017-12-14 16:53:54)

Toun · 2017-12-14 17:03:20

Indeed the system was unbootable and my data drive was intact so I didn't bother saving any files.

Edit: That said, I ran "pacman -Syu" almost every day, so is there a history of kernel updates in the Arch repository that I could look at?

Last edited by Toun (2017-12-14 17:06:09)

progandy · 2017-12-14 17:15:08

Edit: That said, I ran "pacman -Syu" almost every day, so is there a history of kernel updates in the Arch repository that I could look at?

Don't forget that a kernel update takes effect only after a reboot.

You can see the kernel history here:
https://git.archlinux.org/svntogit/pack … ages/linux
https://archive.archlinux.org/packages/l/linux/

If you suspect a kernel issue, you could try linux-lts, but only with APST disabled, since it is v4.9 (older than 4.11).

Toun · 2017-12-14 17:29:42

Thanks. The last kernel update was on Sunday, and I have rebooted since then, so I believe the kernel's not at fault here. I'll try LTS anyway.

If LTS ends up working, is there any way to try older non-LTS releases to find which one may be the culprit? I think partially upgrading the system is not a particularly good idea.

progandy · 2017-12-14 17:33:56

Toun wrote:

If LTS ends up working, is there any way to try older non-LTS releases to find which one may be the culprit? I think partially upgrading the system is not a particularly good idea.

The kernel is an exception: you can use older packages to test them, especially versions between the current and the LTS version. You may lose some external kernel modules like nvidia or virtualbox, though. I already gave you the link to the older packages.

Last edited by progandy (2017-12-14 17:34:32)

Lazzu · 2017-12-14 19:43:49

I have the 1TB EVO model of the same kind.

$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S3ETNX0J208502P      Samsung SSD 960 EVO 1TB                  1         409,45  GB /   1,00  TB    512   B +  0 B   2B7QCXE7

The only problem I've had is that the filesystem disappeared after reboot, which was fixed by this trick.

I haven't done my daily package update routine yet. I don't think I did it yesterday either. I'll be keeping a close eye on this thread and will try to remember not to upgrade yet.

Last edited by Lazzu (2017-12-14 19:44:47)

Toun · 2017-12-14 19:57:43

After further testing, it looks like this happens when disk I/O gets a bit intense. I could install some very small packages without crashing (hdparm), and trying to benchmark the disk with "hdparm -Tt /dev/nvme0n1" did crash the disk. Also I was totally unable to install another kernel; pacman downloading the ~80MB package was enough to crash.

Given that I was able to install the system from scratch in a live session without issue, I'm pretty confident this isn't a hardware fault. If the live ISO image uses kernel 4.14.3-1 (the latest kernel released in November), I'll try to somehow install that from a live session.

Lazzu · 2017-12-14 20:19:25

Toun wrote:

After further testing, it looks like this happens when disk I/O gets a bit intense. I could install some very small packages without crashing (hdparm), and trying to benchmark the disk with "hdparm -Tt /dev/nvme0n1" did crash the disk. Also I was totally unable to install another kernel; pacman downloading the ~80MB package was enough to crash.
Given that I was able to install the system from scratch in a live session without issue, I'm pretty confident this isn't a hardware fault. If the live ISO image uses kernel 4.14.3-1 (the latest kernel released in November), I'll try to somehow install that from a live session.

Does it get hot? NVMe drives are known for overheating.

Toun · 2017-12-14 21:19:52

All right, I downgraded to linux-4.14.3-1 and so far no problem whatsoever. I just installed several big packages while running "hdparm -Tt /dev/nvme0n1" in a loop and it held up. I'll still wait until I work on my desktop tomorrow to make sure, but it looks like a kernel issue.
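
One way such a downgrade can be done, shown as a sketch rather than the exact steps used here: build the package URL from the archive listing linked earlier in the thread and hand it to `pacman -U`. The helper name `kernel_pkg_url` is made up, and the `-x86_64.pkg.tar.xz` file name suffix is an assumption that should be checked against the archive listing.

```shell
# Print the assumed archive URL for a given linux package version.
# $1 = pkgver-pkgrel, e.g. 4.14.3-1
kernel_pkg_url() {
  printf 'https://archive.archlinux.org/packages/l/linux/linux-%s-x86_64.pkg.tar.xz\n' "$1"
}

# Usage (pacman -U accepts a URL as well as a local file):
#   sudo pacman -U "$(kernel_pkg_url 4.14.3-1)"
```

Out-of-tree modules such as nvidia would need a matching older package as well.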

Lazzu wrote:

Does it get hot? NVMe drives are known for overheating.

I don't think so, it's cooled with an EKWB heatsink. But yeah, I'm not sure if I should dig into this and submit a bug report or just wait for the next release.

Lazzu · 2017-12-15 16:30:40

Toun wrote:

All right, I downgraded to linux-4.14.3-1 and so far no problem whatsoever. I just installed several big packages while running "hdparm -Tt /dev/nvme0n1" in a loop and it held up. I'll still wait until I work on my desktop tomorrow to make sure, but it looks like a kernel issue.

I just upgraded to 4.14.5-1 from 4.14.4-1, and I haven't encountered any problems yet with my 960 EVO.

Toun wrote:

Lazzu wrote:
Does it get hot? NVMe drives are known for overheating.
I don't think so, it's cooled with an EKWB heatsink. But yeah, I'm not sure if I should dig into this and submit a bug report or just wait for the next release.

At work we have a build server that spits out Linux server, Android client and iOS client builds from each push to the main branch of the git repository. At first it worked fine, but after some time the server started crashing at the end of the day. We figured out that it was the NVMe drive overheating even though it was being cooled. The problem got worse over time, and the server host guys did some maintenance that was supposed to improve the temperature of the drive, but it still keeps crashing from a single push to the git repo, and sometimes even while idling. In contrast, I've had no problems with my NVMe drive at home, which only has the default and imo fairly bad cooler that came with the ASUS mobo. That one has only seen a couple of months of usage so far, though, so overheating is still a very possible issue.

Toun · 2017-12-15 20:40:08

I might actually check that the temps are low enough. On linux-4.14.3, the NVidia driver didn't work and so my 3 graphics cards were totally idle, which could explain why the SSD wasn't warm enough to crash. But if that's the case, shouldn't the SSD just throttle down?

I'll reinstall the latest kernel and monitor temps to try and make a correlation.

progandy · 2017-12-15 20:48:58

3 graphics cards? Is your power supply powerful enough? Maybe the SSD doesn't get enough power and shuts down. Or there may be some conflict on a data bus.

You could try to remove a few graphics cards or remove the driver and check if that helps.

Last edited by progandy (2017-12-15 20:52:59)

Toun · 2017-12-15 21:06:58

My PSU is rated 1200W and I made sure to spread devices over all the 12V rails. I measured a maximum of 700W at the wall so I doubt this is the root of the problem.

Currently watching the temps, and actually with the disk at idle and the graphics cards mining, it's already at 52 Celsius. I will now try to stress the disk and see at what kind of temps it crashes.

EDIT: Ok so the temps are not a problem, they maxed out at 60 Celsius during my tests. But I can at least say that only write operations crash the disk. I could "hdparm -t /dev/nvme0n1" all I want, but the first "dd if=/dev/zero of=/test bs=1000000000 count=1" made it crash at 50 Celsius.
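
For anyone wanting to repeat this kind of test, here is a minimal sketch of a temperature logger built on `nvme smart-log`. The function names are made up, and the parser assumes a "temperature : NN C" style line, which may differ between nvme-cli versions.

```shell
# Extract the temperature (integer degrees Celsius) from `nvme smart-log`
# text on stdin. The assumed layout is a line starting with "temperature",
# followed by a colon and the value.
parse_nvme_temp() {
  awk -F: '/^temperature/ { v = $2; sub(/^[ \t]+/, "", v); sub(/[^0-9].*$/, "", v); print v; exit }'
}

# Print one "HH:MM:SS temperature" line per second until interrupted.
# $1 = device (defaults to /dev/nvme0n1); needs root for the nvme call.
log_nvme_temp() {
  dev=${1:-/dev/nvme0n1}
  while sleep 1; do
    printf '%s %s\n' "$(date +%T)" "$(nvme smart-log "$dev" | parse_nvme_temp)"
  done
}
```

Running `log_nvme_temp` as root on another tty while the write test runs leaves the last reading before the crash on screen, since nothing can be written to the root filesystem once it goes read-only.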

Is there any way to use the nvidia driver on an older kernel? I'm guessing installing a package from https://archive.archlinux.org/packages/n/nvidia/ ? That way I could confirm the kernel is the problem.

EDIT2: I just managed to crash linux-4.14.3-1 too ... guess the latest kernel isn't the problem either. Now I'm out of ideas. This is so inconsistent and hard to troubleshoot, I'll try some benchmarking software on Windows when I'm feeling like reinstalling my system again.

Thanks anyway for all your suggestions.

Last edited by Toun (2017-12-15 22:19:07)

Da_Coynul · 2018-01-03 01:57:51

@Toun -

Did you ever get this sorted out? After reading this thread, I am wondering if you have a hardware problem. Have you thought about RMA'ing the SSD?

Arch Linux
