
#1 2018-07-04 10:58:03

citrus
Member
Registered: 2018-07-04
Posts: 8

[SOLVED] NVMe Controller Reset in 4.17

Hello!

I recently updated to 4.17.3-1 and started experiencing fatal file-system failures related to my SSD.

Here's the setup of my laptop:
- Lenovo ThinkPad T570 (8GB RAM, Intel Core i7 7th gen.), BIOS v1.34 (latest)
- single 500GB SSD drive (LENSE20512GMSP34MEAT2TA) connected via NVMe
- EXT4 file system encrypted with LUKS
- checked with SMART today - no health issues found

My problems usually start once I'm logged into my X session, typically after a few minutes of use. For some reason, the NVMe controller goes "down" and the kernel attempts to reset it. Consequently, the entire file system is remounted read-only and I get a flood of I/O errors. After that, I can't even shut the computer down properly because both "logind" and "init" become inaccessible from the shell, so I have to perform a hard power-off.

I have investigated this for a day and found that the condition might be related to hardware quirks in APST (power saving). To verify, I tried running my kernel with the option `nvme_core.default_ps_max_latency_us=0`, which should disable APST altogether. Using `nvme-cli`, I confirmed that APST was previously enabled and is now disabled with the option in place. However, the issue still persists.
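For anyone trying the same thing, this is roughly how I set it (assuming GRUB; adapt to your bootloader):

```shell
# /etc/default/grub -- add the option to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0"
# then regenerate the config and reboot:
#   grub-mkconfig -o /boot/grub/grub.cfg
# and verify the parameter took effect:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # should print 0
#   nvme get-feature -f 0x0c -H /dev/nvme0                           # APSTE: Disabled
```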

Since I'm rebooting quite frequently now, I have noticed another, possibly secondary symptom related to this problem, which appears in a fraction of my boots. Sometimes, the BIOS has trouble finding GRUB. Other times, LUKS has trouble "decrypting the keyslot". If the problem really is somehow related to the NVMe controller, I can only speculate that if a reset occurs prior to (or during) the boot sequence, the software acts as if there were no hard drive present.

I'd like to note that my laptop is only 6 months old, and until yesterday my SSD was operating without problems. Unfortunately, NVMe and APST support is not well documented, so I'm starting to run out of ideas. I would certainly welcome a second opinion.

Find the "screenshot" of my dmesg here: https://imgur.com/a/fWyfihk


Thank you in advance!
Petr

Last edited by citrus (2018-10-22 19:10:09)


#2 2018-07-04 11:06:37

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 11,861

Re: [SOLVED] NVMe Controller Reset in 4.17

I currently can't look at your screenshot; in any case, please just post text as text.


#3 2018-07-04 11:47:09

citrus
Member
Registered: 2018-07-04
Posts: 8

Re: [SOLVED] NVMe Controller Reset in 4.17

I would always prefer to post text as text. However, in this case the issue disrupts my network manager, and since the file system is read-only, saving the output is not possible. For these reasons, I had to resort to photographing my screen with a phone camera.

For your convenience, I took the liberty of transcribing the interesting part: https://pastebin.com/NMKSHYKv

Last edited by citrus (2018-07-04 11:47:48)


#4 2018-07-05 15:00:00

loqs
Member
Registered: 2014-03-06
Posts: 12,619

Re: [SOLVED] NVMe Controller Reset in 4.17

Can you try linux-lts, or an older version of the linux package that did not exhibit the issue, to rule out a hardware problem?
You could also try 4.18-rc3, or bisect between 4.16 and 4.17 to find the causal commit.


#5 2018-07-05 15:41:50

citrus
Member
Registered: 2018-07-04
Posts: 8

Re: [SOLVED] NVMe Controller Reset in 4.17

I will try LTS. However, I don't think it's wise to install anything on the SSD at this point; the I/O issue may occur at any time and disrupt whatever installation is ongoing.
Instead, I'll try several bootable distros and see whether the issue persists. If it does, we will have pretty solid evidence in favor of a hardware problem.


#6 2018-07-06 07:48:42

citrus
Member
Registered: 2018-07-04
Posts: 8

Re: [SOLVED] NVMe Controller Reset in 4.17

Not yet done with my tests, but I found some nice reading on this. It might be useful for people experiencing the same problem.
Apparently, the same problem was encountered on Fedora 26 and Ubuntu Zesty (4.10).

https://bugzilla.redhat.com/show_bug.cgi?id=1487421
https://bugs.launchpad.net/ubuntu/+sour … ug/1678184


#7 2018-09-27 18:35:13

crazyboycjr
Member
Registered: 2016-02-05
Posts: 1

Re: [SOLVED] NVMe Controller Reset in 4.17

citrus wrote:

Not yet done with my tests, but found some nice reading on this. Might be useful for people experiencing the same problem.
Apparently, the same problem was encountered on Fedora 26 and Ubuntu Zesty (4.10).

https://bugzilla.redhat.com/show_bug.cgi?id=1487421
https://bugs.launchpad.net/ubuntu/+sour … ug/1678184


I have exactly the same symptom as you. I have tried nvme_core.default_ps_max_latency_us=0, 2000, and other values based on the two links you posted; a few days have passed, but the problem still exists. However, the kernel parameter does seem to extend the uptime somewhat (maybe to several hours, though sometimes it still crashes within an hour of rebooting).

# nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
	Autonomous Power State Transition Enable (APSTE): Disabled
...

At first I suspected a hardware issue with my SSD (my laptop is a Dell XPS 15 9550 with a Samsung 950 Pro NVMe). However, Windows seems to run fine on it.

I am on kernel 4.18.9, and the issue still exists there, so neither a newer kernel nor disabling APST works for me. This wiki section refers to some related information:
https://wiki.archlinux.org/index.php/De … 0)#Storage
However, in my case, the NVMe controller fails whether on battery power or plugged in.

Next, I will try the LTS kernel and see if the issue disappears.

Thanks for your post and information!

Jingrong

Last edited by crazyboycjr (2018-09-27 18:50:11)


#8 2018-10-22 19:09:44

citrus
Member
Registered: 2018-07-04
Posts: 8

Re: [SOLVED] NVMe Controller Reset in 4.17

I wouldn't want this to become one of those abandoned threads, so here's an update.

The problem is now SOLVED! With some of you pointing out that the issue was likely caused by hardware rather than software, and having exhausted all other conceivable options, I decided to invoke my warranty and have both my SSD and motherboard replaced. Since the repair, none of the symptoms have reappeared. I have been using the laptop for two weeks now without any difficulties.

Here are some observations of mine which might be relevant to people curious about the problem:

1) Since the SSD always worked for a period of time before the symptoms showed up, I managed to perform a complete backup of my LUKS-encrypted system using a USB-booted Linux and dd. After installing the new SSD, I dd'ed everything onto it, verified that the system mounts correctly, chroot'ed in, and reinstalled GRUB. From that point onward, the previous system was fully functional on the new hardware.
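For the curious, the backup step boils down to the pattern below. This is a self-contained sketch: scratch files stand in for the real devices, since on the actual hardware the source was /dev/nvme0n1 (the whole NVMe device, LUKS header included) and the destination a file on mounted USB storage:

```shell
#!/bin/sh
# dd backup + bit-for-bit verification, with scratch files standing in
# for the real block device and the backup image on USB storage.
set -e
src=$(mktemp)    # stand-in for /dev/nvme0n1
img=$(mktemp)    # stand-in for the backup image on the USB drive
dd if=/dev/urandom of="$src" bs=1M count=4 status=none   # fake "disk" contents
dd if="$src" of="$img" bs=1M status=none                 # the backup; restore is the reverse
if cmp -s "$src" "$img"; then status=verified; else status=MISMATCH; fi
echo "backup $status"
# After restoring onto the new SSD: mount the decrypted system, chroot in,
# and reinstall the bootloader (grub-install, then grub-mkconfig).
rm -f "$src" "$img"
```

Verifying the copy before wiping anything is cheap insurance; cmp (or a sha256sum of both sides) catches a truncated or corrupted image immediately.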

2) The warranty repair was done in two on-site visits, dedicated to the motherboard and the SSD respectively. When I booted the laptop between the visits (that is, with the old SSD but the new motherboard), the BIOS could see the SSD in the device list but acted as if no system was installed on it. Could that be because the bootloader is stored somewhere on the motherboard, or because of the issue with the SSD?

Thanks for your input and suggestions. You have steered me in the right direction!

Cheers,
Petr

