[RESOLVED] HD issue on shutdown with kernel 4.8

jmx · 2016-11-10 09:52:45

Hi,

I have Arch installed on a second HD in my elderly MacBookPro (late 2010 13" model). The HD is installed in a HD caddy in place of the by-now-broken optical drive. I never had issues with this setup until the upgrade to kernel 4.8.

Since that kernel update, the HD seems to be powered off too early in the shutdown/reboot process which is clearly audible as a 'clonk', and the following lines appear in the shutdown log at that point in time:

ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozenbd204ed5c
ata2: irq_stat 0x00400000, PHY RDY changed
ata2: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }

A little more information about my setup:

$ dmesg | grep ata2 | head
[    2.508533] ata2: SATA max UDMA/133 abar m8192@0xd3484000 port 0xd3484180 irq 26
[    2.824504] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.825383] ata2.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[    2.825543] ata2.00: ATA-8: Hitachi HTS545032B9SA02, PB3AC60W, max UDMA/133
[    2.825545] ata2.00: 625142448 sectors, multi 0: LBA48 NCQ (depth 31/32)
[    2.826577] ata2.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[    2.826745] ata2.00: configured for UDMA/133
$ readlink /dev/disk/by-id/*HTS545032B9SA02* | head -n1
../../sdb
$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sdb      8:16   0 298,1G  0 disk
├─sdb2   8:18   0 268,1G  0 part /home
└─sdb1   8:17   0    30G  0 part /
sda      8:0    0 931,5G  0 disk
├─sda2   8:2    0 930,7G  0 part
├─sda3   8:3    0 619,9M  0 part
└─sda1   8:1    0   200M  0 part /boot

So ata2 is /dev/sdb which houses my Linux root and home partitions. (/dev/sda is the MacOS HD).

The machine shuts down or reboots a few seconds later, but I don't think that this is intended, given that the issue does not appear when running linux-lts.
To sum up, it seems that /dev/sdb is powered off too early on kernel 4.8 (while it is still spinning and probably still writing data, e.g. likely loss of data or worse).

Any hints what could be the problem here?

Last edited by jmx (2016-12-06 21:26:05)

R00KIE · 2016-11-10 13:45:28

You should provide the full output of dmesg so people can see the whole picture, providing only snippets lacks all the context of what happens before and after and also other messages that might help find the problem.

jmx · 2016-11-10 15:07:54

I created two logfiles using

$ journalctl -b -1 -nall > kernel4x-log.txt

each done after a short session running linux-lts (4.4.30) and linux (4.8.6):

https://dl.dropboxusercontent.com/u/367 … 44-log.txt
https://dl.dropboxusercontent.com/u/367 … 48-log.txt

I tried to follow the "Diagnosing Shutdown Problems" advice on https://freedesktop.org/wiki/Software/s … /#index2h1 , but onfortunately, the debug script did not create the file, neither on 4.4 nor on 4.8 (both files above were created while running with the options system.log_level=debug system.log_target=kmsg log_bug_len=1M enforcing=0, as suggested there).

I hope that this information helps.

Best,
jmx.

jmx · 2016-11-11 12:04:00

Just a quick note: The issue is still there in Kernel 4.8.7.

R00KIE · 2016-11-11 22:35:30

That's a lot of output, but from what I can see it is systemd that is complaining not the kernel itself. Can to you try to boot only to your login manager or even just multi-user.target and then rebooting or shutting down and check if you still have problems?

Have you done any changes to increase power savings or are you running any power management daemons?

jmx · 2016-11-12 11:27:38

I tried both, but neither booting only into gdm nor just into multi-user.target made the issue go away.

I could not remember installing let alone tweak any power management daemon, so I grepped my journalctl manually. The only daemon with power in its name is upower. It is only running when booting into gdm or gnome. But since the issue is still there even if I'm not starting gdm, upower can't be the culprit, or am I wrong?

R00KIE · 2016-11-12 14:33:37

I'd say that upower shouldn't be the culprit. I'd be more inclined to say that it _might_ be something from systemd related to disk/partition detection/management but I have no idea of what it could be and I don't recall having seen anyone complain of a similar problem before.

The only sensible advice I can give you is to try using the Hitachi disk in another machine that is not showing that problem and see what happens, or using another disk in the place of the Hitachi disk, maybe that can provide some clues as to what is wrong.

You could also try asking in the systemd mailing list and see what they say about this problem.

jmx · 2016-11-12 14:40:44

It was my first thought as well that it might have something to do with how systemd sequences the shutdown. But what stumbled me was that it happens with linux 4.8 but not with linux 4.4 (LTS). I'm not that familiar with systemd yet, but I can't really imagine how just swapping the kernel could make systemd change the order of the shutdown events?!

Since I don't have another machine at hand where I could just put the HD in for a test spin, I think I will try asking the systemd guys about this issue.

Thank you for your help so far, R00KIE!

R00KIE · 2016-11-12 16:00:24

If changing only the kernel makes the problem go away it points to a kernel problem. You could still ask on the systemd mailing list as the problem could be due to some interaction between systemd and newer kernels (different code paths?) but I'd say you should also bisect the kernel and try to find out which commit makes that problem show up.

seth · 2016-11-13 09:10:35

Adding two wild guesses:
pass "nodisconnect" to the kernel parameters and set /sys/class/scsi_host/host*/link_power_management_policy to "max_performance"

It seems the disk is powered down when removing all mounts and thus systemd fails on late accesses to the device.

jmx · 2016-11-21 09:42:50

Unfortunately, the "nodisconnect" parameter had no effect, and link_power_management_policy already was set to max_performance for all scsi_hosts in my system that had that setting.

jmx · 2016-11-26 12:14:15

Just for reference: I posted the issue both on

systemd-devel: https://lists.freedesktop.org/archives/ … 37882.html

and on

linux-scsi: https://marc.info/?l=linux-scsi&m=148000826102985&w=2

Lennart Poettering was sure that his had to be a kernel issue, so I went for linux-scsi, but received no reply so far.

mich41 · 2016-11-26 13:30:21

Remove quiet from kernel command line and add loglevel=7. Maybe something more will appear. Do you know what is systemd doing immediately before and after the disk loses power?

edit: Forgot to add: if you can't save debug logs to HDD, try saving them to some pendrive.

Last edited by mich41 (2016-11-26 13:38:57)

jmx · 2016-11-30 14:32:18

I finally got around to get mich41's suggestions to work and save the dmesg output (loglevel=7 as kernel parameter) on a USB pen drive, using a script in /usr/lib/systemd/system-shutdown. Here are the new log files:

https://dl.dropboxusercontent.com/u/367 … wn-log.txt
https://dl.dropboxusercontent.com/u/367 … wn-log.txt

But when going through the log for kernel 4.8, I'm now running into log rate limiting issue during shutdown, exactly where it gets interesting:

[  243.611976] systemd-shutdow: 738 output lines suppressed due to ratelimiting
...
[  245.293951] systemd-journal: 61 output lines suppressed due to ratelimiting

In the log file for 4.4, I had this issue, too, at first, but after I'd explicitely set

RateLimitIntervalSec=0
RateLimitBurst=0

in /etc/systemd/journald.conf, dmesg printed everything. But on kernel 4.8, there are still rate limits at work. How can I turn them off?

jmx · 2016-12-01 09:14:10

Ok, since systemd-journald refuses to honor the rate limit configuration I set in /etc/systemd/journald.conf (I will open another thread for this issue soon), I'm now back at transcribing screen photographs of the messages posted during shutdown... hello 2016 :-/

At least, there are some more messages due to loglevel=7 now. It all happens during the phase when systemd is busy unmounting all the file systems, so here we go:

[  258.915836] shutdown[1]: Unmounting file systems.
[  258.916251] shutdown[1]: Remounting '/' read-only with options 'mode=755'.
[  258.916429] shutdown[1]: Remounting '/oldroot/sys/kernel/config' read-only with options ''.
[  258.917713] shutdown[1]: Unmounting /oldroot/sys/kernel/config.
[  258.923546] shutdown[1]: Remounting '/oldroot/dev/hugepages' read-only with options ''.
[  258.923775] shutdown[1]: Unmounting /oldroot/dev/hugepages.
[  258.940092] shutdown[1]: Remounting '/oldroot/sys/kernel/debug' read-only with options ''.
[  259.337170] EXT4-fs (sdb1): re-mounted. Opts: data=ordered
[  259.457632] EXT4-fs (sdb1): re-mounted. Opts: data=ordered
[  259.524764] kvm: exiting hardware virtualization  <*** 'CLANK!' audible HERE ***>
[  260.249501] ata2: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[  260.249695] ata2: irq_stat 0x00400000, PHY RDY changed
[  260.249952] ata2: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[  260.251441] ata2: hard resetting link
[  260.963398] ata2: SATA link down (SStatus 0 SControl 300)
[  261.068403] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[  266.080099] ata2: hard resetting link
[  266.394133] ata2: SATA link down (SStatus 0 SControl 300)
[  266.394369] ata2: limiting SATA link speed to 1.5Gbps
<*** A few seconds later, there are 2-3 lines more appearing, but I never managed to capture them on picture, as the machine reboots right afterwards. ***>

As the 'CLANK!' comes shortly after "kvm: exiting hardware virtualization", can this be the part in the kernel that's causing my issue? Any ideas?

Last edited by jmx (2016-12-01 09:16:11)

mich41 · 2016-12-01 20:09:31

This kvm message is printed only when the kernel itself is shutting down. Since telling the kernel to shut down should be the last thing systemd does, this problem seems to be indeed kernel's fault and not triggered by systemd or any other userspace daemon.

If you want to be 100% sure, run poweroff -f (which simply requests kernel shutdown without telling systemd or anybody else anything) and see if it produces the same effect. You need to be on TTY, not X, to see the kernel error messages and you may lose unsaved files.

This SATA link error suggests that the disk isn't merely programmed to shut down but completely loses power. I think the problem is caused by some power management thing (ACPI? UEFI?) and not the SCSI subsystem, so you may want to ask on other mailing lists.

I don't know if you are risking data loss. It seems common sense to flush disk caches when remounting readonly, so it's possible that from the moment systemd remounts / all data are safely on disk. But to be really sure, you can power down the disk normally in this shutdown script by adding the following:

echo 1 > /sys/block/sdb/device/delete

Not a pretty solution, but gets the job done

As for rate limiting, this is happening in the kernel. To disable:

echo 0 > /proc/sys/kernel/printk_ratelimit

jmx · 2016-12-06 14:34:47

Thanks mich41 for your suggestions and hints.

By chance I came across a bug report being mentioned on linux-scsi yesterday that sounded familiar:

https://bugzilla.kernel.org/show_bug.cgi?id=187061

I did a test build of kernel 4.8.12 this morning, with the patch from the bug report applied, and it resolved my issue.

I already posted a note to the bug report and to linux-scsi that I'd like to see a backport to kernel 4.8. Let's hope that they'll take notice!

mich41 · 2016-12-06 18:02:14

Probably will be ignored because it has nothing to do with storage and not all lists have people with time and will to correct random stranger's mistakes. This really should go to people who work on this ACPI stuff, so linux-acpi or this bug report.

It also is not clear if this patch will be reverted in stable releases at all because it looks like it was fixing some annoying BIOS popups for somebody else.

jmx · 2016-12-06 21:23:15

Well, it was reverted in 4.9-rc7. So I expect it to be in stable 4.9. I was just hoping that it might find its way into 4.8 as well, as it might still take some time until 4.9 is shipped for Arch.

Thanks for your advice that linux-acpi might be the better place instead of linux-scsi. But since I posted a note to the bug report already, I think I won't bother them anymore. Either they do fix it or I'll wait until 4.9 hits stable.

Arch Linux

#1 2016-11-10 09:52:45

[RESOLVED] HD issue on shutdown with kernel 4.8

#2 2016-11-10 13:45:28

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#3 2016-11-10 15:07:54

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#4 2016-11-11 12:04:00

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#5 2016-11-11 22:35:30

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#6 2016-11-12 11:27:38

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#7 2016-11-12 14:33:37

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#8 2016-11-12 14:40:44

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#9 2016-11-12 16:00:24

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#10 2016-11-13 09:10:35

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#11 2016-11-21 09:42:50

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#12 2016-11-26 12:14:15

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#13 2016-11-26 13:30:21

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#14 2016-11-30 14:32:18

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#15 2016-12-01 09:14:10

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#16 2016-12-01 20:09:31

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#17 2016-12-06 14:34:47

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#18 2016-12-06 18:02:14

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

#19 2016-12-06 21:23:15

Re: [RESOLVED] HD issue on shutdown with kernel 4.8

Board footer