I have a 4-disk RAID 6 mdraid array that I've been running for a couple of years (2-3) in my NAS.
In the last few months, I've had repeated instances where, when I boot the NAS or bring it up from sleep (I suspend it when not in use), one of the disks fails to join the array. SMART self tests say the disks are still healthy, and if I re-add the disk with:
sudo mdadm --manage /dev/md127 --re-add /dev/sdb1
... then it recovers fine. But today, it happened twice with the same disk: I booted it up, noticed a disk was missing, re-added it, rebooted after a kernel update, and the same disk was missing again.
Any idea what I need to check? Is this a sign the disks are dying? Or is it just a weird new mdraid bug? Can I adjust the timeout for how long it waits for disks to spin up before calling them failed?
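(For what it's worth, when I say a disk is "missing" I'm just going by the array status, roughly like this:
cat /proc/mdstat
sudo mdadm --detail /dev/md127
... which shows the array active but with only 3 of 4 devices.)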
Here's my dmesg from the current boot: http://0x0.st/Hyce.txt
Offline
SMART self tests say the disks are still healthy
What exactly does it say?
Ideally also please post your complete system journal for the boot:
sudo journalctl -b | curl -F 'file=@-' 0x0.st
sdd shows up 500ms after a, b & c but is not the troublesome device, and there are no further errors or even oddities (about sdb) in the dmesg.
There's no indication that the drive somehow failed to spin up and the raid is built 35s into the boot.
One more thing you could do is try to swap it w/ a "good" device (assuming it's always sdb) to see whether the problem moves (to eg. sdc) or the new disk on the same bus also exhibits the problem.
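If you do swap them, note the serial of the suspect disk first so you can tell whether the problem follows the physical drive or the port - e.g. something like:
sudo smartctl -i /dev/sdb | grep -i serial
ls -l /dev/disk/by-id/ | grep sdb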
Offline
What exactly does it say?
I'm just looking at http://0x0.st/Hya6.txt
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
The first time I noticed it it was /dev/sdd, but this time it was /dev/sdb twice in a row.
Offline
The SMART data looks fine - the drive order isn't necessarily deterministic, so you'll have to check the disk UUID to see whether it's always the same drive.
Since the raid is created late in the boot, what happens in userspace in the meantime, and is there anything interesting in the journal?
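To map the sdX names to actual drives, something like this should do (column names off the top of my head):
lsblk -o NAME,MODEL,SERIAL,UUID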
Offline
It might also be a good idea to run a `smartctl -t long` on the drive. I have multiple Reds and it will take a while. Do not suspend/reboot the system during testing.
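Roughly, per drive:
sudo smartctl -t long /dev/sdb
and once it's done, read the result with:
sudo smartctl -l selftest /dev/sdb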
Offline
Hmm. I did notice a few cases where the journalctl logs showed:
Sep 23 12:33:50 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Sep 25 19:07:41 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Sep 26 19:33:58 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 03 18:22:25 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 04 18:28:23 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 06 18:38:03 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 07 15:30:38 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 12 18:47:04 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 13 20:50:41 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 13 22:13:00 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 13 22:13:24 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
Oct 27 11:31:54 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdb1 from array!
but I suspect that's just subsequent boots, where it found a disk that had been kicked out before and noticed it was out of sync on boot.
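Next time it happens I'll try to compare the event counters and update times of the members before re-adding, with something along the lines of:
sudo mdadm --examine /dev/sd[abcd]1 | grep -E '^/dev|Update Time|Events'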
I also noticed a bunch of warnings that /tmp wasn't empty when mounting the tmpfs on top of it, so I looked inside and found this odd thing:
$ ls -lah rootbind/tmp/
total 16K
drwxrwxrwt 1 root root 272 Oct 28 13:44 ./
drwxr-xr-x 1 root root 134 Sep 24 12:01 ../
brw------- 1 root root 9, 127 Aug 18 19:47 .tmp.md.293:9:127
brw------- 1 root root 9, 127 Sep 17 15:53 .tmp.md.303:9:127
brw------- 1 root root 9, 127 Aug 9 21:05 .tmp.md.305:9:127
brw------- 1 root root 9, 127 Oct 28 13:44 .tmp.md.306:9:127
brw------- 1 root root 9, 127 Aug 11 16:50 .tmp.md.307:9:127
brw------- 1 root root 9, 127 Sep 30 08:38 .tmp.md.308:9:127
brw------- 1 root root 9, 127 Oct 18 18:35 .tmp.md.309:9:127
brw------- 1 root root 9, 127 Sep 10 10:06 .tmp.md.310:9:127
For some reason it's creating block devices under /tmp before /tmp is mounted, which is suspicious, but might be a red herring.
I think the disk names do line up, because every message in dmesg I can find shows that sda1 matches raid disk 0, sdb1 is raid disk 1, etc. However, I have written down the disk identifiers from fdisk in a file so I can compare the next time this happens.
I also notice a lot of these, even for boots where the raid array is healthy:
$ sudo journalctl -b 0 | grep md127
Oct 28 13:44:56 nas-bot.uraxa.lan kernel: md/raid:md127: device sdb1 operational as raid disk 1
Oct 28 13:44:56 nas-bot.uraxa.lan kernel: md/raid:md127: device sda1 operational as raid disk 0
Oct 28 13:44:56 nas-bot.uraxa.lan kernel: md/raid:md127: device sdd1 operational as raid disk 3
Oct 28 13:44:56 nas-bot.uraxa.lan kernel: md/raid:md127: device sdc1 operational as raid disk 2
Oct 28 13:44:56 nas-bot.uraxa.lan kernel: md/raid:md127: raid level 6 active with 4 out of 4 devices, algorithm 2
Oct 28 13:44:56 nas-bot.uraxa.lan systemd[1]: Started Timer to wait for more drives before activating degraded array md127..
Oct 28 13:44:56 nas-bot.uraxa.lan kernel: md127: detected capacity change from 0 to 23441149952
Oct 28 13:44:56 nas-bot.uraxa.lan systemd[1]: mdadm-last-resort@md127.timer: Deactivated successfully.
Oct 28 13:44:56 nas-bot.uraxa.lan systemd[1]: Stopped Timer to wait for more drives before activating degraded array md127..
This is saying it found all 4 disks, started them all up and added them to the raid device, but it's also starting a timer to wait for more drives for a degraded array. But the array is not degraded.
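(I haven't looked into what that last-resort timer actually does; I assume I can inspect it with:
systemctl cat mdadm-last-resort@md127.timer mdadm-last-resort@md127.service
but I'm not sure it's related.)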
Last observation: I see this warning every boot; could it mean something?
Oct 28 13:44:55 nas-bot.uraxa.lan kernel: block device autoloading is deprecated and will be removed.
I've started a long test on all four disks, and will update with findings when I get them.
Thanks for the help so far!
Last edited by scott_fakename (2023-10-28 23:19:32)
Offline
I'm just looking at http://0x0.st/Hya6.txt
Am I missing something? Looks like those tests were over 6,300 hours ago.
$ sudo smartctl -a /dev/sda
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 9817
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3491 -
------------------------------
$ sudo smartctl -a /dev/sdb
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 9961
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3636 -
------------------------------
$ sudo smartctl -a /dev/sdc
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 9640
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3315 -
------------------------------
$ sudo smartctl -a /dev/sdd
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 9941
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 3637 -
Cheers,
Edit -- see below
Last edited by dakota (2023-10-29 19:20:32)
"Before Enlightenment chop wood, carry water. After Enlightenment chop wood, carry water." -- Zen proverb
Online
mdadm.conf? mdadm --examine? (Preferably --examine before --assemble, but assembly happens automatically - unless you spin up an initramfs with mdadm but without the md raid kernel modules.)
mdadm creates /dev/.tmp.md or /tmp/.tmp.md internally; that by itself should not be a cause for concern - lots of programs use temporary files.
If there were a problem with the drives you would see some noise in the logs, but the logs you posted so far are suspiciously silent. So I suspect it's a configuration issue of some kind, or a bug. It would be nice to see the output of mdadm --assemble directly (if you disable the automatic assembly, perhaps by including a 0-byte /etc/udev/rules.d/64-md-raid-assembly.rules - but systemd also has some services like mdadm-last-resort and others).
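Untested sketch - an empty file of the same name in /etc/udev/rules.d/ overrides the packaged rule; after a reboot you can then assemble by hand and watch what it says:
sudo touch /etc/udev/rules.d/64-md-raid-assembly.rules
sudo mdadm --assemble --verbose /dev/md127 /dev/sd[abcd]1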
Last edited by frostschutz (2023-10-29 07:10:41)
Offline
Sep 23 12:33:50 nas-bot.uraxa.lan kernel: md: kicking non-fresh sdd1 from array!
[...]
Oct 28 13:44:56 nas-bot.uraxa.lan systemd[1]: Started Timer to wait for more drives before activating degraded array md127..
Maybe you are looking at the wrong side of the boot process - those messages imply that something during the shutdown of the raid failed or was incomplete. You'll need to "catch" a system log of a shutdown preceding each case.
Last edited by -thc (2023-10-29 10:10:24)
Offline
I'd highly recommend not grepping around for random tokens in the journal but looking at the startup of the raid in full context.
Offline
Am I missing something? Looks like those tests were over 6,300 hours ago.
You're right, good catch - those were very out of date. I reran them, and the fresh SMART self-tests come back OK: http://0x0.st/HyjL.txt
I'd highly recommend not grepping around for random tokens in the journal but looking at the startup of the raid in full context.
I wasn't - I was looking at it all in context but didn't see anything else specifically mentioning the disks, and only posted the journalctl lines that looked the most suspicious. Here's a complete journalctl log from a bad boot: http://0x0.st/HyjI.txt, here's the journal from the boot before the most recent 3-of-4-disk boot: http://0x0.st/Hyeo.txt, and here's one from a clean, warm reboot: http://0x0.st/Hyjl.txt
Maybe you are looking at the wrong side of the boot process - those messages imply that something during the shutdown of the raid failed or was incomplete. You'll need to "catch" a system log of a shutdown preceding each case.
Interesting. I wonder if what's happening is that the disks spin down due to idleness, and then when I power the machine off, it finishes shutting down before the disk spins back up to be properly closed. The next time it boots, it finds the disk out of sync and kicks it. Something to think about.
mdadm.conf? mdadm --examine? (Preferably --examine before --assemble, but assembly happens automatically - unless you spin up an initramfs with mdadm but without the md raid kernel modules.)
My mdadm.conf contains mostly just a bunch of commented-out lines; the only non-empty, non-comment line is:
DEVICE partitions
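In case it helps, I could also pin the array explicitly by appending the ARRAY line that
sudo mdadm --detail --scan
prints to /etc/mdadm.conf, but I haven't done that yet.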
Last edited by scott_fakename (2023-10-29 17:32:46)
Offline
Interesting. I wonder if what's happening is that the disks spin down due to idleness, and then when I power the machine off, it finishes shutting down before the disk spins back up to be properly closed. The next time it boots, it finds the disk out of sync and kicks it. Something to think about.
Interesting theory. Can you disable power saving?
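E.g. something like this per drive (whether it survives a power cycle depends on the drive):
sudo hdparm -B 255 -S 0 /dev/sdb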
Offline
https://man.archlinux.org/man/core/hdparm/hdparm.8.en#C
But
Oct 28 13:43:39 nas-bot.uraxa.lan kernel: XFS (dm-1): Unmounting Filesystem b15d9f35-85cb-44c1-ad81-bd0a6e3b4b62
Oct 28 13:44:12 nas-bot.uraxa.lan kernel: md127: detected capacity change from 23441149952 to 0
Oct 28 13:44:12 nas-bot.uraxa.lan kernel: md: md127 stopped.
Oct 28 13:44:21 nas-bot.uraxa.lan systemd-logind[399]: The system will reboot now!
And there's really nothing in the journal suggesting anything is wrong w/ any drive; besides the self-tests, there are also no signs of trouble (reallocated/pending sectors) in the SMART data.
You could also try to explicitly umount/stop the raid well before the shutdown.
Likewise, don't automount it, but only mount it explicitly well after the boot.
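Roughly something like this before powering off (the mount point is a placeholder, and whatever sits on top of the raid - the dm-1 device in your log - would have to be deactivated in between):
sudo umount /mnt/raid
sudo mdadm --stop /dev/md127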
Also, stupid question:
What if you boldly mask mdadm-last-resort@md127.timer and mdadm-last-resort@md127.service?
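ie. something like:
sudo systemctl mask mdadm-last-resort@md127.timer mdadm-last-resort@md127.service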
Edit: likely unrelated, but you're running dhcpcd and systemd-networkd concurrently; pick one, disable the other.
Last edited by seth (2023-10-30 07:53:12)
Offline