Hello, folks! I am having some trouble with MD randomly removing a partition from one of my RAID-1 arrays. So far this has happened three times over the past couple of weeks, but only at boot.
I have two 3TB WD SATA hard disks containing two RAID-1 volumes set up as follows:
/dev/sda1 FAT32 EFI System Partition (250MiB)
/dev/sdb1 Reserved Space (250MiB)
/dev/sda2 Disk 0 of /dev/md0 with v0.90 metadata (250MiB)
/dev/sdb2 Disk 1 of /dev/md0 with v0.90 metadata (250MiB)
/dev/sda3 swap partition 0 encrypted with dm-crypt mapped to /dev/mapper/swapA (8GiB)
/dev/sdb3 swap partition 1 encrypted with dm-crypt mapped to /dev/mapper/swapB (8GiB)
/dev/sda4 Disk 0 of /dev/md1 with v1.2 metadata (2750GiB)
/dev/sdb4 Disk 1 of /dev/md1 with v1.2 metadata (2750GiB)
/dev/md1 is encrypted with LUKS and mapped to /dev/mapper/root
/dev/mapper/root is the only volume in the LVM volume group rootvg
the volume group rootvg currently has three logical volumes
logical volume root is mapped to /dev/mapper/rootvg-root and is the file system root
logical volume home is mapped to /dev/mapper/rootvg-home and is mounted /home
logical volume var is mapped to /dev/mapper/rootvg-var and is mounted /var
I also wrote this initcpio hook (I call it 'secdec'), with a bit of help from the Arch Wiki. It opens the LUKS volume as /dev/mapper/root, optionally using an openssl-encrypted key file on a USB memory stick:
run_hook ()
{
local keyCopyDec keyCopyEnc keyMountPoint maxTries retryDelay \
shutdownOnFail
# customizable ############################################################
keyCopyDec="/crypto_keyfile.bin" # temp storage for decrypted key data
keyCopyEnc="/crypto_keyfile.enc" # temp storage for encrypted key data
keyMountPoint="/ckey" # key storage device mount point
maxTries=3 # max number of decrypt attempts
retryDelay=2 # delay in seconds between retries
shutdownOnFail=0 # shut down computer if decrypt fails
#+0=yes, 1=no
# /customizable ###########################################################
local abortMsg passPromptKey passPromptVol passWrong secdecFormat \
shutdownMsg
local E_NOFILE
local KEY NOKEY SSLKEY
local CSQUIET OIFS
local cryptDev cryptName keyDev keyFile keyFs keyType pass passPrompt \
success tries
abortMsg="Aborting..."
passPromptKey="Key passphrase: "
passPromptVol="LUKS passphrase: "
passWrong="Invalid passphrase."
secdecFormat="secdec=cryptdev:dmname:keydev:keyfs:keyfile"
shutdownMsg="Shutting down computer. You may try again later."
E_NOFILE=66
KEY=1
NOKEY=0
SSLKEY=2
OIFS=$IFS
[ "$(echo "${quiet}" | awk '{print tolower($0)}')" == "y" ] && \
CSQUIET=">/dev/null 2>&1"
###########################################################################
askForBooleanInput ()
# Ask the user for boolean input.
#
# $1: The question to pose.
#
# $2: A string containing single characters, separated by spaces, which are
#+keys pressed that would return boolean true (0).
#
# $3: A string containing single characters, separated by spaces, which are
#+keys pressed that would return a boolean false (1).
#
# Returns 0 if the user presses a key that generates a character found in
#+$2.
#
# Returns 1 if the user presses a key that generates a character found in
#+$3.
#
# Returns 2 if an incorrect number of parameters was provided.
{
local keyin
[ ${#} -ne 3 ] && return 2
echo -n "$1"
while true; do
read -sn1 keyin
case "$keyin" in
[$2]) echo "$keyin"; keyin=0; break;;
[$3]) echo "$keyin"; keyin=1; break;;
*) echo -n -e "\a";;
esac
done
return $keyin
}
###########################################################################
askForPass ()
# Ask the user to enter a pass{word|phrase}.
#
# $1: The prompt to display.
#
# $2: The name of the variable to assign the input pass{word|phrase} to.
#+For example, to assign to $pass, $2 should be "pass".
#
# Returns 0 on success or 1 if there is a parameter error.
{
[ ${#} -ne 2 ] || [ -z "${1}" ] || [ -z "${2}" ] && return 1
read -rsp "$1" "$2"
echo
}
###########################################################################
isSsl ()
# Examine a file for indications that it is SSL encrypted.
#
# $1: Path to the key file to examine.
#
# Returns 0 if the file appears to be SSL encrypted, 1 if the file does not
#+appear to be SSL encrypted and $E_NOFILE if $1 is not a regular file.
{
[ ! -f "${1}" ] && return $E_NOFILE
[ "$(dd if="${1}" bs=1 count=8 2>/dev/null | \
awk '{print tolower($0)}')" == "salted__" ]
}
###########################################################################
getKey ()
# Attempt to find and copy a key from $keyDev to $keyCopyEnc.
#
# $1: Path to the device containing the key data.
#
# $2: Name of file system containing the key file.
#
# $3: Path to the key file, relative to $4.
#
# $4: Mount point of partition containing key file.
#
# $5: Path to temporary copy of key file.
#
# $6: Boolean value indicating whether to allow user the opportunity to
#+switch key devices before attempting to find a key. This is useful if
#+a key has already been tried and failed. The user could switch memory
#+devices before trying again. 0=true, 1=false; default is false.
#
# Returns one of $KEY, $NOKEY or $SSLKEY depending on what was found.
{
local result wait
if [ -z "${6}" ] || [ ${6} -eq 1 ]; then wait=1; else wait=0; fi
mkdir -p "$4" >/dev/null 2>&1
while true; do
if [ ${wait} -eq 0 ]; then
askForBooleanInput \
"(S)earch for key or (R)evert to LUKS passphrase? " "s S" "r R"
if [ ${?} -eq 0 ]; then result=$KEY; else result=$NOKEY; fi
wait=1
else
result=$KEY
fi
if [ ${result} -eq ${KEY} ]; then
if poll_device "${1}" ${rootdelay}; then
mount -r -t "$2" "$1" "$4" >/dev/null 2>&1
dd if="$4/$3" of="$5" >/dev/null 2>&1
umount "$4" >/dev/null 2>&1
if [ -f "${5}" ]; then
isSsl "${5}" && result=$SSLKEY
else
err "Key $3 not found."
unset result
wait=0
fi
else
err "Key device $1 not found."
unset result
wait=0
fi
fi
[ -n "${result}" ] && break
done
return $result
}
###########################################################################
# If the secdec kernel parameter was not specified, inform the user, but
#+allow init to continue in case another hook will work.
if [ -z "${secdec}" ]; then
echo "Missing parameter: $secdecFormat"
return 0
fi
# Make sure required kernel modules are available.
if ! /sbin/modprobe -a -q dm-crypt >/dev/null 2>&1 || \
[ ! -e "/sys/class/misc/device-mapper" ]; then
err "Required kernel modules not available."
err "$abortMsg"
exit 1
fi
if [ ! -e "/dev/mapper/control" ]; then
mkdir -p "/dev/mapper" >/dev/null 2>&1
mknod "/dev/mapper/control" c \
$(cat /sys/class/misc/device-mapper/dev | sed 's|:| |') >/dev/null 2>&1
fi
# Parse the secdec kernel parameter, check its format, make sure $cryptDev
#+is available, and that it contains a LUKS volume.
IFS=:
read cryptDev cryptName keyDev keyFs keyFile <<EOF
$secdec
EOF
IFS=$OIFS
if [ $(echo "${secdec}" | awk -F: '{print NF}') -ne 5 ] || \
[ -z "${cryptDev}" ] || [ -z "${cryptName}" ]; then
err "Verify parameter format: $secdecFormat"
err "$abortMsg"
exit 1
fi
if ! poll_device "${cryptDev}" ${rootdelay}; then
err "Device $cryptDev not available."
err "$abortMsg"
exit 1
fi
# Inform the user that $cryptDev doesn't contain a LUKS volume, but allow
#+init to continue, in case another hook can handle this.
if ! /sbin/cryptsetup isLuks "${cryptDev}" >/dev/null 2>&1; then
echo "Device $cryptDev does not contain a LUKS volume."
return 0
fi
# Attempt to open the LUKS volume.
tries=0
unset keyType
while true; do
success=1
# Attempt to copy a decryption key.
if [ -z ${keyType} ]; then
getKey "$keyDev" "$keyFs" "$keyFile" "$keyMountPoint" \
"$keyCopyEnc" 1
keyType=$?
elif [ ${keyType} -eq ${KEY} ]; then
getKey "$keyDev" "$keyFs" "$keyFile" "$keyMountPoint" \
"$keyCopyEnc" 0
keyType=$?
elif [ ${keyType} -eq ${SSLKEY} ]; then
if askForBooleanInput "(U)se a different key or (T)ry again? " \
"u U" "t T"; then
getKey "$keyDev" "$keyFs" "$keyFile" "$keyMountPoint" \
"$keyCopyEnc" 0
keyType=$?
fi
fi
# Open the LUKS volume.
if [ ${keyType} -eq ${NOKEY} ]; then
askForPass "$passPromptVol" "pass"
/sbin/cryptsetup luksOpen "$cryptDev" "$cryptName" "$CSQUIET" <<EOF
$pass
EOF
success=$?
[ ${success} -ne 0 ] && err "$passWrong"
else
if [ ${keyType} -eq ${SSLKEY} ]; then
askForPass "$passPromptKey" "pass"
/sbin/openssl aes256 -pass pass:"$pass" -d -in "$keyCopyEnc" \
-out "$keyCopyDec" >/dev/null 2>&1
if [ ${?} -ne 0 ]; then
rm -f "$keyCopyDec" >/dev/null 2>&1
err "$passWrong"
fi
else
mv "$keyCopyEnc" "$keyCopyDec" >/dev/null 2>&1
fi
if [ -f "${keyCopyDec}" ]; then
/sbin/cryptsetup --key-file "$keyCopyDec" \
luksOpen "$cryptDev" "$cryptName" "$CSQUIET"
success=$?
fi
fi
[ ${success} -ne 0 ] && err "Failed to open LUKS volume."
tries=$(( $tries + 1 ))
[ ${tries} -ge ${maxTries} ] || [ ${success} -eq 0 ] && break
sleep "$retryDelay"
done
if [ ${success} -eq 0 ]; then
if [ ! -e "/dev/mapper/${cryptName}" ]; then
err "LUKS volume was opened, but failed to map to $cryptName."
err "$abortMsg"
exit 1
fi
echo "LUKS volume opened."
else
if [ ${shutdownOnFail} -eq 0 ]; then
echo "$shutdownMsg"
poweroff -f
fi
exit 1
fi
}
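To make the expected parameter format concrete, here is a minimal sketch of how a secdec= value splits into its five fields, using the same IFS-based split as the hook. The function name split_secdec is made up for illustration, and the key device, file system, and key path in the usage line are placeholders, not values from this setup.

```shell
# Split a "cryptdev:dmname:keydev:keyfs:keyfile" string on ':' and print
# one field per line. Mirrors the read/IFS approach used in the hook.
split_secdec() {
    old_ifs=$IFS
    IFS=:
    read -r cryptDev cryptName keyDev keyFs keyFile <<EOF
$1
EOF
    IFS=$old_ifs
    printf '%s\n' "$cryptDev" "$cryptName" "$keyDev" "$keyFs" "$keyFile"
}
```

For example, split_secdec "/dev/md1:root:/dev/sdc1:vfat:keys/root.enc" would print the five fields in order, with /dev/sdc1, vfat, and keys/root.enc being hypothetical.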
The failing array is /dev/md1 and mdadm is reporting the following:
mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed May 30 18:50:05 2012
Raid Level : raid1
Array Size : 2921467179 (2786.13 GiB 2991.58 GB)
Used Dev Size : 2921467179 (2786.13 GiB 2991.58 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Update Time : Wed Sep 12 03:34:52 2012
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : archiso:1
UUID : 8ad37e84:f7261906:da3d317e:24080362
Events : 44661
Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 0 0 1 removed
mdadm --examine /dev/sda4
/dev/sda4:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 8ad37e84:f7261906:da3d317e:24080362
Name : archiso:1
Creation Time : Wed May 30 18:50:05 2012
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 5842934631 (2786.13 GiB 2991.58 GB)
Array Size : 2921467179 (2786.13 GiB 2991.58 GB)
Used Dev Size : 5842934358 (2786.13 GiB 2991.58 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : abba1dc3:3fadf7a7:be452bb5:b8bbe97b
Update Time : Wed Sep 12 03:37:48 2012
Checksum : aad3e44b - correct
Events : 44729
Device Role : Active device 0
Array State : A. ('A' == active, '.' == missing)
mdadm --examine /dev/sdb4
/dev/sdb4:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 8ad37e84:f7261906:da3d317e:24080362
Name : archiso:1
Creation Time : Wed May 30 18:50:05 2012
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 5842934631 (2786.13 GiB 2991.58 GB)
Array Size : 2921467179 (2786.13 GiB 2991.58 GB)
Used Dev Size : 5842934358 (2786.13 GiB 2991.58 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 09a09f49:b329feaa:3341111b:47b484fe
Update Time : Wed Sep 12 01:50:34 2012
Checksum : 1cdc19c0 - correct
Events : 42869
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)
cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda2[0] sdb2[1]
204736 blocks [2/2] [UU]
md1 : active raid1 sda4[0]
2921467179 blocks super 1.2 [2/1] [U_]
unused devices: <none>
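For spotting this state quickly after a boot, a small sketch that picks degraded arrays out of /proc/mdstat-style text. It assumes the layout shown above, where the status line after each array ends in a bracketed map such as [UU] or [U_]; the function name degraded_arrays is made up for illustration.

```shell
# Print the name of every md array whose member map contains a '_'
# (a missing member). Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember which array we are in
        /\[[U_]*_[U_]*\]/ && name != "" {      # member map with at least one "_"
            print name
            name = ""
        }
    '
}
```

It can be fed the live file with degraded_arrays < /proc/mdstat; on the output above it should name only md1.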
This is md log info of the first reboot after my first recovery:
Sep 9 20:18:03 localhost kernel: [ 1.784225] md: raid1 personality registered for level 1
Sep 9 20:18:03 localhost kernel: [ 2.552971] md: md1 stopped.
Sep 9 20:18:03 localhost kernel: [ 2.553418] md: bind<sdb4>
Sep 9 20:18:03 localhost kernel: [ 2.553574] md: bind<sda4>
Sep 9 20:18:03 localhost kernel: [ 2.554080] md/raid1:md1: active with 2 out of 2 mirrors
Sep 9 20:18:03 localhost kernel: [ 2.554093] md1: detected capacity change from 0 to 2991582391296
Sep 9 20:18:03 localhost kernel: [ 2.566266] md1: unknown partition table
Sep 9 20:18:03 localhost kernel: [ 2.617922] md: md0 stopped.
Sep 9 20:18:03 localhost kernel: [ 2.618382] md: bind<sdb2>
Sep 9 20:18:03 localhost kernel: [ 2.618525] md: bind<sda2>
Sep 9 20:18:03 localhost kernel: [ 2.619175] md/raid1:md0: active with 2 out of 2 mirrors
Sep 9 20:18:03 localhost kernel: [ 2.619203] md0: detected capacity change from 0 to 209649664
Sep 9 20:18:03 localhost kernel: [ 10.933334] md0: unknown partition table
And this is the next time I rebooted:
Sep 10 19:59:07 localhost kernel: [ 1.780481] md: raid1 personality registered for level 1
Sep 10 19:59:07 localhost kernel: [ 2.806037] md: md1 stopped.
Sep 10 19:59:07 localhost kernel: [ 2.806345] md: bind<sda4>
Sep 10 19:59:07 localhost kernel: [ 2.806888] md/raid1:md1: active with 1 out of 2 mirrors
Sep 10 19:59:07 localhost kernel: [ 2.806898] md1: detected capacity change from 0 to 2991582391296
Sep 10 19:59:07 localhost kernel: [ 2.820308] md1: unknown partition table
Sep 10 19:59:07 localhost kernel: [ 2.956599] md: md0 stopped.
Sep 10 19:59:07 localhost kernel: [ 2.957149] md: bind<sdb2>
Sep 10 19:59:07 localhost kernel: [ 2.957269] md: bind<sda2>
Sep 10 19:59:07 localhost kernel: [ 2.958086] md/raid1:md0: active with 2 out of 2 mirrors
Sep 10 19:59:07 localhost kernel: [ 2.958100] md0: detected capacity change from 0 to 209649664
Sep 10 19:59:07 localhost kernel: [ 11.742281] md0: unknown partition table
In between these two boots there are no reports of md failures. For some reason, it's just dropping the second partition.
I did a restoration earlier today, and on the very next boot md refused to use /dev/sdb4. Once I had booted, I checked the update times (not the ones listed above), and /dev/sda4 and /dev/sdb4 were about 4 minutes apart. Since it takes only about a minute for Arch to reboot, including me typing my openssl key password, Arch must have been running for about 3 minutes without updating /dev/sdb4. I'm assuming this is of some significance, since /dev/md0 reports perfect synchronization.
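The same comparison can be made with the Events counters from mdadm --examine, which are easier to compare than timestamps. Below is a rough sketch, assuming the "Events : N" line format shown in the output above; the function names md_events and md_event_gap are made up for illustration.

```shell
# Extract the Events counter from `mdadm --examine` output read on stdin.
md_events() {
    awk -F' *: *' '/^ *Events *:/ { print $2; exit }'
}

# Print the absolute difference between two event counts.
# $1, $2: event counts of the two members.
md_event_gap() {
    echo $(( $1 > $2 ? $1 - $2 : $2 - $1 ))
}
```

Piping mdadm --examine /dev/sda4 and mdadm --examine /dev/sdb4 through md_events gives the two counts. For the values listed above (44729 and 42869) the gap is 1860, which is consistent with /dev/sdb4 having stopped receiving updates well before the array was assembled again.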
All of this has been working very well for me for about 6 months now. Both hard drives, which I bought at the same time, are about 9 months old. I checked both drives using smartctl, and both report SMART enabled and passed. SMART attribute data doesn't make a lot of sense to me and I haven't looked up the format, but the reallocation event value is the same as the day I bought the drives, so I'm kind of assuming things are ok there, or at least bad sectors aren't being created.
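For reading the SMART attribute table, a minimal sketch that pulls out the attributes usually associated with failing media. It assumes the usual ten-column smartctl -A table with the raw value in the last column; the function name smart_media_attrs is made up, and which attributes a given drive reports varies by model.

```shell
# Print attribute name and raw value for the media-health attributes.
# Reads `smartctl -A` output on stdin.
smart_media_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ {
        print $2, $10
    }'
}
```

Running smartctl -A /dev/sda and piping it through smart_media_attrs, then doing the same for /dev/sdb, makes it easy to see whether any of these raw values is non-zero or growing between checks.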
I hope I've provided all required details here. Any help would be appreciated.
Last edited by cng1024 (2012-09-13 00:02:43)
It would seem that nobody has any ideas about why this may be happening. I've done a complete diagnostic of both hard disks and both passed. I then downloaded the latest iso, formatted and reinstalled, but that failed after one reboot.
I have a theory as to why it may be happening, but I haven't tested it yet. Perhaps someone can tell me if I may be on the right track here. I got to wondering what happens during shutdown. The root fs is remounted read-only during shutdown, but it does remain mounted, which means everything under it (my logical volumes, LUKS and RAID) is still open when the system halts. I saved my original configs before I reformatted, and it turns out I forgot to add the shutdown hook to the initcpio image, which means my RAID wasn't being stopped before halt.
I'm going to try a few experiments to see if adding the shutdown hook makes a difference. Hopefully I'm right, and I'll update either way, but I'd still appreciate it if someone with a bit more experience could weigh in on this.
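For reference, a sketch of what the change in /etc/mkinitcpio.conf might look like. The hook list below is an assumption for illustration only; the only points that matter are that the custom secdec hook runs after mdadm and before lvm2, and that shutdown is present.

```shell
# /etc/mkinitcpio.conf (fragment); hook list is illustrative, not my exact one.
# "shutdown" is the hook that lets the initramfs tear the stack down at halt.
HOOKS="base udev autodetect sata usbinput keymap mdadm secdec lvm2 filesystems shutdown"
```

After editing the file, the image has to be regenerated, for example with mkinitcpio -p linux, before the change takes effect on the next boot.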