I have a server with a ZFS raidz2 storage pool containing 4 hard drives. There are no spare ports, and Arch Linux is on an external boot device (a USB drive).
Yesterday one of the physical hard drives failed rather suddenly (lots of read/write errors showed up in dmesg). ZFS resilvered the data onto the three remaining drives and removed the failed one from the pool. zpool status reports no data loss, even though the pool is in a degraded state now:
  pool: zfsdatapool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 1.36M in 0h19m with 0 errors on Wed Oct  1 03:59:16 2014
config:

        NAME                                 STATE     READ WRITE CKSUM
        zfsdatapool                          DEGRADED     0     0     0
          raidz2-0                           DEGRADED     0     0     0
            ata-ST2000DM001-1CH164_W1E3XC6P  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_W1E3XBWV  UNAVAIL      0     0     0
            ata-ST2000DM001-1CH164_W1E3XC4D  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_W1E3XCK2  ONLINE       0     0     0

errors: No known data errors
I ordered a new hard drive and will replace it later today. My question is how to do it: I am rather new to ZFS and have never done this before. According to http://zfsonlinux.org/msg/ZFS-8000-9P/,
after physically swapping in the new drive for the failed one, all I need to do is issue:
# zpool replace zfsdatapool sdb
ZFS should then resilver the data across the restored 4-drive storage pool.
Is that really all there is? Any precautions I should take before proceeding (besides backing up, which is unfortunately not an option in this case)?
Last edited by stefano (2014-10-05 13:22:56)
Update:
Actually, there is something else going on besides the failed drive that puzzles me. lsblk now reports only four drives total (the three still attached to the ZFS storage pool, plus the external USB drive where Arch Linux resides):
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0        2:0    1     4K  0 disk
sda        8:0    0   1.8T  0 disk
|-sda1     8:1    0   1.8T  0 part
`-sda9     8:9    0     8M  0 part
sdb        8:16   0  55.9G  0 disk
`-sdb1     8:17   0    55G  0 part /
sdc        8:32   0   1.8T  0 disk
|-sdc1     8:33   0   1.8T  0 part
`-sdc9     8:41   0     8M  0 part
sdd        8:48   0   1.8T  0 disk
|-sdd1     8:49   0   1.8T  0 part
`-sdd9     8:57   0     8M  0 part
sr0       11:0    1  1024M  0 rom
Interestingly, the external (boot) USB drive now appears as sdb, where the failed drive used to be. The drives in the ZFS pool are four identical 1.8 TB drives; the USB boot device is a 55 GB hard drive. I don't know what to make of this situation. Perhaps it is the motherboard that's failing, and not the hard drive?
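One way to check which physical drives the kernel actually sees, independent of the shifting sdX names, is to list the persistent ids (just a rough check; the grep pattern assumes all four pool drives share the ST2000DM001 model prefix shown in zpool status):
ls -l /dev/disk/by-id/ | grep ata-ST2000DM001
If the failed drive's id (ata-ST2000DM001-1CH164_W1E3XBWV) no longer shows up, the disk has simply dropped off the bus and the remaining devices have shifted down a letter, which would explain why the USB drive now appears as sdb.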
You're using disk ids, so you need to use the disk ids in the replace command, fully qualified with /dev/disk/by-id/.
I don't know what your replacement drive's id is, so update the following:
zpool replace zfsdatapool /dev/disk/by-id/ata-ST2000DM001-1CH164_W1E3XBWV /dev/disk/by-id/<replacement drive id>
If the command errors with 'no such device in pool', try symlinking the old disk id to /dev/null and retry:
ln -s /dev/null /dev/disk/by-id/ata-ST2000DM001-1CH164_W1E3XBWV
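Once the replace is accepted, you can follow the resilver with zpool status (the 60-second interval below is arbitrary):
zpool status zfsdatapool
watch -n 60 zpool status zfsdatapool
The pool should return to ONLINE when the resilver finishes.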
Thanks, it worked perfectly once I added the -f flag to the replace command (as suggested in the ZFS wiki page for EFI-related problems).
For future reference, I just swapped in the new drive where the old one was, checked the new drive's id with
ls -lah /dev/disk/by-id
and then replaced the old drive with the new one in the storage pool with:
zpool replace zfsdatapool /dev/disk/by-id/<failed-drive-id> /dev/disk/by-id/<new-drive-id>
It took ZFS about 12 hours to resilver the new drive, but now it's back to normal.
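For anyone following along later, a quick sanity check after the resilver (the scrub is optional and will itself take a while on a pool this size):
zpool status zfsdatapool
zpool scrub zfsdatapool
zpool status should show the pool ONLINE with no known data errors once everything is done.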