I have a server with a ZFS raidz2 storage pool containing 4 hard drives. There are no spare ports, and Arch Linux is on an external boot device (a USB drive).
Yesterday one of the physical hard drives failed rather suddenly (lots of read/write errors showed up in dmesg). ZFS resilvered the data onto the three remaining drives and removed the failed one from the pool. zpool status reports no data loss, even though the pool is in a degraded state now:
  pool: zfsdatapool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 1.36M in 0h19m with 0 errors on Wed Oct  1 03:59:16 2014
config:

        NAME                                 STATE     READ WRITE CKSUM
        zfsdatapool                          DEGRADED     0     0     0
          raidz2-0                           DEGRADED     0     0     0
            ata-ST2000DM001-1CH164_W1E3XC6P  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_W1E3XBWV  UNAVAIL      0     0     0
            ata-ST2000DM001-1CH164_W1E3XC4D  ONLINE       0     0     0
            ata-ST2000DM001-1CH164_W1E3XCK2  ONLINE       0     0     0

errors: No known data errors
I ordered a new hard drive and will replace it later today. My question is how to do it: I am rather new to ZFS and have never done this before. According to http://zfsonlinux.org/msg/ZFS-8000-9P/,
after physically swapping in the new drive for the failed one, all I need to do is issue:
# zpool replace zfsdatapool sdb
ZFS should then resilver the data across the restored 4-drive storage pool.
Is that really all there is? Any precautions I should take before proceeding (besides backing up, which is unfortunately not an option in this case)?
Last edited by stefano (2014-10-05 13:22:56)
Update:
Actually, there is something else going on besides the failed drive that puzzles me. lsblk now reports only four drives total (the three still attached to the ZFS storage pool, plus the external USB drive where Arch Linux resides):
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0        2:0    1     4K  0 disk
sda        8:0    0   1.8T  0 disk
|-sda1     8:1    0   1.8T  0 part
`-sda9     8:9    0     8M  0 part
sdb        8:16   0  55.9G  0 disk
`-sdb1     8:17   0    55G  0 part /
sdc        8:32   0   1.8T  0 disk
|-sdc1     8:33   0   1.8T  0 part
`-sdc9     8:41   0     8M  0 part
sdd        8:48   0   1.8T  0 disk
|-sdd1     8:49   0   1.8T  0 part
`-sdd9     8:57   0     8M  0 part
sr0       11:0    1  1024M  0 rom
Interestingly, the external (boot) USB drive now appears as sdb, where the failed drive used to be. The drives in the ZFS pool are four identical 1.8 TB drives; the USB boot device is a 55 GB hard drive. I don't know what to make of this situation. Perhaps it is the motherboard that's failing, and not the hard drive?
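One way to check which physical drives the kernel actually sees, independent of the shifting sdX names, is to list the persistent ids (just a rough check; the grep pattern assumes all four pool drives share the ST2000DM001 model prefix shown in zpool status):
ls -l /dev/disk/by-id/ | grep ata-ST2000DM001
If the failed drive's id (ata-ST2000DM001-1CH164_W1E3XBWV) no longer shows up, the disk has simply dropped off the bus and the remaining devices have shifted down a letter, which would explain why the USB drive now appears as sdb.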
You're using disk ids, so you need to use the disk ids in the replace command, fully qualified with /dev/disk/by-id/.
I don't know what your replacement drive's id is, so update the following:
zpool replace zfsdatapool /dev/disk/by-id/ata-ST2000DM001-1CH164_W1E3XBWV /dev/disk/by-id/<replacement drive id>
If the command errors with 'no such device in pool', try symlinking the old disk id to /dev/null and retry:
ln -s /dev/null /dev/disk/by-id/ata-ST2000DM001-1CH164_W1E3XBWV
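Once the replace is accepted, you can follow the resilver with zpool status (the 60-second interval below is arbitrary):
zpool status zfsdatapool
watch -n 60 zpool status zfsdatapool
The pool should return to ONLINE when the resilver finishes.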
Thanks, it worked perfectly once I added the -f flag to the replace command (as suggested in the ZFS wiki page for EFI-related problems).
For future reference, I just swapped in the new drive where the old one was, checked the new drive's id with
ls -lah /dev/disk/by-id
and then replaced the old drive with the new one in the storage pool with:
zpool replace zfsdatapool /dev/disk/by-id/<failed-drive-id> /dev/disk/by-id/<new-drive-id>
It took ZFS about 12 hours to resilver the new drive, but now it's back to normal.
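For anyone following along later, a quick sanity check after the resilver (the scrub is optional and will itself take a while on a pool this size):
zpool status zfsdatapool
zpool scrub zfsdatapool
zpool status should show the pool ONLINE with no known data errors once everything is done.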