hard drive array losing access - suspect controller - zfs

wolfdogg · 2013-07-21 19:31:05

I am having a problem with one of my arrays, which is a zfs filesystem. It is a linear array made up of 1x500GB, 2x750GB, and 1x2TB drives, and the pool is named 'pool'. I have to mention that I don't have enough hard drives for a raidz (raid5) setup yet, so there is no actual redundancy. That means zfs can't auto-repair itself from a copy, because there is none, so all the auto-repair features can be thrown out the door in this equation. I believe that makes it possible for the filesystem to be corrupted easily by the controller, which is what I suspect in this specific case. Please keep that in mind while reading the following.

I just upgraded my binaries, and as part of that I removed zfs-fuse and installed archzfs. Did I remove zfs-fuse completely? Not sure. I wasn't able to get my array back up and running until I fiddled with the sata cables, moved the sata connectors around, and tinkered with the bios drive detection. After I got it running, I copied some files off it over samba, thinking it might not last long. The copy was successful, but problems began surfacing again shortly after. So now I suspect I have a bad controller on my gigabyte board. I recently found someone else who had this issue, so I'm thinking it's not the hard drive.

I ran some smartmontools tests last night and all drives passed a short test. Today I'm not having so much luck getting access. There are hangs on reboot, and the drive light stays on. When I try to run zfs and zpool commands, I get messages that the system is hanging. I have also been getting what appear to be HD errors. I'll have to type them in here manually, since I can't copy and paste from the console to the machine I'm posting from, and the errors aren't showing up over ssh or I would copy them from the terminal that I currently have open.

ata7: SRST failed (errno=-16)
reset failed, giving up,
end_request I/O error, dev sdc, sector 637543760
end_request I/O error, dev sdc, sector 637543833
sd 6:0:0:0 got wrong page
sd 6:0:0:0 asking for cache data failed
sd 6:0:0:0 assuming drive cache: write through
info task txg_sync:348 blocked for more than 120 seconds

And so forth. I see this each time I boot, which makes me feel that the HD is going bad; however, I still want to believe it's the controller.
Note: it seems only those two sectors show up. Is it possible that the controller trashed those two sectors with bad data? (Note: I previously had a Windows system installed on this motherboard, and after a few months of running it lost a couple of raid arrays of data as well.)
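
To check whether it really is only those two sectors, something like the following could be run against the kernel log. This is a rough sketch, not a tested tool: the function name `failed_sectors` is made up for illustration, and it assumes the log lines contain the text "I/O error, dev sdc, sector N" as typed above.

```shell
# Hypothetical helper: print the distinct sector numbers that appear in
# "I/O error" lines for one device. Reads kernel log text on stdin.
failed_sectors() {
    dev="$1"
    # Keep only the I/O error lines for this device, pull out the number
    # after "sector", then sort numerically and drop duplicates.
    grep "I/O error, dev $dev," \
        | sed -n 's/.*sector \([0-9][0-9]*\).*/\1/p' \
        | sort -n -u
}
```

Usage would be along the lines of `dmesg | failed_sectors sdc`. If the list stays at the same couple of sectors across reboots, that is at least worth noting when deciding between drive and controller.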

failed command: WRITE DMA EXT
... more stuff here...
ata7.00 error DRDY ERR 
ICRC ABRT

blah blah blah.

So now I can give you some info from the diagnosis I'm doing on it, copied from a shell terminal. Note that the following metadata errors JUST appeared after I tried to delete some files; copying didn't cause this. So it appears that either something is currently degrading, or it just inevitably happened because of a bad controller.

[root@falcon wolfdogg]# zpool status -v
  pool: pool
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: resilvered 33K in 0h0m with 0 errors on Sun Jul 21 03:52:53 2013
config:

        NAME                                         STATE     READ WRITE CKSUM
        pool                                         ONLINE       0    26     0
          ata-ST2000DM001-9YN164_W1E07E0G            ONLINE       6    41     0
          ata-ST3750640AS_5QD03NB9                   ONLINE       0     0     0
          ata-ST3750640AS_3QD0AD6E                   ONLINE       0     0     0
          ata-WDC_WD5000AADS-00S9B0_WD-WCAV93917591  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x1>
        <metadata>:<0x14>
        <metadata>:<0x15>
        <metadata>:<0x16d>
        <metadata>:<0x171>
        <metadata>:<0x277>
        <metadata>:<0x179>

If one of the devices is faulted, then why are all 4 showing ONLINE???
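
The STATE column reads ONLINE for every disk, but the READ/WRITE/CKSUM counters in the output above single out the 2TB drive. A minimal sketch for filtering out just the rows with nonzero counters; the function name `nonzero_counters` is made up for illustration, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above.

```shell
# Hypothetical helper: read `zpool status` output on stdin and print only
# the config rows where READ, WRITE or CKSUM is not zero.
nonzero_counters() {
    awk '
        # Config rows have exactly 5 fields, and the last three start
        # with a digit (this skips the NAME/STATE header row).
        NF == 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ {
            if ($3 != "0" || $4 != "0" || $5 != "0")
                print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}
```

Usage would be something like `zpool status pool | nonzero_counters`.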

[root@falcon dev]# smartctl -a /dev/sdc
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.9.9-1-ARCH] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               /6:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

my drive list

[root@falcon wolfdogg]# ls -lah /dev/disk/by-id/
total 0
drwxr-xr-x 2 root root 280 Jul 21 03:52 .
drwxr-xr-x 4 root root  80 Jul 21 03:52 ..
lrwxrwxrwx 1 root root   9 Jul 21 03:52 ata-_NEC_DVD_RW_ND-2510A -> ../../sr0
lrwxrwxrwx 1 root root   9 Jul 21 03:52 ata-ST2000DM001-9YN164_W1E07E0G -> ../../sdc
lrwxrwxrwx 1 root root   9 Jul 21 03:52 ata-ST3250823AS_5ND0MS6K -> ../../sdb
lrwxrwxrwx 1 root root  10 Jul 21 03:52 ata-ST3250823AS_5ND0MS6K-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Jul 21 03:52 ata-ST3250823AS_5ND0MS6K-part2 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Jul 21 03:52 ata-ST3250823AS_5ND0MS6K-part3 -> ../../sdb3
lrwxrwxrwx 1 root root  10 Jul 21 03:52 ata-ST3250823AS_5ND0MS6K-part4 -> ../../sdb4
lrwxrwxrwx 1 root root   9 Jul 21 03:52 ata-ST3750640AS_3QD0AD6E -> ../../sde
lrwxrwxrwx 1 root root   9 Jul 21 03:52 ata-ST3750640AS_5QD03NB9 -> ../../sdd
lrwxrwxrwx 1 root root   9 Jul 21 03:52 ata-WDC_WD5000AADS-00S9B0_WD-WCAV93917591 -> ../../sda
lrwxrwxrwx 1 root root   9 Jul 21 03:52 wwn-0x5000c50045406de0 -> ../../sdc
lrwxrwxrwx 1 root root   9 Jul 21 03:52 wwn-0x50014ee1ad3cc907 -> ../../sda
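
The kernel errors name `sdc` while `zpool status` names disks by id, so a shorter two-column view of the listing above can help match them up. A small sketch, assuming `readlink -f` is available (it is in GNU coreutils); the function name `byid_map` is hypothetical.

```shell
# Hypothetical helper: print "by-id name  kernel device" for whole disks.
# Takes the directory to scan as an optional argument.
byid_map() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/ata-*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        # Skip partition links such as ...-part1.
        case "$link" in *-part[0-9]*) continue ;; esac
        printf '%s %s\n' "$(basename "$link")" "$(basename "$(readlink -f "$link")")"
    done
}
```

Running `byid_map` with no argument scans /dev/disk/by-id.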

And this one I don't get:

[root@falcon dev]# zfs list
no datasets available

I remember creating a dataset last year. Why is it reporting none, while the pool is still working?

Is anybody seeing any patterns here? I'm prepared to destroy the pool and recreate it just to see if it's bad data. But since the problem appears to happen only on the 2TB drive, what I'm thinking now is that either the controller just can't handle it or the drive is bad. So there might be hope if I can rule out the controller. I only have 4 sata ports on the mobo, so one of the drives in the array is connected to an add-in card (pci to sata) instead. I keep the 500GB connected there and have not yet tried the 2TB on it. So if I connect the 2TB drive to that card, I should see the problems disappear, unless the drive has already been corrupted.
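
For the drive swap itself, a cautious sequence might look like the sketch below. It is only a sketch under assumptions: it prints the commands by default instead of running them (the `DRY_RUN` switch and the function names are made up for illustration), and the pool name `pool` is taken from the output above. The `zpool export`, `import -d`, `clear`, `scrub` and `status -v` subcommands are standard zpool commands.

```shell
# Sketch only: prints each command unless DRY_RUN=0 is set.
POOL="${POOL:-pool}"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run before powering down and moving the 2TB drive to the other card.
before_move() {
    run zpool export "$POOL"
}

# Run after booting with the drive on the other controller.
after_move() {
    run zpool import -d /dev/disk/by-id "$POOL"  # find disks by id, not sdX
    run zpool clear "$POOL"                      # reset the error counters
    run zpool scrub "$POOL"                      # re-read and verify all data
    run zpool status -v "$POOL"                  # check counters and errors
}
```

With no redundancy, a scrub can detect checksum errors in file data but has no second copy to repair them from, so the point here is only to see whether new errors keep appearing on the other controller.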

Does anyone with experience in the arch forums know what's going on here? Did I mess up by not completely removing zfs-fuse, is my HD going bad, is my controller bad, or did ZFS just get misconfigured?

Last edited by wolfdogg (2013-07-21 19:38:51)

wolfdogg · 2013-07-21 19:49:28

OK, something interesting happened when I connected it (the misbehaving 2TB drive) to the pci sata card. First of all, no errors on boot.... Then take a look at this: some clues pointing to remnants of the older zfs-fuse setup, and a working pool.

[root@falcon wolfdogg]# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
pool                 2.95T   636G    23K  /pool
pool/backup          2.95T   636G  3.49G  /backup
pool/backup/falcon   27.0G   636G  27.0G  /backup/falcon
pool/backup/redtail  2.92T   636G  2.92T  /backup/redtail
[root@falcon wolfdogg]# zpool status
  pool: pool
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: resilvered 33K in 0h0m with 0 errors on Sun Jul 21 04:52:52 2013
config:

        NAME                                         STATE     READ WRITE CKSUM
        pool                                         ONLINE       0     0     0
          ata-ST2000DM001-9YN164_W1E07E0G            ONLINE       0     0     0
          ata-ST3750640AS_5QD03NB9                   ONLINE       0     0     0
          ata-ST3750640AS_3QD0AD6E                   ONLINE       0     0     0
          ata-WDC_WD5000AADS-00S9B0_WD-WCAV93917591  ONLINE       0     0     0

errors: No known data errors

Am I looking at a bios update here, so the controller can talk to the 2TB properly?

Last edited by wolfdogg (2013-07-21 19:50:18)
