Harddrive trouble after accidentally cutting off power

mhelvens · 2015-10-13 19:43:56

Hi all,

Earlier today, while vacuuming, I accidentally knocked loose the power cable of my Arch box. The cable also made and lost contact again for a split second when I moved the box to investigate.

When starting the PC properly again, I get a whole bunch of dmesg error messages output on my root prompt. It seems there's something wrong with my main SSD partition `/dev/sda3` (btrfs). Here's a fragment, which is repeated almost endlessly (full output on pastebin):

[10311.906582] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[10311.906585] ata1.00: irq_stat 0x40000001
[10311.906586] ata1.00: failed command: READ DMA
[10311.906588] ata1.00: cmd c8/00:20:a0:a4:39/00:00:00:00:00/e9 tag 5 dma 16384 in
                        res 51/40:20:a4:a4:39/00:00:09:00:00/40 Emask 0x9 (media error)
[10311.906589] ata1.00: status: { DRDY ERR }
[10311.906590] ata1.00: error: { UNC }
[10311.910566] ata1.00: configured for UDMA/133
[10311.910586] sd 0:0:0:0: [sda] UNKNOWN Result: hostbyte=0x00 driverbyte=0x08
[10311.910599] sd 0:0:0:0: [sda] Sense Key : 0x3 [current] [descriptor]
[10311.910600] sd 0:0:0:0: [sda] ASC=0x11 ASCQ=0x4
[10311.910600] sd 0:0:0:0: [sda] CDB: 
[10311.910601] cdb[0]=0x28: 28 00 09 39 a4 a0 00 00 20 00
[10311.910604] blk_update_request: I/O error, dev sda, sector 154772640
[10311.910606] BTRFS: bdev /dev/sda3 errs: wr 0, rd 1009, flush 0, corrupt 0, gen 0

However, if I just ignore those scrolling messages, everything seems to be working fine. I can work in Gnome without being any the wiser. But I'm still worried, of course.

Have I damaged my SSD? Can it be repaired?

Thanks!

Edit: I just encountered my first actual problem caused by this. I tried to save metadata on an e-book with Calibre and got this error:

calibre, version 2.23.0
ERROR: Unhandled exception: <b>OSError</b>:[Errno 116] Stale file handle: '/home/mhelvens/Books/Jarrod Overson'

calibre 2.23  isfrozen: False is64bit: True
Linux-3.19.3-3-ARCH-x86_64-with-glibc2.2.5 Linux ('64bit', 'ELF')
('Linux', '3.19.3-3-ARCH', '#1 SMP PREEMPT Wed Apr 8 14:10:00 CEST 2015')
Python 2.7.9
Linux: ('', '', '')
Successfully initialized third party plugins: DeDRM
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/gui2/metadata/single.py", line 562, in accept
    if not self.apply_changes():
  File "/usr/lib/calibre/calibre/gui2/metadata/single.py", line 537, in apply_changes
    widget.commit(self.db, self.book_id)
  File "/usr/lib/calibre/calibre/gui2/metadata/basic_widgets.py", line 213, in commit
    getattr(db, 'set_'+ self.TITLE_ATTR)(id_, title, notify=False)
  File "/usr/lib/calibre/calibre/db/legacy.py", line 814, in func
    ret = self.new_api.set_field(field, {book_id:val})
  File "/usr/lib/calibre/calibre/db/cache.py", line 57, in call_func_with_lock
    return func(*args, **kwargs)
  File "/usr/lib/calibre/calibre/db/cache.py", line 1044, in set_field
    self._update_path(dirtied, mark_as_dirtied=False)
  File "/usr/lib/calibre/calibre/db/cache.py", line 1058, in update_path
    self.backend.update_path(book_id, title, author, self.fields['path'], self.fields['formats'])
  File "/usr/lib/calibre/calibre/db/backend.py", line 1492, in update_path
    os.makedirs(tpath)
  File "/usr/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 116] Stale file handle: '/home/mhelvens/Books/Jarrod Overson'

Last edited by mhelvens (2015-10-13 19:48:41)

graysky · 2015-10-13 19:51:18

Boot to a live environment and fsck each partition to check for damage. For ext4 it's `e2fsck -fv /dev/sdxy`

mhelvens · 2015-10-13 20:17:03

graysky wrote:

Boot to a live environment and fsck each partition to check for damage. For ext4 it's `e2fsck -fv /dev/sdxy`

I ran `btrfs check --repair /dev/sda3`. I got a bunch of dmesg errors like the following (copied by hand):

[...] ata1.00 exception Emask 0x0 SAct 0x1000 SErr 0x0 action 0x0
[...] ata1.00: irq_stat 0x40000008
[...] ata1.00: failed command: READ FPDMA QUEUED
[...] ata1.00: cmd 60/01:60:a6:a4:39/00:00:09:00:00/40 tag 12 ncq 512 in
[...]          res 41/40:01:a6:a4:39/00:00:09:00:00/40 Emask 0x409 (media error) <F>
[...] ata1.00: status: { DRDY ERR }
[...] ata1.00: error: { UNC }
[...] end_request: I/O error, dev sda, sector 154772646
[...] Buffer I/O error on device sda3, logical block 85564582

Then it finished quickly with the following output (copied by hand):

Check tree block failed, want=43809062912, have=0
read block failed check_tree_block
failed to repair damaged filesystem, aborting

Doesn't sound good. Advice would be most appreciated.

graysky · 2015-10-13 20:46:57

I can't help... btrfs fucked me a while ago and I haven't revisited using it. I don't believe it is reliable following an ungraceful shutdown as you have observed assuming there isn't physical damage to your SSD. I recommend that you copy off your data, nuke the partition table and use a stable fs like ext4 for the restored system. Should be easy with tar or rsync assuming you have the space on a clean disk.

mhelvens · 2015-10-13 20:56:44

graysky wrote:

I can't help... btrfs fucked me a while ago and I haven't revisited using it. I don't believe it is reliable following an ungraceful shutdown as you have observed assuming there isn't physical damage to your SSD. I recommend that you copy off your data, nuke the partition table and use a stable fs like ext4 for the restored system. Should be easy with tar or rsync assuming you have the space on a clean disk.

That's not a bad thought, and I do have the space. I have some important questions about this idea, though:

1. How do I copy this data as faithfully as possible, knowing that it is actually my system disk? I'm thinking about permissions and symlinks and such.
2. Related to the previous: there are almost definitely some corrupted files on there now. Can I copy while gracefully skipping those? Can I get a list of them?
3. After I nuke the partition table, how can I find out whether the SSD is actually damaged?

Thanks!

graysky · 2015-10-13 21:22:19

1. I usually use bsdtar like `cd /path/to/damaged/disk ; bsdtar cf /path/to/archive/system.tar --exclude='*.pkg.tar.xz' ./`

You can also double down and rsync as well: `rsync -avP /path/to/backup/ /path/to/put/the/backup/`. Note the trailing slashes!

2. Not sure about that.
3. Not sure about that... does smartmontools offer anything diagnostic for SSDs?

Last edited by graysky (2015-10-13 21:22:52)

WorMzy · 2015-10-13 21:24:43

Filesystem preferences aside, I/O errors on the hardware are never a good sign. Use smartctl to check the drive's health.
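
A minimal sketch of such a check, assuming smartmontools is installed and the drive is `/dev/sda`. The wrapper name `smart_check` and the `SMART_RUN` switch are made up for illustration; by default it only prints the commands so they can be reviewed first. Which attributes an SSD reports, and what they mean, depends on the vendor.

```shell
# Hypothetical wrapper around smartctl (smartmontools).
# Prints the commands by default; set SMART_RUN=1 to actually run them as root.
smart_check() {
    dev=$1
    # -H        overall health self-assessment
    # -A        vendor-specific attributes
    # -l error  the drive's own error log
    # -t long   start an extended self-test in the background
    #           (read the result later with: smartctl -l selftest "$dev")
    for args in "-H" "-A" "-l error" "-t long"; do
        if [ "${SMART_RUN:-0}" = 1 ]; then
            # $args is left unquoted on purpose so "-l error" splits into two words.
            smartctl $args "$dev"
        else
            echo "smartctl $args $dev"
        fi
    done
}
```

Calling `smart_check /dev/sda` lists the four commands; `SMART_RUN=1 smart_check /dev/sda` runs them.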

mich41 · 2015-10-14 09:49:13

UNC means uncorrectable read error. Most likely the SSD was in the middle of updating its flash translation tree when it lost power and parts of the tree became lost. You will be unable to read some data.
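
The failing location can be cross-checked against the log in the first post. A small sketch, assuming the `cdb[0]=0x28` line is a 10-byte SCSI READ(10) command, where bytes 2 to 5 hold the big-endian logical block address; `cdb_lba` is a hypothetical helper name:

```shell
# Hypothetical helper: extract the LBA from a READ(10) CDB given as hex bytes.
# Assumes opcode 0x28, whose bytes 2..5 are the big-endian logical block address.
cdb_lba() {
    # Split the CDB string into positional parameters ($1 = opcode, $3..$6 = LBA).
    set -- $1
    if [ "$1" != "28" ]; then
        echo "not a READ(10) CDB" >&2
        return 1
    fi
    # Join the four address bytes and let printf convert hex to decimal.
    printf '%d\n' "0x$3$4$5$6"
}

cdb_lba "28 00 09 39 a4 a0 00 00 20 00"   # prints 154772640
```

That is the same number as in the `blk_update_request: I/O error, dev sda, sector 154772640` line, so the kernel, the SCSI layer and the drive all agree on where the read fails.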

Start by copying everything that can be read to another disk. If some important files are unreadable, you may have some luck with photorec (provided that only metadata is lost).

Then secure-erase the disk to reset the flash translation layer state, optionally test it with dd and secure-erase it again, then recreate the partitions and reformat.

EDIT:
As for copying, I've always just used `cp -a`. IIRC it skips unreadable files and prints an error message for each of them, so you may want to add `2>/somewhere/errors.txt`.
Files from packages can be easily recovered by reinstalling all packages.
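
A minimal sketch of that copy step, assuming a live environment with the damaged filesystem mounted read-only. The function name `salvage_copy` and all paths are placeholders, not something from this thread:

```shell
# Hypothetical helper: copy a tree with cp -a and keep cp's error messages in a log.
# -a preserves permissions, timestamps and symlinks (and ownership when run as root).
salvage_copy() {
    src=$1 dst=$2 log=$3
    mkdir -p "$dst" || return 1
    # "$src/." copies the directory's contents, including dotfiles.
    # cp returns non-zero if any file failed; the messages end up in "$log".
    cp -a "$src/." "$dst/" 2> "$log"
}
```

For example, `salvage_copy /mnt/damaged /mnt/backup /root/cp-errors.txt`. Afterwards the log should name the files that could not be read, which gives a rough list of what to restore from packages or other backups.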

Last edited by mich41 (2015-10-14 10:00:40)

mhelvens · 2015-10-14 09:56:25

You've all been very helpful!

I unfortunately cannot afford to try it this week, since I need my computer working (I'm lucky it's still working as well as it is). I'll definitely follow your suggestions first thing next week, and post some feedback here.

Cheers!

Last edited by mhelvens (2015-10-14 09:58:56)

graysky · 2015-10-14 10:22:45

+1 for secure erase. The wiki details it.
