#1 2014-06-25 17:35:35

bioe007
Member
Registered: 2007-11-12
Posts: 56

DMA read errors: hardware problem?

My server box is a pretty simple ASUS K8NDRE with 4 GB of RAM, several optical drives and 5 HDDs (one is an old IDE drive). It has been running Arch since around 2008 with no real problems to speak of. Now I can't boot it and I think it may be a bad hard drive, but I'm not convinced.

I used to occasionally get errors similar to this:

[8423902.269740] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[8423902.269776] ata2.00: irq_stat 0x40000001
[8423902.269805] ata2.00: failed command: READ DMA

I apologize for not using the exact message but I can no longer boot the box or conveniently get access to the error logs. I might be able to grab something later with a live cd.

I *think* this was always associated with the same drive: a WD 500GB drive with partitions for /boot and /root, with the rest of the drive part of an LVM group.

About a month ago I got a few lockups with messages like those above showing on-screen. After resetting the machine things seemed OK. I did some googling and the prevailing wisdom seemed to be a bad drive or maybe bad SATA cables. So I replaced the cable to that drive and the problem seemed to go away. I did run SMART tests but nothing turned up.

A couple of days ago the box locked up with a screen full of error messages similar to the above, and now it cannot boot at all. I am certain it is the one drive because I tried moving the cables around and tracking which 'ataX' was listed in the error. The same drive was problematic regardless of cable or SATA connector on the motherboard.
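For anyone repeating the cable-swapping test, tallying the failures per port is easier than reading a screen full of messages. This is only a sketch: the function name `count_ata_failures` is made up for illustration, and it assumes the kernel lines look like the `ataN.NN: failed command: ...` example quoted earlier in this post.

```shell
#!/bin/sh
# Count "failed command" lines per ATA port in kernel log text read from stdin.
# Assumes log lines shaped like: "[ts] ata2.00: failed command: READ DMA".
count_ata_failures() {
    awk '/failed command:/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+(\.[0-9]+)?:$/) {
                     sub(/:$/, "", $i)   # drop the trailing colon from the port name
                     n[$i]++
                 }
         }
         END { for (p in n) print p, n[p] }' | sort
}

# Typical use on a live system (needs permission to read the kernel log):
#   dmesg | count_ata_failures
#   journalctl -k -b -1 | count_ata_failures   # previous boot, if the journal is persistent
```

If the same drive keeps showing up under whichever `ataX` port it is plugged into, that matches the conclusion above that the fault follows the drive and not the cable or connector.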

So I thought then it *must* be a bad HDD and I popped in a trinity rescue kit (TRK).

Using TRK I was able to 'mountallfs -l' and no errors are reported. I can view all the contents of all drives including LVM members and I don't see any errors. I thought if the HDD was so bad Arch could no longer boot I'd at least see an error or two (?)

I unmounted the disks and ran 'badblocks -sv' on all my hard drives: zero bad blocks reported. I'm no drive-recovery guru, but something seems weird about this. No bad blocks, SMART doesn't report anything super funky, and yet I still can't boot?
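One thing that may help separate a cable problem from a drive problem is looking at a few raw SMART counters instead of the overall pass/fail result. As a rough rule, a climbing UDMA_CRC_Error_Count tends to point at the link (cable, connector, controller), while reallocated or pending sectors point at the disk surface. The sketch below assumes the usual ten-column table printed by `smartctl -A`, where the raw value is the tenth field; the function name `smart_summary` is hypothetical, and not every drive reports all four attributes.

```shell
#!/bin/sh
# Filter `smartctl -A` output (read from stdin) down to four attributes
# and print "NAME RAW_VALUE" for each one that is present.
# Assumes the standard table layout, with the raw value in column 10.
smart_summary() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ {
             print $2, $10
         }'
}

# Typical use (as root; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | smart_summary
```

Comparing these numbers before and after a lockup says more than a single snapshot, since the counters are cumulative.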

Now I'm not sure what the problem is, could it be a kernel issue?

I have picked up another TB drive and partitioned it like the 'problem' drive. I'm thinking of just trying to dd if=/dev/sdaX of=/dev/sdcX for all of the bad drive's partitions. If I do this and try to boot off the new drive, will LVM2 have a fit or anything?

Basically I'm asking for advice on what to do/test/try next. My highest priority is, of course, to preserve the information on the drive.

Thanks

#2 2014-06-25 18:42:04

nstgc
Member
Registered: 2014-03-17
Posts: 393

Re: DMA read errors: hardware problem?

From what I recall from when this happened to me (and I similarly flipped out), the cause is actually in the cable, not the drive itself. Replacing the cable should prevent further issues.

#3 2014-06-25 18:48:18

MoonSwan
Member
From: Great White North
Registered: 2008-01-23
Posts: 881

Re: DMA read errors: hardware problem?

Try the drive in a different computer?  It may be a quirk of that motherboard, although this seems like a long shot.  Is RAID involved?  Are there errors in the systemd journal?

If you're going to use dd at all I would highly recommend looking at ddrescue as an option.
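A common way to run GNU ddrescue is in two passes with a map file, so the easy data is copied first and the bad areas are retried afterwards. The sketch below only prints the commands instead of running them, since they overwrite the destination. The function name `ddrescue_plan`, the device names and the map file name are placeholders, not anything from this thread.

```shell
#!/bin/sh
# Print (not run) a two-pass GNU ddrescue plan for one partition.
# $1 = source partition, $2 = destination partition, $3 = map file.
ddrescue_plan() {
    src=$1; dst=$2; map=$3
    # Pass 1: copy everything that reads cleanly, skipping the slow scraping phase.
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: return to the bad areas with direct disc access, up to 3 retry passes.
    echo "ddrescue -f -d -r3 $src $dst $map"
}

# Placeholder names; double-check source and destination before running the output.
ddrescue_plan /dev/sdX1 /dev/sdY1 sdX1.map
```

The map file is what lets an interrupted run resume where it stopped, so it should be kept somewhere other than the source or destination partition.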

#4 2014-06-25 20:07:07

bioe007
Member
Registered: 2007-11-12
Posts: 56

Re: DMA read errors: hardware problem?

nstgc wrote:

From what I recall from when this happened to me (and I similarly flipped out), the cause is actually in the cable, not the drive itself. Replacing the cable should prevent further issues.

Thanks, I have actually already tried that. I bought two new cables and tried them both. I've tried swapping them around all the drives and it seems only this one is acting funky.

I thought it was fixed when I changed the cable but then this happened.

MoonSwan wrote:

Try the drive in a different computer?  It may be a quirk of that motherboard, although this seems like a long shot.  Is RAID involved?  Are there errors in the systemd journal?

If you're going to use dd at all I would highly recommend looking at ddrescue as an option.

I don't have another desktop machine at all.

No RAID in use.

I haven't been into the logs since it crashed, as I don't want to disturb any of the contents on the off chance it could corrupt anything further. Is it safe to mount things read-only? I read somewhere that a dying HD may not honor the read-only mount.

I'll check out the ddrescue link, thanks for that, although the warning at the top kind of makes me nervous :)

#5 2014-06-26 00:42:32

MoonSwan
Member
From: Great White North
Registered: 2008-01-23
Posts: 881

Re: DMA read errors: hardware problem?

I understand about your misgivings regarding mounting the partitions.  Mounting them as read only should work, in my experience anyway, and from there pull data from them.  I'd get System Rescue CD if only because it specialises in helping to fix these situations.  Another option is to get the latest Ultimate Boot CD because it contains a lot of low level utilities to deal with hard drive issues.  It also had Parted Magic on it but I don't know if that is still the case.  I found PM very handy when things went south but the developer has recently changed to a subscription service to get his CDs and I don't know whether UBCD still carries PM because of that.
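On the read-only question: the mount(8) man page notes that a plain `ro` mount of a dirty ext3 or ext4 filesystem can still replay the journal, and suggests `ro,noload` or setting the block device itself read-only with blockdev(8). A minimal sketch, assuming ext3/ext4 filesystems (`noload` is specific to those); the function name `ro_mount_plan`, the device and the mount point are placeholders, and the commands are printed, not run.

```shell
#!/bin/sh
# Print (not run) the commands for a cautious read-only mount.
# $1 = block device (partition or LVM logical volume), $2 = mount point.
ro_mount_plan() {
    dev=$1; mnt=$2
    # Ask the kernel to refuse writes to the block device itself.
    echo "blockdev --setro $dev"
    # Mount read-only and skip journal replay (ext3/ext4 only).
    echo "mount -o ro,noload $dev $mnt"
}

# Placeholder names for illustration.
ro_mount_plan /dev/sdXN /mnt/rescue
```

This only controls what the kernel sends to the disk. It cannot stop the drive's own firmware from remapping sectors, so imaging the drive first is still the safer order of operations.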

Edit: fixed typos.

Last edited by MoonSwan (2014-06-26 00:44:07)

#6 2014-06-29 03:18:20

bioe007
Member
Registered: 2007-11-12
Posts: 56

Re: DMA read errors: hardware problem?

OK, an update.

I am using ddrescue now and it also is finding no bad blocks.

wth? anybody got an idea?

Also, I'm not sure what to do about 'telling' lvm to start using this new drive (?)
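As far as I understand it, a bit-for-bit copy of an LVM partition carries the same physical volume UUID as the original, so LVM should not need to be told anything as long as only one of the two copies is attached. With both disks connected, LVM sees duplicate PVs and complains. The sketch below spots that situation from `blkid` output; the function name `duplicate_lvm_pvs` is made up, and it assumes blkid's default one-line-per-device format with a `UUID="..."` field.

```shell
#!/bin/sh
# Read `blkid -t TYPE=LVM2_member` output from stdin and print every
# PV UUID that appears on more than one device, followed by those devices.
duplicate_lvm_pvs() {
    awk '{
             dev = $1; sub(/:$/, "", dev)          # "/dev/sda3:" -> "/dev/sda3"
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^UUID=/) {
                     u = $i; gsub(/^UUID="|"$/, "", u)
                     devs[u] = devs[u] " " dev
                     n[u]++
                 }
         }
         END { for (u in n) if (n[u] > 1) print u devs[u] }'
}

# Typical use (as root):
#   blkid -t TYPE=LVM2_member | duplicate_lvm_pvs
```

If this prints nothing, there is no duplicate visible and `vgchange -ay` followed by `lvs` should show the volumes on whichever disk is attached.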
