I have an older system (AMD Duron 800 MHz CPU), and since upgrading it to kernel 3.0 I've been seeing messages like this in dmesg during backup jobs:
[ 1000.053309] ata4: lost interrupt (Status 0x50)
[ 1000.053416] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 1000.053460] ata4.00: failed command: READ DMA
[ 1000.053494] ata4.00: cmd c8/00:08:9f:4f:0e/00:00:00:00:00/e4 tag 0 dma 4096 in
[ 1000.053498] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1000.053550] ata4.00: status: { DRDY }
[ 1000.053593] ata4: soft resetting link
[ 1000.220254] ata4.00: configured for UDMA/100
[ 1000.220290] ata4: EH complete
The system locks up for a second or two when this happens (sometimes for as long as 30 seconds to a minute), and then appears to resume running normally. Over the weekend the system locked up completely, but nothing in the logs hinted at why: there were none of those "lost interrupt" drive resets near the end of the logs. So I can't be sure that it's related, but I haven't noticed any other issues that might have caused it.
I'm backing up to a USB drive through a USB 2 card, and my hard drives are connected to an on-board Promise ATA/100 controller. lspci output for these devices:
00:11.0 Mass storage controller: Promise Technology, Inc. PDC20265 (FastTrak100 Lite/Ultra100) (rev 02)
00:0d.0 USB Controller: NEC Corporation USB (rev 43)
00:0d.1 USB Controller: NEC Corporation USB (rev 43)
00:0d.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
I've so far only seen those "lost interrupt" messages when the USB drive is plugged in. It happens reproducibly during any sustained I/O to or from the USB drive, and it seems to happen occasionally at other times as well, but only when the USB drive is connected. (e.g., it's happened a couple of times when the USB drive was connected but not mounted.)
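To check when these events occur against when the USB drive was connected, it can help to pull the "lost interrupt" lines out of a saved dmesg dump. The following is a minimal Python sketch, not a definitive tool: the names `LOST_IRQ`, `lost_interrupts` and `SAMPLE` are made up for illustration, and the regular expression assumes the timestamped line format shown in the log above.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this post:
# "[ 1000.053309] ata4: lost interrupt (Status 0x50)"
LOST_IRQ = re.compile(r"\[\s*(\d+\.\d+)\]\s+(ata\d+): lost interrupt")


def lost_interrupts(log_text):
    """Return a list of (timestamp_seconds, port) for each 'lost interrupt' line."""
    return [(float(m.group(1)), m.group(2)) for m in LOST_IRQ.finditer(log_text)]


# Three lines copied from the log excerpt above.
SAMPLE = """\
[ 1000.053309] ata4: lost interrupt (Status 0x50)
[ 1000.053416] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 1000.053593] ata4: soft resetting link
"""

events = lost_interrupts(SAMPLE)
print(events)                                # one event on ata4
print(Counter(port for _, port in events))   # events per ATA port
```

To run it on a real log, save the dmesg output to a file and pass the file's contents to `lost_interrupts` instead of `SAMPLE`.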
Any ideas?
~Felix.
Edit to add:
- I think I did have I/O problems with this USB 2 card many years ago (around 2007), that were solved by moving it to a different PCI slot.
- I also noticed that I had 'irqpoll' in my kernel line in grub; I tried taking that out but that did not affect this problem. (The system works fine without it, though, so I left it out -- I think it was needed to make it boot many kernels ago.)
Last edited by thetrivialstuff (2011-08-24 20:39:40)
This kind of problem is really difficult to diagnose, but the first place I'd head is:
cat /proc/interrupts
and verify that the Promise card isn't sharing an interrupt with anything that has a lot of activity (graphics, the USB port that you're using for the backups, etc).
From your log snippet, it looks like the ata4 drive is using UDMA rather than PIO, so I'm guessing the transfer mode isn't the problem.
But like I said, these are difficult. Here's a guy that claims to have fixed this type of problem by changing the power supply (go figure):
> cat /proc/interrupts
> and verify that the Promise card isn't sharing an interrupt with anything that has a lot of activity (graphics, the USB port that you're using for the backups, etc).
Thanks; I think that might be it!
Man, there are so many useful files to learn about in /proc and /sys...
What I've got is:
10: 1382069 XT-PIC-XT-PIC ehci_hcd:usb1, pata_pdc202xx_old
and lspci -k reveals that pata_pdc202xx_old is indeed the module that handles the Promise controller. There's only one ehci USB device on there, and it's (of course) that USB 2 card, so yes, it appears that those two are sharing an interrupt.
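The same check can be scripted for a whole /proc/interrupts listing. Below is a minimal Python sketch under a simplifying assumption: a shared IRQ is recognised by a comma-separated list of handler names at the end of its line, as in the IRQ 10 line above. The function name `shared_irqs` and the `SAMPLE` text are hypothetical; only the IRQ 10 row comes from this thread, and the header and timer rows are invented filler to show that unshared lines are skipped.

```python
def shared_irqs(text):
    """Return {irq_number: [handler, ...]} for IRQ lines listing more than one handler."""
    shared = {}
    for line in text.splitlines():
        # Split on the first colon only; handler names such as
        # "ehci_hcd:usb1" contain colons of their own.
        irq, sep, rest = line.strip().partition(":")
        if not sep or not irq.strip().isdigit():
            continue  # header row or non-numeric rows such as NMI/ERR
        parts = [p.strip() for p in rest.split(",")]
        if len(parts) < 2:
            continue  # a single handler, so nothing is shared
        head = parts[0].split()
        if not head:
            continue
        # parts[0] still carries the counter and chip name in front;
        # the first handler is its last whitespace-separated token.
        shared[int(irq)] = [head[-1]] + parts[1:]
    return shared


# Only the IRQ 10 row is real (from this thread); the rest is made-up filler.
SAMPLE = """\
           CPU0
  0:     123456    XT-PIC-XT-PIC    timer
 10:    1382069    XT-PIC-XT-PIC    ehci_hcd:usb1, pata_pdc202xx_old
"""

print(shared_irqs(SAMPLE))  # {10: ['ehci_hcd:usb1', 'pata_pdc202xx_old']}
```

On a live system the input would be the contents of /proc/interrupts, for example `shared_irqs(open("/proc/interrupts").read())`. Sharing by itself is not necessarily a fault; the result is just a list of candidates to compare against the devices that are busy when the resets happen.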
What's the best way of fixing this -- is there something I should do to Linux's config, or is it best to physically move the card / fiddle with the BIOS until it moves IRQs? If all else fails, I can move the hard drives off the Promise controller and onto the other on-board IDE, but I think on this particular machine, the on-board ones are slower (UDMA 33 or so).
~Felix.
OK; I had to move the card to a different slot because the BIOS couldn't force the IRQ for the one it was in -- but it's off IRQ 10 now, and I just tried a few large file copies and no more lost interrupts; it's fixed.
Thanks,
~Felix.