NFS transfers causing hard lockups

matt_garman · 2010-03-12 23:52:23

I have to machines, sharing files between them via NFS. The server is running CentOS 5. The client is running Arch Linux.

When I copy a file from the client (Arch) to the server (CentOS), the client sometimes locks up. I'd say it locks up roughly 20% of the time. And the freeze/lockup is what I call a "hard" lock-up, meaning nothing can be done with the keyboard (e.g. even num lock fails).

Anyone have any ideas what might be causing this? And, more importantly, how to fix it?

Thanks!
Matt

matt_garman · 2010-03-15 13:42:07

Some more info, in case this rings any bells: I enabled the magic sysrq key, managed to replicate this behavior, and noted that the machine still responds to Alt-SysRq-B (reboot) when in the "hard" locked state. If that means anything.

I was also able to capture dmesg on both systems. On the client (Arch) system, there are lots of "task ... blocked for more than 120 seconds" messages. See dmesg snippet below.

Likewise, I captured some log messages on the server side. Since the server is also getting errors, the root of the problem is likely on the server (centos) side. Still, the server has been unchanged for months; only the relatively new Arch client machine triggers these problems. At any rate, I also posted some of the server error messages below.

Thanks again,
Matt

Snippet of dmesg from client (arch) system. The "task blocked" messages appear for just about every running process on the system. For brevity, I only pasted a few. Looks like a lot of references to NFS.

INFO: task sync:17908 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync          D 0000000000000002     0 17908   2109 0x00000000                   
 ffff88007a3ea820 0000000000000086 ffffffff814e4f40 000000000000000a             
 0000000000000c60 ffff880079875e18 ffff880079963c30 ffff880079874000
 ffff88007a3eaac8 000000000000f908 ffff88007a3eaac8 ffff880079874000
Call Trace:
 [<ffffffff811365d0>] ? bdi_sched_wait+0x0/0x10
 [<ffffffff811365d9>] ? bdi_sched_wait+0x9/0x10
 [<ffffffff811365d0>] ? bdi_sched_wait+0x0/0x10
 [<ffffffff8133806f>] ? __wait_on_bit+0x4f/0x80                                  
 [<ffffffff811365d0>] ? bdi_sched_wait+0x0/0x10
 [<ffffffff81338117>] ? out_of_line_wait_on_bit+0x77/0x90
 [<ffffffff81075fd0>] ? wake_bit_function+0x0/0x40
 [<ffffffff811365bb>] ? bdi_sync_writeback+0x6b/0x80
 [<ffffffff811365fc>] ? sync_inodes_sb+0x1c/0x100
 [<ffffffff8113b1d2>] ? __sync_filesystem+0x72/0x90
 [<ffffffff8113b2bd>] ? sync_filesystems+0xcd/0x120
 [<ffffffff8113b35c>] ? sys_sync+0x1c/0x40
 [<ffffffff810120c2>] ? system_call_fastpath+0x16/0x1b
usb 2-3: USB disconnect, address 5
[drm] Resetting GPU                                                             
INFO: task kswapd0:25 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D 0000000000000002     0    25      2 0x00000000
 ffff88007b8ca170 0000000000000046 0000000000000000 000000000000cf17
 0000000000000000 ffffffff814a0bc0 ffff880079962820 ffff88007afc2000
 ffff88007b8ca418 000000000000f908 ffff88007b8ca418 ffff88007afc2000
Call Trace:                                                                      
 [<ffffffff810b0c52>] ? delayacct_end+0x82/0xa0
 [<ffffffff81019915>] ? read_tsc+0x5/0x20
 [<ffffffffa0543760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]              
 [<ffffffff8133790b>] ? io_schedule+0x6b/0xb0
 [<ffffffffa0543769>] ? nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
 [<ffffffffa0543760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff8133806f>] ? __wait_on_bit+0x4f/0x80
 [<ffffffffa0543760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff81338117>] ? out_of_line_wait_on_bit+0x77/0x90
 [<ffffffff81075fd0>] ? wake_bit_function+0x0/0x40
 [<ffffffffa0548c07>] ? nfs_sync_mapping_wait+0x127/0x280 [nfs]
 [<ffffffffa05493c9>] ? nfs_wb_page+0x79/0xd0 [nfs]
 [<ffffffff810d5874>] ? __remove_from_page_cache+0x34/0xf0
 [<ffffffffa0537a1f>] ? nfs_release_page+0x6f/0x90 [nfs]
 [<ffffffff810e0571>] ? shrink_page_list+0x431/0x5e0
 [<ffffffff810e0aab>] ? shrink_inactive_list+0x38b/0x7e0
 [<ffffffff81012a4e>] ? common_interrupt+0xe/0x13
 [<ffffffff810e1215>] ? shrink_zone+0x265/0x370
 [<ffffffff810e1b19>] ? balance_pgdat+0x639/0x690
 [<ffffffff810dedb0>] ? isolate_pages_global+0x0/0x240
 [<ffffffff810e1b70>] ? kswapd+0x0/0x120
 [<ffffffff810e1c4d>] ? kswapd+0xdd/0x120
 [<ffffffff81075fa0>] ? autoremove_wake_function+0x0/0x30
 [<ffffffff810e1b70>] ? kswapd+0x0/0x120
 [<ffffffff810e1b70>] ? kswapd+0x0/0x120
 [<ffffffff81075bde>] ? kthread+0x8e/0xa0
 [<ffffffff8101315a>] ? child_rip+0xa/0x20
 [<ffffffff81075b50>] ? kthread+0x0/0xa0
 [<ffffffff81013150>] ? child_rip+0x0/0x20
INFO: task firefox:2263 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
firefox       D 0000000000000000     0  2263      1 0x00000000
 ffff88007af7d040 0000000000000086 0000000000000000 0000000000000002
 0000000000000000 0000000000000030 ffffffff8149e880 ffff880024992000
 ffff88007af7d2e8 000000000000f908 ffff88007af7d2e8 ffff880024992000
Call Trace:
 [<ffffffff81019915>] ? read_tsc+0x5/0x20
 [<ffffffffa0543760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff8133790b>] ? io_schedule+0x6b/0xb0
 [<ffffffffa0543769>] ? nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
 [<ffffffffa0543760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff8133806f>] ? __wait_on_bit+0x4f/0x80
 [<ffffffff8103d4e7>] ? native_flush_tlb_others+0x107/0x130
 [<ffffffffa0543760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff81338117>] ? out_of_line_wait_on_bit+0x77/0x90
 [<ffffffff81075fd0>] ? wake_bit_function+0x0/0x40
 [<ffffffffa0548c07>] ? nfs_sync_mapping_wait+0x127/0x280 [nfs]
 [<ffffffffa05493c9>] ? nfs_wb_page+0x79/0xd0 [nfs]
 [<ffffffff810d5874>] ? __remove_from_page_cache+0x34/0xf0
 [<ffffffffa0537a1f>] ? nfs_release_page+0x6f/0x90 [nfs]
 [<ffffffff810e0571>] ? shrink_page_list+0x431/0x5e0
 [<ffffffff81337140>] ? thread_return+0x4e/0x7ae
 [<ffffffff810e0aab>] ? shrink_inactive_list+0x38b/0x7e0
 [<ffffffff8104cae2>] ? finish_task_switch+0x42/0xc0
 [<ffffffff81012c1e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff811bf803>] ? cpumask_next_and+0x23/0x40
 [<ffffffff810e1215>] ? shrink_zone+0x265/0x370
 [<ffffffff81019915>] ? read_tsc+0x5/0x20
 [<ffffffff810e2207>] ? try_to_free_pages+0x267/0x3e0
 [<ffffffff810dedb0>] ? isolate_pages_global+0x0/0x240
 [<ffffffff810dab66>] ? __alloc_pages_nodemask+0x446/0x700
 [<ffffffff81079d1a>] ? hrtimer_try_to_cancel+0x7a/0x100
 [<ffffffff810dae32>] ? __get_free_pages+0x12/0x60
 [<ffffffff811274cf>] ? __pollwait+0xbf/0x120
 [<ffffffff8114a9b3>] ? inotify_poll+0x23/0x50
 [<ffffffff81126145>] ? do_sys_poll+0x225/0x4b0
 [<ffffffff81127410>] ? __pollwait+0x0/0x120
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff81127530>] ? pollwake+0x0/0x60
 [<ffffffff8104cae2>] ? finish_task_switch+0x42/0xc0
 [<ffffffff81019915>] ? read_tsc+0x5/0x20
 [<ffffffff8107fe71>] ? ktime_get_ts+0x61/0xd0
 [<ffffffff81125df4>] ? poll_select_set_timeout+0x74/0xa0
 [<ffffffff811265d1>] ? sys_poll+0x71/0x110
 [<ffffffff810120c2>] ? system_call_fastpath+0x16/0x1b

Snippet of logs from server (centos) system.

nfsd: non-standard errno: -117
00000000: ff 4d 52 12 ce 7e b2 c3 c5 79 d0 29 2e 5b 76 38  ÿMR.Î~²ÃÅyÐ).[v8
Filesystem "md1": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffff88350826

Call Trace:
 [<ffffffff88350725>] :xfs:xfs_da_do_buf+0x503/0x5b1
 [<ffffffff88350826>] :xfs:xfs_da_read_buf+0x16/0x1b
 [<ffffffff88350826>] :xfs:xfs_da_read_buf+0x16/0x1b
 [<ffffffff88355b04>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
 [<ffffffff88355b04>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec                    
 [<ffffffff8837ed84>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff8837ed84>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff88352860>] :xfs:xfs_readdir+0xa7/0xb6
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8837f419>] :xfs:xfs_file_readdir+0xff/0x14c
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8003526d>] vfs_readdir+0x77/0xa9                                      
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff885b6ea0>] :nfsd:nfsd_readdir+0x6d/0xc5
 [<ffffffff885be122>] :nfsd:nfsd3_proc_readdirplus+0xf8/0x220
 [<ffffffff885b31db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
 [<ffffffff884c3529>] :sunrpc:svc_process+0x454/0x71b
 [<ffffffff80064614>] __down_read+0x12/0x92
 [<ffffffff885b35a1>] :nfsd:nfsd+0x0/0x2cb
 [<ffffffff885b3746>] :nfsd:nfsd+0x1a5/0x2cb                                     
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff885b35a1>] :nfsd:nfsd+0x0/0x2cb
 [<ffffffff885b35a1>] :nfsd:nfsd+0x0/0x2cb
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

nfsd: non-standard errno: -117
00000000: ff 4d 52 12 ce 7e b2 c3 c5 79 d0 29 2e 5b 76 38  ÿMR.Î~²ÃÅyÐ).[v8
Filesystem "md1": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffff88350826

Call Trace:
 [<ffffffff88350725>] :xfs:xfs_da_do_buf+0x503/0x5b1
 [<ffffffff88350826>] :xfs:xfs_da_read_buf+0x16/0x1b
 [<ffffffff88350826>] :xfs:xfs_da_read_buf+0x16/0x1b
 [<ffffffff88355b04>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
 [<ffffffff88355b04>] :xfs:xfs_dir2_leaf_getdents+0x354/0x5ec
 [<ffffffff8837ed84>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff8837ed84>] :xfs:xfs_hack_filldir+0x0/0x5b
 [<ffffffff88352860>] :xfs:xfs_readdir+0xa7/0xb6
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8837f419>] :xfs:xfs_file_readdir+0xff/0x14c
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff8003526d>] vfs_readdir+0x77/0xa9
 [<ffffffff885bf29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
 [<ffffffff885b6ea0>] :nfsd:nfsd_readdir+0x6d/0xc5
 [<ffffffff885be122>] :nfsd:nfsd3_proc_readdirplus+0xf8/0x220
 [<ffffffff885b31db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
 [<ffffffff884c3529>] :sunrpc:svc_process+0x454/0x71b
 [<ffffffff80064614>] __down_read+0x12/0x92
 [<ffffffff885b35a1>] :nfsd:nfsd+0x0/0x2cb
 [<ffffffff885b3746>] :nfsd:nfsd+0x1a5/0x2cb
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff885b35a1>] :nfsd:nfsd+0x0/0x2cb
 [<ffffffff885b35a1>] :nfsd:nfsd+0x0/0x2cb
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

nfsd: non-standard errno: -117
00000000: ff 4d 52 12 ce 7e b2 c3 c5 79 d0 29 2e 5b 76 38  ÿMR.Î~²ÃÅyÐ).[v8
Filesystem "md1": XFS internal error xfs_da_do_buf(2) at line 2112 of file fs/xfs/xfs_da_btree.c.  Caller 0xffffffff88350826

Ultraman · 2010-03-15 14:01:05

Interesting, I am experiencing problems with NFS as well, at http://bbs.archlinux.org/viewtopic.php?id=93133

I'll check my server log files as well...
[Edit]
Nothing like this in there or anything I can find that looks related to NFS locking up.
I did find something interesting, but it's at a wrong time to be releated to my NFS lockups:

Mar 14 15:02:11 alpha kernel: swapper: page allocation failure. order:0, mode:0x4020
Mar 14 15:02:11 alpha kernel: Pid: 0, comm: swapper Not tainted 2.6.32-ARCH #1
...

But that seems to relate to my rtl8169 Gbit NIC, and when I google this other drivers seem to have this issue as well. Failing to allocate a big block and reverting to atomic operations.
I'll see if this stuff reproduces when I put some heavy load on it...

Last edited by Ultraman (2010-03-15 14:13:58)

matt_garman · 2010-03-18 16:55:37

Just now I replicated this problem trying to move a file from the client machine to the server.

The server (centos) has no error messages in the logs.

On the client (arch) side, I have lots of the "INFO: task ... blocked for more than 120 seconds." messages in dmesg.

However, the client machine is not completely locked up. I can still login remotely via ssh. For the most part, everything seems normal, except that I cannot kill the mv process (even with kill -9). Likewise, I tried to ls the target directory for the file move. The ls process locks up, and also cannot be killed.

Interestingly, other directories on the same NFS mount can be seen with ls. There is even a directory with symbol links to the "lock up" directory that can be listed.

I killed and re-started the NFS daemon on the server. That didn't cause anything interesting to happen.

So basically... bump

nlights · 2010-03-22 09:52:20

Hi,
I have the same issue on one of my clients, including the hanging "ls" symptoms
Server logs no errors at all.

What is interesting/peculiar is that the machine used to work fine until a "pacman -Suy" sometime during late february/early march on server andclient.
First I believed it be a kernel bug and tried downgraded/upgrading the kernel to a number of different versions, still the same problem.

I found some references that nfs-utils started to default to nfsv4 with the 1.2.2- package and forced nfsv3 on the client but it didn't change the situation.
Upgraded/downgraded nfs-utils package as well, but with no effect.

There are some references on the nfsv4 mailing-list regarding a similar issue, but it may not be related at all.
http://linux-nfs.org/pipermail/nfsv4/20 … 12069.html

I was starting to ponder downgrading the server, but on the other hand you seem to have the same issue without touching the server side so maybe you saved me some time on that one

Any news on the issue?

Last edited by nlights (2010-03-22 10:27:13)

.:B:. · 2010-03-22 11:34:11

I just set up NFS v3 again a few days ago since Xbmc cannot handle UPnP (it cannot index UPnP shares for its library it seems). I haven't used NFS v3 for over a year, so my knowledge about it was a bit rusty. I kept no config files of earlier NFS setups.

I have no problems here, streaming 720p (and soon going to test 1080p) over my 100 Mbps LAN. I have followed directions for setting up NFS v3, it seems v4 is a tad more complicated, so I'm not gonna try that yet . I want this stuff working.

nlights · 2010-03-22 23:14:04

Update:
The problem doesn't seem to be related to nfs versions either...
I did tests with configurations for both nfsv3 and nfsv4 today and the symptoms were the same with both versions.

I forgot to mention earlier that my client is virtual (qemu-kvm), maybe it is clock related somehow...

INFO: task kswapd0:21 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0       D 0000000000000002     0    21      2 0x00000000
 ffff88003f981760 0000000000000046 ffff88003f9816b0 ffff88003f980000
 ffff88003f981fd8 ffffffffb45872d0 0000000030f2ae46 ffffffff814da000
 ffff88003f981700 ffff88003f981fd8 ffff88003f980000 ffff88003f980000
Call Trace:
 [<ffffffff8107d539>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffffa0393290>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff8135970e>] io_schedule+0x6e/0xb0
 [<ffffffffa0393299>] nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
 [<ffffffff81359e77>] __wait_on_bit+0x57/0x80
 [<ffffffffa0393290>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
 [<ffffffff81359f13>] out_of_line_wait_on_bit+0x73/0x90
 [<ffffffff81073090>] ? wake_bit_function+0x0/0x40
 [<ffffffffa0393287>] nfs_wait_on_request+0x27/0x30 [nfs]
 [<ffffffffa03983d0>] nfs_sync_mapping_wait+0x120/0x280 [nfs]
 [<ffffffffa0398ba9>] nfs_wb_page+0x79/0xd0 [nfs]
 [<ffffffffa03878bf>] nfs_release_page+0x5f/0x80 [nfs]
 [<ffffffff810dabfb>] try_to_release_page+0x2b/0x50
 [<ffffffff810e8af3>] shrink_page_list+0x453/0x5f0
 [<ffffffff810e8f59>] shrink_inactive_list+0x2c9/0x770
 [<ffffffff810e3e45>] ? determine_dirtyable_memory+0x15/0x30
 [<ffffffff810e3ee2>] ? get_dirty_limits+0x22/0x2e0
 [<ffffffff81049057>] ? update_curr+0xe7/0x1a0
 [<ffffffff810e9671>] shrink_zone+0x271/0x460
 [<ffffffff810ea119>] balance_pgdat+0x6b9/0x720
 [<ffffffff810e7df0>] ? isolate_pages_global+0x0/0x270
 [<ffffffff810ea279>] kswapd+0xf9/0x220
 [<ffffffff81073050>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8135c0a2>] ? _raw_spin_unlock_irqrestore+0x32/0x40
 [<ffffffff810ea180>] ? kswapd+0x0/0x220
 [<ffffffff81072aee>] kthread+0x8e/0xa0
 [<ffffffff8100ae24>] kernel_thread_helper+0x4/0x10
 [<ffffffff81072a60>] ? kthread+0x0/0xa0
 [<ffffffff8100ae20>] ? kernel_thread_helper+0x0/0x10

AndyRTR · 2010-03-23 05:59:48

https://bugzilla.kernel.org/show_bug.cgi?id=15552 - it seems we are affected by 2 nfs client problems currently. both can be patched.

nlights · 2010-03-23 15:00:08

Update ( Workaround ):
Downgraded client kernel26 and kernel26-firmware to 2.6.31.6-1 and the machine has been stable for quite some time now even under heavy nfs utilization.

Arch Linux

#1 2010-03-12 23:52:23

NFS transfers causing hard lockups

#2 2010-03-15 13:42:07

Re: NFS transfers causing hard lockups

#3 2010-03-15 14:01:05

Re: NFS transfers causing hard lockups

#4 2010-03-18 16:55:37

Re: NFS transfers causing hard lockups

#5 2010-03-22 09:52:20

Re: NFS transfers causing hard lockups

#6 2010-03-22 11:34:11

Re: NFS transfers causing hard lockups

#7 2010-03-22 23:14:04

Re: NFS transfers causing hard lockups

#8 2010-03-23 05:59:48

Re: NFS transfers causing hard lockups

#9 2010-03-23 15:00:08

Re: NFS transfers causing hard lockups

Board footer