#1 2010-03-14 17:09:35

Ultraman
Member
Registered: 2009-12-24
Posts: 242

[NFS] Lockup when copying several GBs

I've been experiencing problems with NFSv4.
I have a home server on which I installed NFS for accessing files from my main rig.
The /etc/exports file on my server:

/export            192.168.0.2(fsid=0,insecure,no_subtree_check,async)
/export/share        192.168.0.2(rw,nohide,insecure,no_subtree_check,async)
/export/share/incoming    192.168.0.2(rw,nohide,insecure,no_subtree_check,async)

share and share/incoming are on different partitions (and are therefore different filesystems).

While copying large amounts of data (dozens of GBs) from my PC to the server over a Gbit Ethernet connection, the copy operation locks up. I see the load average climb above 5.00 and my PC becomes very unresponsive. It feels like some code enters a continuous loop.
I'm also unable to kill the processes by sending them SIGTERM or SIGKILL; they just keep running.
I've been able to reproduce the lockups using Thunar and mc inside X, and also with mc in a console right after booting (no X loaded, to ensure that the Thunar daemon, hal or gamin are not interfering). So it even happens in console-only mode!
My server does not lock up during all of this.
Both systems run Arch Linux and are up to date.

If I try to copy the same datasets using FTP, there is no problem.
I watch videos over the NFS share as well (server to client traffic), this works fine. But this kind of use creates only little traffic.

This is the first time I've used NFS at home, so I'm not experienced configuring it, perhaps I made a mistake somewhere?
I've looked at the NFS and NFSv4 wiki articles and followed them. My PC also syncs its time using NTP, and my server acts as the NTP server, as the wiki recommends.

Does anybody have any pointers on what to do about this or where I should look?

If any more information is needed to diagnose the problem, feel free to ask. I will try to provide as much as I can.

Last edited by Ultraman (2010-03-14 23:05:08)

#2 2010-03-14 22:20:51

sand_man
Member
From: Australia
Registered: 2008-06-10
Posts: 2,164

Re: [NFS] Lockup when copying several GBs

Be VERY careful. This happened to me a while ago. I spent hours trying to work out what was wrong, then finally my hard drive on the server died. CPU usage was normal but the load average kept rising and rising.


#3 2010-03-14 23:04:09

Ultraman
Member
Registered: 2009-12-24
Posts: 242

Re: [NFS] Lockup when copying several GBs

sand_man wrote:

Be VERY careful. This happened to me a while ago. I spent hours trying to work out what was wrong, then finally my hard drive on the server died. CPU usage was normal but the load average kept rising and rising.

Thanks for the input.
I'm copying data from my PC to a 3-disk mdadm RAID5 array (/srv/share). So if one drive is failing, that should not be a problem.
I had mdadm check the array, and it checks out fine. SMART status is also OK, with the short tests done. The extended tests could take a while, but I will run them to be sure.
When I do the same copy operations using FTP, my speeds are just as high (~75MB/s) but it is also perfectly stable.

Thinking of switching to NFSv3, just to try it. Will look into that tomorrow...

Last edited by Ultraman (2010-03-14 23:04:39)

#4 2010-03-14 23:34:59

sand_man
Member
From: Australia
Registered: 2008-06-10
Posts: 2,164

Re: [NFS] Lockup when copying several GBs

Ultraman wrote:
sand_man wrote:

Be VERY careful. This happened to me a while ago. I spent hours trying to work out what was wrong, then finally my hard drive on the server died. CPU usage was normal but the load average kept rising and rising.

Thanks for the input.
I'm copying data from my PC to a 3-disk mdadm RAID5 array (/srv/share). So if one drive is failing, that should not be a problem.
I had mdadm check the array, and it checks out fine. SMART status is also OK, with the short tests done. The extended tests could take a while, but I will run them to be sure.
When I do the same copy operations using FTP, my speeds are just as high (~75MB/s) but it is also perfectly stable.

Thinking of switching to NFSv3, just to try it. Will look into that tomorrow...

Ok good to know. I haven't used NFS4 yet. Sounds like there are still teething problems.


#5 2010-03-16 21:01:09

Ultraman
Member
Registered: 2009-12-24
Posts: 242

Re: [NFS] Lockup when copying several GBs

I couldn't get NFS3 working properly, mount.nfs: Argument list too long kept popping up...

I fiddled around with NFS4 some more and changed some configuration options. I even tried sync mode (SLOW, so back to async).
I did a copy job just now, and it ran fine.
Perhaps I solved it...

Relevant: http://episteme.arstechnica.com/eve/for … 1005224931, I didn't know about "crossmnt" yet.

Last edited by Ultraman (2010-03-16 21:02:18)

#6 2010-03-16 21:57:12

Ultraman
Member
Registered: 2009-12-24
Posts: 242

Re: [NFS] Lockup when copying several GBs

Beh. Crashed again after a while.

But I did see the following in my everything.log.

Mar 16 22:44:40 bacchus kernel: INFO: task kswapd0:61 blocked for more than 120 seconds.
Mar 16 22:44:40 bacchus kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 22:44:40 bacchus kernel: kswapd0       D 0000000000000002     0    61      2 0x00000000
Mar 16 22:44:40 bacchus kernel: ffff88013fab42e0 0000000000000046 0000000000000000 ffffffff810686d2
Mar 16 22:44:40 bacchus kernel: ffff880091378f08 0000000000000286 ffff88013f86d040 ffff88013fb8e000
Mar 16 22:44:40 bacchus kernel: ffff88013fab4588 000000000000f908 ffff88013fab4588 ffff88013fb8e000
Mar 16 22:44:40 bacchus kernel: Call Trace:
Mar 16 22:44:40 bacchus kernel: [<ffffffff810686d2>] ? mod_timer+0x132/0x290
Mar 16 22:44:40 bacchus kernel: [<ffffffff810b0c52>] ? delayacct_end+0x82/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81019915>] ? read_tsc+0x5/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133790b>] ? io_schedule+0x6b/0xb0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7769>] ? nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133806f>] ? __wait_on_bit+0x4f/0x80
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff81338117>] ? out_of_line_wait_on_bit+0x77/0x90
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075fd0>] ? wake_bit_function+0x0/0x40
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03acc07>] ? nfs_sync_mapping_wait+0x127/0x280 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad3c9>] ? nfs_wb_page+0x79/0xd0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d5874>] ? __remove_from_page_cache+0x34/0xf0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa039ba1f>] ? nfs_release_page+0x6f/0x90 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0571>] ? shrink_page_list+0x431/0x5e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81337140>] ? thread_return+0x4e/0x7ae
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0aab>] ? shrink_inactive_list+0x38b/0x7e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff8104cae2>] ? finish_task_switch+0x42/0xc0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1215>] ? shrink_zone+0x265/0x370
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1b19>] ? balance_pgdat+0x639/0x690
Mar 16 22:44:40 bacchus kernel: [<ffffffff810dedb0>] ? isolate_pages_global+0x0/0x240
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1b70>] ? kswapd+0x0/0x120
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1c4d>] ? kswapd+0xdd/0x120
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075fa0>] ? autoremove_wake_function+0x0/0x30
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1b70>] ? kswapd+0x0/0x120
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075bde>] ? kthread+0x8e/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff8101315a>] ? child_rip+0xa/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075b50>] ? kthread+0x0/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81013150>] ? child_rip+0x0/0x20
Mar 16 22:44:40 bacchus kernel: INFO: task thunar:21872 blocked for more than 120 seconds.
Mar 16 22:44:40 bacchus kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 22:44:40 bacchus kernel: thunar        D 0000000000000002     0 21872   4196 0x00000004
Mar 16 22:44:40 bacchus kernel: ffff8800bcbb9410 0000000000000082 ffffffff81012a4e 000000000085ca12
Mar 16 22:44:40 bacchus kernel: ffff88005ecbb008 0000000000000001 ffff88013e809410 ffff8801199a8000
Mar 16 22:44:40 bacchus kernel: ffff8800bcbb96b8 000000000000f908 ffff8800bcbb96b8 ffff8801199a8000
Mar 16 22:44:40 bacchus kernel: Call Trace:
Mar 16 22:44:40 bacchus kernel: [<ffffffff81012a4e>] ? common_interrupt+0xe/0x13
Mar 16 22:44:40 bacchus kernel: [<ffffffff810b0c52>] ? delayacct_end+0x82/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81019915>] ? read_tsc+0x5/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133790b>] ? io_schedule+0x6b/0xb0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7769>] ? nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133806f>] ? __wait_on_bit+0x4f/0x80
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff81338117>] ? out_of_line_wait_on_bit+0x77/0x90
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075fd0>] ? wake_bit_function+0x0/0x40
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03acc07>] ? nfs_sync_mapping_wait+0x127/0x280 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad3c9>] ? nfs_wb_page+0x79/0xd0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d5874>] ? __remove_from_page_cache+0x34/0xf0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa039ba1f>] ? nfs_release_page+0x6f/0x90 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0571>] ? shrink_page_list+0x431/0x5e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0cc9>] ? shrink_inactive_list+0x5a9/0x7e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0aab>] ? shrink_inactive_list+0x38b/0x7e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1215>] ? shrink_zone+0x265/0x370
Mar 16 22:44:40 bacchus kernel: [<ffffffff81019915>] ? read_tsc+0x5/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e2207>] ? try_to_free_pages+0x267/0x3e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810dedb0>] ? isolate_pages_global+0x0/0x240
Mar 16 22:44:40 bacchus kernel: [<ffffffff810dab66>] ? __alloc_pages_nodemask+0x446/0x700
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d42f6>] ? grab_cache_page_write_begin+0x96/0xe0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad6f1>] ? nfs_updatepage+0x2d1/0x5a0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa039be70>] ? nfs_write_begin+0x70/0x220 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d4ed3>] ? generic_file_buffered_write+0x133/0x300
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d548f>] ? __generic_file_aio_write+0x23f/0x460
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133861d>] ? __mutex_lock_slowpath+0x23d/0x330
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d571d>] ? generic_file_aio_write+0x6d/0xe0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa039ce52>] ? nfs_file_write+0x102/0x250 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff811147d2>] ? do_sync_write+0xe2/0x120
Mar 16 22:44:40 bacchus kernel: [<ffffffff811c4c24>] ? rb_insert_color+0x94/0x140
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075fa0>] ? autoremove_wake_function+0x0/0x30
Mar 16 22:44:40 bacchus kernel: [<ffffffff8104cae2>] ? finish_task_switch+0x42/0xc0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81337140>] ? thread_return+0x4e/0x7ae
Mar 16 22:44:40 bacchus kernel: [<ffffffff811153d8>] ? vfs_write+0xb8/0x1a0
Mar 16 22:44:40 bacchus kernel: [<ffffffff811155ae>] ? sys_write+0x4e/0x90
Mar 16 22:44:40 bacchus kernel: [<ffffffff810120c2>] ? system_call_fastpath+0x16/0x1b
Mar 16 22:44:40 bacchus kernel: INFO: task flush-0:16:21875 blocked for more than 120 seconds.
Mar 16 22:44:40 bacchus kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 22:44:40 bacchus kernel: flush-0:16    D 0000000000000002     0 21875      2 0x00000000
Mar 16 22:44:40 bacchus kernel: ffff8800bc877860 0000000000000046 0000000000000000 ffffffff81337140
Mar 16 22:44:40 bacchus kernel: 0000000000000108 0000000000000202 ffff88013f86b580 ffff8801246d6000
Mar 16 22:44:40 bacchus kernel: ffff8800bc877b08 000000000000f908 ffff8800bc877b08 ffff8801246d6000
Mar 16 22:44:40 bacchus kernel: Call Trace:
Mar 16 22:44:40 bacchus kernel: [<ffffffff81337140>] ? thread_return+0x4e/0x7ae
Mar 16 22:44:40 bacchus kernel: [<ffffffff810b0c52>] ? delayacct_end+0x82/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81019915>] ? read_tsc+0x5/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133790b>] ? io_schedule+0x6b/0xb0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7769>] ? nfs_wait_bit_uninterruptible+0x9/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8133806f>] ? __wait_on_bit+0x4f/0x80
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a7760>] ? nfs_wait_bit_uninterruptible+0x0/0x10 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff81338117>] ? out_of_line_wait_on_bit+0x77/0x90
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075fd0>] ? wake_bit_function+0x0/0x40
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03acc07>] ? nfs_sync_mapping_wait+0x127/0x280 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad3c9>] ? nfs_wb_page+0x79/0xd0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa039ba1f>] ? nfs_release_page+0x6f/0x90 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0571>] ? shrink_page_list+0x431/0x5e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810dee61>] ? isolate_pages_global+0xb1/0x240
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e0aab>] ? shrink_inactive_list+0x38b/0x7e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810d8ac6>] ? __rmqueue+0x286/0x480
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e1215>] ? shrink_zone+0x265/0x370
Mar 16 22:44:40 bacchus kernel: [<ffffffff81019915>] ? read_tsc+0x5/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffff810e2207>] ? try_to_free_pages+0x267/0x3e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810dedb0>] ? isolate_pages_global+0x0/0x240
Mar 16 22:44:40 bacchus kernel: [<ffffffff810dab66>] ? __alloc_pages_nodemask+0x446/0x700
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad079>] ? nfs_writedata_alloc+0x69/0xb0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff8110b4c3>] ? __slab_alloc+0x1d3/0x810
Mar 16 22:44:40 bacchus kernel: [<ffffffff8110cb36>] ? __kmalloc+0x236/0x250
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad079>] ? nfs_writedata_alloc+0x69/0xb0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff811c3f84>] ? radix_tree_gang_lookup_tag_slot+0x94/0xf0
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad0d4>] ? nfs_flush_one+0x14/0xe0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a758a>] ? nfs_pageio_doio+0x2a/0x70 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03a761b>] ? nfs_pageio_add_request+0x4b/0xf0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ababf>] ? nfs_do_writepage+0x13f/0x1a0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ac196>] ? nfs_writepages_callback+0x16/0x40 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff810db971>] ? write_cache_pages+0x1e1/0x440
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ac180>] ? nfs_writepages_callback+0x0/0x40 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ac0f0>] ? nfs_writepages+0xe0/0x170 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffffa03ad0c0>] ? nfs_flush_one+0x0/0xe0 [nfs]
Mar 16 22:44:40 bacchus kernel: [<ffffffff81136063>] ? writeback_single_inode+0xf3/0x3c0
Mar 16 22:44:40 bacchus kernel: [<ffffffff811370c3>] ? writeback_inodes_wb+0x403/0x5d0
Mar 16 22:44:40 bacchus kernel: [<ffffffff8113739d>] ? wb_writeback+0x10d/0x1e0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81137631>] ? wb_do_writeback+0xe1/0x1f0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81067ea0>] ? process_timeout+0x0/0x10
Mar 16 22:44:40 bacchus kernel: [<ffffffff81137783>] ? bdi_writeback_task+0x43/0xd0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810ed120>] ? bdi_start_fn+0x0/0xf0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810ed19e>] ? bdi_start_fn+0x7e/0xf0
Mar 16 22:44:40 bacchus kernel: [<ffffffff810ed120>] ? bdi_start_fn+0x0/0xf0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075bde>] ? kthread+0x8e/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff8101315a>] ? child_rip+0xa/0x20
Mar 16 22:44:40 bacchus kernel: [<ffffffff81075b50>] ? kthread+0x0/0xa0
Mar 16 22:44:40 bacchus kernel: [<ffffffff81013150>] ? child_rip+0x0/0x20

Seems kswapd gets angry for some reason. And after that... everything that wants to be friends with it gets punched in the face, who in turn punch others in the face. Which makes my browser and lots of other stuff halt.
I cannot use the "sync" command in a terminal after this, of course. Reboot or shutdown also doesn't want to do anything, because they are waiting for the punching to stop...
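
For anyone who wants to see which tasks are stuck: the D in the trace headers above is the uninterruptible-sleep state, and on Linux tasks in that state count toward the load average, which would fit a rising load with little CPU use. Below is a minimal sketch for listing such tasks, assuming a Linux /proc layout. The function names parse_stat and blocked_tasks are made up for illustration and are not part of any tool mentioned in this thread.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    try:
        entries = os.listdir(proc)
    except OSError:
        return []  # no procfs at this path
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling blocked_tasks() during a hang should list the copy process next to kernel threads such as kswapd0 if they are blocked the way the log suggests. Note that this only shows which tasks are blocked, not why.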

Anybody have an idea what's going on here?

#7 2010-03-16 22:18:33

sand_man
Member
From: Australia
Registered: 2008-06-10
Posts: 2,164

Re: [NFS] Lockup when copying several GBs

Did you try NFSv3?


#8 2010-03-16 23:54:50

Ultraman
Member
Registered: 2009-12-24
Posts: 242

Re: [NFS] Lockup when copying several GBs

sand_man wrote:

Did you try NFSv3?

Yup, see my post before the above one, I had problems with it.

I couldn't get NFS3 working properly, mount.nfs: Argument list too long kept popping up...

So I'll have to sort that out if I can.
Have you ever gotten that error?

#9 2010-03-17 01:39:42

sand_man
Member
From: Australia
Registered: 2008-06-10
Posts: 2,164

Re: [NFS] Lockup when copying several GBs

Ultraman wrote:
sand_man wrote:

Did you try NFSv3?

Yup, see my post before the above one, I had problems with it.

I couldn't get NFS3 working properly, mount.nfs: Argument list too long kept popping up...

So I'll have to sort that out if I can.
Have you ever gotten that error?

Vaguely remember getting a similar error when trying to mount manually but it was ok from fstab.
Sorry I'm out of ideas.


#10 2010-03-17 19:47:21

AndyRTR
Developer
From: Magdeburg/Germany
Registered: 2005-10-07
Posts: 1,641

Re: [NFS] Lockup when copying several GBs

This NFS3 lockup is something different: http://bugzilla.kernel.org/show_bug.cgi?id=15552

So far I haven't had time to confirm the fix that is recommended there. It only happens to me with 2.6.33.1 not with 2.6.33. Maybe it's a regression.

#11 2010-04-08 04:57:48

Kasumi_Ninja
Member
Registered: 2009-12-31
Posts: 54

Re: [NFS] Lockup when copying several GBs

When transferring large files, NFS locks up all the time for me.

#12 2010-04-08 05:37:45

kalasmannen
Member
From: Örebro, Sweden
Registered: 2008-02-27
Posts: 39

Re: [NFS] Lockup when copying several GBs

I'm also experiencing this. In my setup, I have a file server running CentOS and exporting NFSv3 to my desktop clients, among them my workstation, which runs Arch. Recently, I started copying some large files over NFS, and my system behaved much like Ultraman describes.

The transfer rate also varies a lot while copying, from about 75 MB/s down to under 10 MB/s. It seems to stall from time to time: the transfer stops for a few seconds, resumes at a reasonable speed, then slows down again, and the cycle repeats.
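
To put numbers on this kind of fluctuation without a file manager in the way, a chunked write test can help. This is only a rough sketch: timed_write and its parameters are hypothetical names chosen for illustration, and the fsync after each chunk forces the data out of the page cache, so the rates are not directly comparable to what a plain copy reports.

```python
import os
import time


def timed_write(path, chunk_mib=64, chunks=16):
    """Write chunks of zeros to path and return the MiB/s seen per chunk."""
    block = b"\0" * (1024 * 1024)  # 1 MiB of zeros
    rates = []
    with open(path, "wb") as f:
        for _ in range(chunks):
            start = time.monotonic()
            for _ in range(chunk_mib):
                f.write(block)
            # Push this chunk to the server before timing the next one.
            f.flush()
            os.fsync(f.fileno())
            elapsed = time.monotonic() - start
            rates.append(chunk_mib / elapsed if elapsed > 0 else float("inf"))
    return rates
```

Pointing path at a file on the NFS mount and printing the returned list should show whether the rate drops gradually or stalls outright. Be aware that if the mount hangs, this script will block in the same unkillable way as any other writer.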

My system does not lock up completely, but applications become very unresponsive and the system load goes up to around 7-8 before the copy process hangs completely. The process is then unkillable, and the only way to get rid of it is to reboot.

Transferring the same file via SMB gives me a very stable transfer rate and otherwise works just fine. I don't think this has been a problem in the past either.

#13 2010-04-08 10:15:42

Ultraman
Member
Registered: 2009-12-24
Posts: 242

Re: [NFS] Lockup when copying several GBs

kalasmannen, have you tried increasing the buffer sizes? You can use the rsize and wsize options for that.
I currently mount my NFS with the following line in my /etc/fstab:

192.168.0.8:/ /mnt/alpha/share nfs4 defaults,auto,user,async,noatime,rsize=65536,wsize=65536 0 0

Stability improved quite a bit after that. I don't run into this as often anymore.
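
Since the client and server negotiate these sizes, the values in fstab are a request rather than a guarantee. A small sketch for checking what a mount actually ended up with, assuming the usual /proc/mounts format (device, mount point, type, options); nfs_io_sizes is a made-up helper name:

```python
def nfs_io_sizes(mounts_text):
    """Map each NFS mount point to its (rsize, wsize) from /proc/mounts text."""
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        rsize = opts.get("rsize")
        wsize = opts.get("wsize")
        sizes[fields[1]] = (
            int(rsize) if rsize is not None else None,
            int(wsize) if wsize is not None else None,
        )
    return sizes
```

On the client this could be used as nfs_io_sizes(open("/proc/mounts").read()). If the numbers come back smaller than the 65536 requested above, the server probably limited them.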

I also remember reading something interesting concerning buffers in the changelog of the 2.6.33 kernel, which I hope fixes this for me. There was a fix for my NIC (an RTL8111, which uses the r8169 driver) that had something to do with buffers. But I read that a while back, and I can't find it quickly right now.

Keep us posted on developments at your end.

#14 2010-04-08 11:28:27

kalasmannen
Member
From: Örebro, Sweden
Registered: 2008-02-27
Posts: 39

Re: [NFS] Lockup when copying several GBs

Ultraman wrote:

kalasmannen, have you tried increasing the buffer sizes? You can use the rsize and wsize options for that.
I currently mount my NFS with the following line in my /etc/fstab:

192.168.0.8:/ /mnt/alpha/share nfs4 defaults,auto,user,async,noatime,rsize=65536,wsize=65536 0 0

Stability improved quite a bit after that. I don't run into this as often anymore.

I also remember reading something interesting concerning buffers in the changelog of the 2.6.33 kernel, which I hope fixes this for me. There was a fix for my NIC (an RTL8111, which uses the r8169 driver) that had something to do with buffers. But I read that a while back, and I can't find it quickly right now.

Keep us posted on developments at your end.

Yes, I've increased the wsize and rsize as suggested in the wiki. (http://wiki.archlinux.org/index.php/Nfs … nd_gigabit)

That did improve it a bit, but I still eventually run into this. When transferring a 100GB file, I get a complete stall at about 30GB, and from then on it won't go any further no matter what. The load is about 6-7 at that point. I will experiment a bit more with the server and see if that helps.

I also tried UDP vs TCP, but they both behave pretty much the same way. UDP gives me somewhat fewer problems, but sooner or later I get stuck anyway.

#15 2010-04-10 12:00:15

Tera
Member
From: Finland
Registered: 2007-01-25
Posts: 81

Re: [NFS] Lockup when copying several GBs

Ultraman wrote:

Beh. Crashed again after a while.

Seems kswapd gets angry for some reason. And after that... everything that wants to be friends with it gets punched in the face, who in turn punch others in the face. Which makes my browser and lots of other stuff halt.
I cannot use the "sync" command in a terminal after this, of course. Reboot or shutdown also doesn't want to do anything, because they are waiting for the punching to stop...

Anybody have an idea what's going on here?

This is a regression in the NFS code since 2.6.32. It has been fixed in kernels 2.6.33.2 and 2.6.32.11.
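
A small sketch for checking a kernel version string against the two fixed releases named above. The function names are hypothetical. Treating series older than 2.6.32 as unaffected and series newer than 2.6.33 as already fixed are assumptions read from the statement above, not something confirmed in this thread.

```python
import re

# First stable release in each series that carries the fix, per the post above.
FIXED_IN = {(2, 6, 32): 11, (2, 6, 33): 2}


def parse_release(release):
    """Turn a string such as '2.6.33.2-ARCH' into (2, 6, 33, 2)."""
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if not match:
        raise ValueError("no version number in %r" % release)
    return tuple(int(part) for part in match.group(1).split("."))


def status(release):
    """Classify a kernel release relative to the NFS regression."""
    version = parse_release(release)
    series = version[:3]
    patch = version[3] if len(version) > 3 else 0
    if series < (2, 6, 32):
        return "predates the regression"  # assumption: older series unaffected
    if series in FIXED_IN:
        return "fixed" if patch >= FIXED_IN[series] else "affected"
    return "newer series (assumed fixed)"  # assumption, not confirmed here
```

For example, status("2.6.33.1") returns "affected" and status("2.6.32.11") returns "fixed". The string reported by uname -r may not include the stable patch level, in which case the kernel package version is the better input.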
