I first posted this in another thread, but since that thread is marked "solved" I'm afraid no potential helper will notice it.
I am getting regular, unrecoverable crashes of rpc.idmapd, unfortunately on two machines running kernel 3.4.x. The home directory is on an NFSv4 share.
The crash only happens on write access (e.g. "Save as..." and boom), so far never on read access.
This is the kernel message:
[13272.827969] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[13272.828033] BUG: unable to handle kernel paging request at ffff880042174840
[13272.828062] IP: [<ffff880042174840>] 0xffff88004217483f
[13272.828100] PGD 180c063 PUD 1fffc067 PMD 402b3063 PTE 8000000042174163
[13272.828134] Oops: 0011 [#1] PREEMPT SMP
[13272.828162] CPU 0
[13272.828172] Modules linked in: fuse cpufreq_conservative nfsd exportfs usbhid hid nouveau evdev video mxm_wmi wmi i2c_algo_bit drm_kms_helper ttm drm edac_mce_amd i2c_nforce2 edac_core psmouse serio_raw k8temp i2c_core nv_tco fan thermal snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_pcm snd_page_alloc snd_timer snd_mixer_oss snd soundcore ac97_bus pcspkr button 8139too mii powernow_k8 mperf processor nfs nfs_acl lockd auth_rpcgss sunrpc fscache ext4 crc16 jbd2 mbcache sd_mod sr_mod cdrom ohci_hcd pata_acpi pata_amd ehci_hcd usbcore sata_nv ata_generic libata scsi_mod usb_common
[13272.828511]
[13272.828523] Pid: 484, comm: rpc.idmapd Not tainted 3.4.3-1-ARCH #1 /C.NC61-M2
[13272.828556] RIP: 0010:[<ffff880042174840>] [<ffff880042174840>] 0xffff88004217483f
[13272.828587] RSP: 0018:ffff880073459d40 EFLAGS: 00010246
[13272.828604] RAX: ffff880045affe20 RBX: ffff8800154989c0 RCX: ffff880077fdf9c0
[13272.828627] RDX: 0000000000000005 RSI: ffff880073459de9 RDI: ffff880045affe90
[13272.828650] RBP: ffff880073459d88 R08: 2222222222222222 R09: 2222222222222222
[13272.828671] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880077fdf9c0
[13272.828693] R13: ffff880045affe90 R14: ffff880073459de9 R15: 0000000000000005
[13272.828710] FS: 00007fa0c43d7700(0000) GS:ffff88007bc00000(0000) knlGS:00000000f49b7800
[13272.828710] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13272.828710] CR2: ffff880042174840 CR3: 0000000076bcc000 CR4: 00000000000007f0
[13272.828710] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13272.828710] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[13272.828710] Process rpc.idmapd (pid: 484, threadinfo ffff880073458000, task ffff880077bacf60)
[13272.828710] Stack:
[13272.828710] ffffffff811f01fc ffff880073459db0 ffff880073459db0 ffff880077508cd0
[13272.828710] ffff880077fdf9c0 ffff880045affe90 ffff880073459de9 0000000000000005
[13272.828710] ffff8800154989c0 ffff880073459dd8 ffffffff811f0303 ffff880073459de8
[13272.828710] Call Trace:
[13272.828710] [<ffffffff811f01fc>] ? __key_instantiate_and_link+0x5c/0x100
[13272.828710] [<ffffffff811f0303>] key_instantiate_and_link+0x63/0xa0
[13272.828710] [<ffffffffa02727cd>] idmap_pipe_downcall+0x1bd/0x1e0 [nfs]
[13272.828710] [<ffffffffa01eccc9>] rpc_pipe_write+0x69/0x90 [sunrpc]
[13272.828710] [<ffffffff8116e7e8>] vfs_write+0xa8/0x180
[13272.828710] [<ffffffff8116eb2a>] sys_write+0x4a/0xa0
[13272.828710] [<ffffffff8146a7e9>] system_call_fastpath+0x16/0x1b
[13272.828710] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <02> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[13272.828710] RIP [<ffff880042174840>] 0xffff88004217483f
[13272.828710] RSP <ffff880073459d40>
[13272.828710] CR2: ffff880042174840
[13272.828710] [drm] nouveau 0000:00:05.0: Setting dpms mode 3 on tmds encoder (output 1)
[13272.848427] [drm] nouveau 0000:00:05.0: 0xC197: Parsing digital output script table
[13272.850385] [drm] nouveau 0000:00:05.0: Setting dpms mode 0 on tmds encoder (output 1)
[13272.850385] [drm] nouveau 0000:00:05.0: Output DVI-D-1 is running on CRTC 0 using output A
[13272.898685] ---[ end trace 8b5b7246e29a15c5 ]---
I changed your quote tags to code - Inxsible
Today I counted 6 crashes in one working day. I can usually still reboot via a tty, so no file system has been damaged so far.
This is a bad one! Hopefully it is already known.
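For anyone who wants to tally such crashes instead of counting by hand, here is a minimal Python sketch. It assumes the kernel log has been saved as plain text and that each crash produces an "Oops:" line and a "Pid: ..., comm: ..." line like the ones in the trace above. The function name `summarize_oopses` is made up for this illustration.

```python
import re

# Patterns modelled on the oops output quoted in this thread.
OOPS_RE = re.compile(r"\bOops: [0-9a-f]{4} \[#\d+\]")
COMM_RE = re.compile(r"\bPid: \d+, comm: (\S+)")


def summarize_oopses(log_text):
    """Return (number of oops lines, list of crashing process names)."""
    oops_count = 0
    procs = []
    for line in log_text.splitlines():
        if OOPS_RE.search(line):
            oops_count += 1
        match = COMM_RE.search(line)
        if match:
            procs.append(match.group(1))
    return oops_count, procs


# Two lines copied from the trace above, used as sample input.
sample = (
    "[13272.828134] Oops: 0011 [#1] PREEMPT SMP\n"
    "[13272.828523] Pid: 484, comm: rpc.idmapd Not tainted 3.4.3-1-ARCH #1 /C.NC61-M2\n"
)
print(summarize_oopses(sample))
```

Feeding it a whole day of kernel log should give the crash count and show whether it is always the same process that dies.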
Last edited by Inxsible (2012-06-21 14:20:52)
Exactly which kernel are you using? According to the arch-dev mailing list, there should be NFS fixes in 3.4.2.2.
The kernel version on both affected machines is 3.4.3-1. The NFSv4 server runs kernel 3.3.8. Another terminal server with NFS homes and kernel 3.3.8 works perfectly.
The trouble started with 3.4.x. With the latest kernel version the crashes became a little less frequent, but they are still far from acceptable. I can't see any pattern besides the write access.
I cannot restart the service successfully (neither nfs-common nor rpc.idmapd), so I can't resolve the problem that way.
The two computers with this problem have very different hardware:
1. Old dual-core Athlon, 2 GB, old NVIDIA
2. i7-2700K, 16 GB
NFS is so widely used; is nobody else affected?
You could try the LTS kernel until it's resolved.
jcci, please use [ code ] tags rather than [ quote ] tags, as they add scroll bars, which keeps the thread from getting too long. The font is also much easier to read. I have edited your post this time.
OK for the code tags, sorry!
A basic requirement for using Arch (with its rolling updates) in a company is a double installation with a manual sync. The spare system is proven to run well and only gets rsync-ed once the primary system has proven to be stable. We have done this since 2007, and I have never had to make our staff get used to major software updates since. So much for the Arch promotion.
So I went back to the spare system with kernel 3.3.8 and everything is OK.
Are you suggesting I try again with every kernel update? We might have a reproducible reference case here.
Last edited by jcci (2012-06-22 02:06:19)
New experiment:
I took the most recent system, with everything updated just this morning, and downgraded only the kernel to 3.3.8.
The systems that have not been updated since the change from 3.3.8 to 3.4.x have proven to be stable.
I was expecting everything to be OK, but after half a day of work NFS was gone again, and the kernel message changed slightly:
Jun 22 12:53:41 anyhost kernel: [11756.718052] PGD 1807067 PUD 1808067 PMD 0
Jun 22 12:53:41 anyhost kernel: [11756.718074] Oops: 0000 [#1] PREEMPT SMP
Jun 22 12:53:41 anyhost kernel: [11756.718096] CPU 1
Jun 22 12:53:41 anyhost kernel: [11756.718106] Modules linked in: fuse tun cpufreq_conservative nfsd exportfs snd_hda_codec_hdmi snd_hda_codec_realtek mxm_wmi usbhid hid snd_hda_intel snd_hda_codec snd_hwdep r8169 snd_pcm serio_raw iTCO_wdt i2c_i801 pcspkr iTCO_vendor_support mii snd_page_alloc snd_timer snd soundcore kvm_intel mei(C) kvm wmi evdev coretemp acpi_cpufreq mperf processor nfs nfs_acl lockd auth_rpcgss sunrpc fscache i915 video button i2c_algo_bit intel_agp intel_gtt drm_kms_helper drm i2c_core btrfs crc32c libcrc32c zlib_deflate ext4 crc16 jbd2 mbcache ehci_hcd xhci_hcd usbcore sr_mod usb_common cdrom sd_mod ahci libahci libata scsi_mod
Jun 22 12:53:41 anyhost kernel: [11756.718416]
Jun 22 12:53:41 anyhost kernel: [11756.718425] Pid: 3285, comm: pool Tainted: G WC 3.3.8-1-ARCH #1 Gigabyte Technology Co., Ltd. Z68A-D3H-B3/Z68A-D3H-B3
Jun 22 12:53:41 anyhost kernel: [11756.718474] RIP: 0010:[<ffffffffa0398b8f>] [<ffffffffa0398b8f>] nfs_have_delegation+0x1f/0x60 [nfs]
Jun 22 12:53:41 anyhost kernel: [11756.718516] RSP: 0018:ffff88034adc3bc8 EFLAGS: 00010202
Jun 22 12:53:41 anyhost kernel: [11756.718537] RAX: ffff880329173800 RBX: 0000000000000001 RCX: 0000000000581f81
Jun 22 12:53:41 anyhost kernel: [11756.718566] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
Jun 22 12:53:41 anyhost kernel: [11756.718594] RBP: ffff88034adc3bd8 R08: 0000000000016340 R09: ffff8803ffc56340
Jun 22 12:53:41 anyhost kernel: [11756.718623] R10: ffffea000d5e9200 R11: ffffffffa0387f74 R12: 0000000000000000
Jun 22 12:53:41 anyhost kernel: [11756.718651] R13: ffff8803e384a800 R14: 0000000000000000 R15: ffff8803e846b000
Jun 22 12:53:41 anyhost kernel: [11756.718680] FS: 00007f9593007700(0000) GS:ffff8803ffc40000(0000) knlGS:0000000000000000
Jun 22 12:53:41 anyhost kernel: [11756.718712] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 22 12:53:41 anyhost kernel: [11756.718735] CR2: ffffffffffffffb8 CR3: 00000003d3f31000 CR4: 00000000000406e0
Jun 22 12:53:41 anyhost kernel: [11756.718764] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 12:53:41 anyhost kernel: [11756.718792] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 22 12:53:41 anyhost kernel: [11756.718821] Process pool (pid: 3285, threadinfo ffff88034adc2000, task ffff880329173800)
Jun 22 12:53:41 anyhost kernel: [11756.718853] Stack:
There might be something else involved besides the kernel version.
In errors.log I found a lot of these:
request-key: Cannot find command to construct key
Could this be related?
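One thing that could be checked here, as a rough sketch and not a diagnosis: the message looks like what /sbin/request-key logs when no rule in its configuration matches the requested key. Assuming the NFSv4 idmapper asks for keys of type `id_resolver` (the type named in the nfsidmap man page), the Python below tests whether a request-key.conf text contains a `create` rule for that type. The function name `has_create_rule` is hypothetical, and the check deliberately ignores wildcard type fields.

```python
def has_create_rule(conf_text, key_type="id_resolver"):
    """Return True if conf_text has a 'create <key_type> ...' rule.

    Expected columns: OP TYPE DESCRIPTION CALLOUT-INFO PROGRAM [ARGS...]
    Only an exact match on the TYPE column is considered (assumption:
    wildcard types are not relevant for this check).
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "create" and fields[1] == key_type:
            return True
    return False


# Example rule in the form shown in the nfsidmap man page.
example_conf = "create id_resolver * * /usr/sbin/nfsidmap %k %d\n"
print(has_create_rule(example_conf))
```

On a real system the text would come from /etc/request-key.conf (and possibly files under /etc/request-key.d/, if that directory is in use). A missing rule would only explain the request-key messages, not necessarily the oops itself.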
Last edited by jcci (2012-06-22 06:02:54)