I have a machine with some very weird btrfs issues (another machine with the exact same hardware, usage and setup runs fine). First of all, as soon as "Device unallocated" drops to 0, btrfs reports the device as full. I can 'fix' this by balancing, but that takes far too long: it took 16 hours to balance 14 groups. The balance is CPU-bound; various kworkers and the btrfs-balance processes sit at 100% CPU. I've had this problem for years on this machine, even though it always runs the most recent kernel and btrfs-progs and has seen a few new filesystems.
Some more info:
Overall:
    Device size:                 894.08GiB
    Device allocated:            858.00GiB
    Device unallocated:           36.08GiB
    Device missing:                  0.00B
    Used:                        531.32GiB
    Free (estimated):            179.69GiB    (min: 179.69GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB    (used: 0.00B)

Data,RAID1: Size:421.97GiB, Used:260.31GiB
    /dev/sda2    421.97GiB
    /dev/sdb2    421.97GiB

Metadata,RAID1: Size:7.00GiB, Used:5.34GiB
    /dev/sda2      7.00GiB
    /dev/sdb2      7.00GiB

System,RAID1: Size:32.00MiB, Used:96.00KiB
    /dev/sda2     32.00MiB
    /dev/sdb2     32.00MiB

Unallocated:
    /dev/sda2     18.04GiB
    /dev/sdb2     18.04GiB
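
For anyone reading the numbers above: the "Free (estimated)" value can be roughly reproduced from the other figures. The sketch below assumes the estimate is the slack inside already allocated data chunks plus the unallocated raw space divided by the data ratio. That formula is my assumption, not something taken from the btrfs-progs source, and the helper name estimated_free is made up for illustration.

```python
# Figures copied from the usage output above, in GiB.
DATA_SIZE = 421.97        # Data,RAID1 Size
DATA_USED = 260.31        # Data,RAID1 Used
UNALLOCATED_RAW = 36.08   # Device unallocated (raw, summed over both devices)
DATA_RATIO = 2.0          # RAID1 stores every data block twice


def estimated_free(data_size, data_used, unallocated_raw, data_ratio):
    """Rough free-space estimate (assumed formula, see the text above)."""
    slack_in_chunks = data_size - data_used       # free space inside allocated data chunks
    new_chunk_space = unallocated_raw / data_ratio  # usable space still unallocated
    return slack_in_chunks + new_chunk_space


free = estimated_free(DATA_SIZE, DATA_USED, UNALLOCATED_RAW, DATA_RATIO)
slack = DATA_SIZE - DATA_USED
print(f"slack inside data chunks: {slack:.2f} GiB")
print(f"estimated free:           {free:.2f} GiB")
```

Under that assumption the result lands within rounding of the reported 179.69GiB, and about 160GiB of it is slack inside chunks that are already allocated. Only about 18GiB per device is left for allocating new chunks, which is the part that reaches 0 before the "full" errors appear.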
Btrfs frequently hangs while balancing:
INFO: task btrfs-balance:359 blocked for more than 120 seconds.
Not tainted 4.12.13-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-balance D 0 359 2 0x00000000
Call Trace:
__schedule+0x236/0x870
schedule+0x3d/0x90
schedule_timeout+0x208/0x390
wait_for_completion+0xa5/0x120
? wait_for_completion+0xa5/0x120
? wake_up_q+0x80/0x80
btrfs_async_run_delayed_refs+0x119/0x140 [btrfs]
__btrfs_end_transaction+0x1e9/0x2e0 [btrfs]
btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
relocate_block_group+0x3f2/0x620 [btrfs]
btrfs_relocate_block_group+0x18c/0x240 [btrfs]
btrfs_relocate_chunk+0x38/0xd0 [btrfs]
btrfs_balance+0xc10/0x13d0 [btrfs]
? vprintk_func+0x20/0x50
balance_kthread+0x5b/0x80 [btrfs]
kthread+0x125/0x140
? btrfs_balance+0x13d0/0x13d0 [btrfs]
? kthread_create_on_node+0x70/0x70
ret_from_fork+0x25/0x30
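
To make the trace easier to read, here is a small sketch that keeps only the frames belonging to the btrfs module and drops the ones prefixed with "?", which the kernel marks as unreliable. The regular expression and the helper name reliable_frames are my own and only cover the frame format shown in this post.

```python
import re

# Call trace copied from the hung-task report above.
TRACE = """\
__schedule+0x236/0x870
schedule+0x3d/0x90
schedule_timeout+0x208/0x390
wait_for_completion+0xa5/0x120
? wait_for_completion+0xa5/0x120
? wake_up_q+0x80/0x80
btrfs_async_run_delayed_refs+0x119/0x140 [btrfs]
__btrfs_end_transaction+0x1e9/0x2e0 [btrfs]
btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
relocate_block_group+0x3f2/0x620 [btrfs]
btrfs_relocate_block_group+0x18c/0x240 [btrfs]
btrfs_relocate_chunk+0x38/0xd0 [btrfs]
btrfs_balance+0xc10/0x13d0 [btrfs]
? vprintk_func+0x20/0x50
balance_kthread+0x5b/0x80 [btrfs]
kthread+0x125/0x140
? btrfs_balance+0x13d0/0x13d0 [btrfs]
? kthread_create_on_node+0x70/0x70
ret_from_fork+0x25/0x30
"""

# One frame: optional "? " marker, symbol+offset/size, optional "[module]".
FRAME = re.compile(
    r"^(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?$"
)


def reliable_frames(trace, module=None):
    """Return function names of reliable frames, innermost first."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.match(line.strip())
        if match is None or match.group("unreliable"):
            continue  # skip unparsable lines and '?' frames
        if module is not None and match.group("module") != module:
            continue
        frames.append(match.group("func"))
    return frames


btrfs_frames = reliable_frames(TRACE, module="btrfs")
for name in btrfs_frames:
    print(name)
```

Read innermost first, the filtered list shows the balance thread blocked in btrfs_async_run_delayed_refs, reached from btrfs_end_transaction_throttle inside relocate_block_group. So the task is waiting for delayed refs to be processed while relocating a block group.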
This filesystem uses snapshots heavily, but no quota. The machine is a beefy PowerEdge R420 with two Intel S3500 SSDs in a RAID1 setup, which should be able to handle any load just fine. If the machine does anything else while balancing, it completely kills itself and needs a hard reboot. I can only balance by running it from single-user mode.
How can I debug this further? What could btrfs be doing that's CPU bound (possibly single threaded) while balancing?
Last edited by Spider.007 (2017-09-19 13:55:28)
A completely different machine now suddenly breaks (hangs completely) when balancing as well. Has anyone successfully balanced recently using the current kernel?
Just started one now to check for you, will report back when done.
Yes, I just balanced two filesystems in a few seconds each. I'm on kernel 4.12.13-1-ARCH.
Tim
My balance completed in the usual 6 hours or so.
Kernel 4.13.2-1
3x1TB HDs in a RAID1; none of the btrfs processes hit more than 10% load, and I'm running a 7+ year old CPU.
Last edited by Slithery (2017-09-22 07:57:22)
Thanks for checking! It might be related to "Device unallocated", which was 0 before balancing on both machines. I left the second machine running the balance to see how far it would get, but it didn't balance anything at all (instead it hung completely, with PCI bus errors).
I guess I'll check this again in a couple of months; the impact of a hanging balance is simply too big. Since this is the rootfs, btrfs starts balancing directly when booting, and even in single-user mode I'm usually too late.
Last edited by Spider.007 (2017-09-23 12:10:20)