Btrfs - Balancing takes forever, and drive full when unallocated=0

Spider.007 · 2017-09-19 13:53:49

I have a machine with some very weird btrfs issues (another machine with exactly the same hardware, usage and setup runs fine). First of all, as soon as Device unallocated drops to 0, btrfs says the device is full. I can 'fix' this by balancing, but that takes far too long because kworker threads hog 100% CPU. To illustrate: it took 16 hours to balance 14 block groups. The balance is CPU-bound, with various kworker and btrfs-balance processes sitting at 100%. I've had this problem on this machine for years, even though it always runs the most recent kernel and btrfs-progs and has been through a few freshly created filesystems.
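
A balance restricted with the `usage` filter (documented in btrfs-balance(8)) only relocates data chunks that are at most the given percentage full, which is often enough to return allocated-but-mostly-empty chunks to the unallocated pool without a full balance. The following is a minimal sketch of that approach, not what the poster ran; the threshold list and the `filtered_balance` function name are illustrative assumptions, and whether it avoids the CPU-bound behavior described here is untested.

```shell
# Hypothetical helper: run progressively wider usage-filtered data balances.
# BTRFS can be overridden (e.g. BTRFS=echo) to print the commands instead of running them.
BTRFS=${BTRFS:-btrfs}

filtered_balance() {
    mnt=$1
    # Assumption: illustrative thresholds; nearly empty chunks come first
    # because they have the least data to move.
    for pct in 0 5 10 25 50; do
        # -dusage=N: only data block groups with usage <= N% are relocated
        $BTRFS balance start -dusage="$pct" "$mnt" || return 1
    done
}
```

For example, `BTRFS=echo` followed by `filtered_balance /` only prints the five `balance start` commands; calling it without the override (as root) runs them one after another and stops at the first failure.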

Some more info:

Overall:
    Device size:		 894.08GiB
    Device allocated:		 858.00GiB
    Device unallocated:		  36.08GiB
    Device missing:		     0.00B
    Used:			 531.32GiB
    Free (estimated):		 179.69GiB	(min: 179.69GiB)
    Data ratio:			      2.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 0.00B)

Data,RAID1: Size:421.97GiB, Used:260.31GiB
   /dev/sda2	 421.97GiB
   /dev/sdb2	 421.97GiB

Metadata,RAID1: Size:7.00GiB, Used:5.34GiB
   /dev/sda2	   7.00GiB
   /dev/sdb2	   7.00GiB

System,RAID1: Size:32.00MiB, Used:96.00KiB
   /dev/sda2	  32.00MiB
   /dev/sdb2	  32.00MiB

Unallocated:
   /dev/sda2	  18.04GiB
   /dev/sdb2	  18.04GiB
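
The Free (estimated) figure can be cross-checked from the other numbers. A minimal sketch, assuming btrfs-progs estimates free space as the unused part of the data chunks plus the unallocated space divided by the data ratio (this formula is a reading of btrfs-progs behavior, not something stated in the thread); `free_estimate` is a hypothetical helper. With the values above it gives about 179.70 GiB, matching the reported 179.69GiB up to rounding, and shows that roughly 161.66 GiB of that sits inside already-allocated data chunks. That space cannot be handed to metadata, so once unallocated reaches 0 a new metadata chunk cannot be allocated even though plenty of data space is free, which would fit the "device full" symptom.

```shell
# Hypothetical helper. Assumption: estimate = (data size - data used) + unallocated / data ratio
# Arguments (GiB): data size, data used, total unallocated, data ratio
free_estimate() {
    LC_ALL=C awk -v size="$1" -v used="$2" -v unalloc="$3" -v ratio="$4" \
        'BEGIN { printf "%.2f\n", (size - used) + unalloc / ratio }'
}

# Values from the usage output above: Data,RAID1 size/used, Device unallocated, Data ratio
free_estimate 421.97 260.31 36.08 2
```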

Btrfs frequently hangs while balancing:

INFO: task btrfs-balance:359 blocked for more than 120 seconds.
      Not tainted 4.12.13-1-ARCH #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-balance   D    0   359      2 0x00000000
Call Trace:
 __schedule+0x236/0x870
 schedule+0x3d/0x90
 schedule_timeout+0x208/0x390
 wait_for_completion+0xa5/0x120
 ? wait_for_completion+0xa5/0x120
 ? wake_up_q+0x80/0x80
 btrfs_async_run_delayed_refs+0x119/0x140 [btrfs]
 __btrfs_end_transaction+0x1e9/0x2e0 [btrfs]
 btrfs_end_transaction_throttle+0x13/0x20 [btrfs]
 relocate_block_group+0x3f2/0x620 [btrfs]
 btrfs_relocate_block_group+0x18c/0x240 [btrfs]
 btrfs_relocate_chunk+0x38/0xd0 [btrfs]
 btrfs_balance+0xc10/0x13d0 [btrfs]
 ? vprintk_func+0x20/0x50
 balance_kthread+0x5b/0x80 [btrfs]
 kthread+0x125/0x140
 ? btrfs_balance+0x13d0/0x13d0 [btrfs]
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x25/0x30

This filesystem uses snapshots heavily, but no quota. The machine is a beefy PowerEdge R420 with two Intel S3500 SSDs in RAID1, which should handle this load without trouble. If the machine does anything else while balancing, it locks up completely and needs a hard reboot. I can only balance by running it from single-user mode.

How can I debug this further? What could btrfs be doing during a balance that is CPU-bound (and possibly single-threaded)?

Last edited by Spider.007 (2017-09-19 13:55:28)

Spider.007 · 2017-09-21 17:26:20

A completely different machine now also hangs completely when balancing. Has anyone recently balanced successfully with the current kernel?

Slithery · 2017-09-21 17:31:53

Just started one now to check for you, will report back when done.

ratcheer · 2017-09-21 23:56:00

Yes, I just balanced two filesystems in a few seconds each. I'm on kernel 4.12.13-1-ARCH.

Tim

Slithery · 2017-09-22 07:56:37

My balance completed in the usual 6 hours or so.
Kernel 4.13.2-1
3x 1TB HDDs in RAID1; none of the btrfs processes went above 10% load, and I'm running a 7+ year old CPU.

Last edited by Slithery (2017-09-22 07:57:22)

Spider.007 · 2017-09-23 12:09:39

Thanks for checking! It might be related to Device unallocated, which was 0 on both machines before balancing. I left the balance running on the second machine to see how far it would get, but it didn't balance anything at all; instead it hung completely, including PCI bus errors.

I guess I'll revisit this in a couple of months; the impact of a hanging balance is simply too big. Since this is the root filesystem, btrfs resumes the interrupted balance right at boot, and even in single-user mode I'm usually too late to stop it.
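
The btrfs `skip_balance` mount option (see btrfs(5)) keeps an interrupted balance paused when the filesystem is mounted instead of resuming it; for a root filesystem it would have to be passed on the kernel command line as `rootflags=skip_balance`, after which `btrfs balance cancel /` can drop the paused balance. A minimal sketch of a check that the option actually took effect, assuming it appears among the mount options in /proc/self/mounts; `has_skip_balance` and the `MOUNTS` override are hypothetical names.

```shell
# Hypothetical helper: succeed if the given btrfs mount point was mounted with skip_balance.
# MOUNTS may point at any file in /proc/mounts format (default: /proc/self/mounts).
has_skip_balance() {
    mnt=$1
    mounts=${MOUNTS:-/proc/self/mounts}
    # Fields: device, mount point, fs type, comma-separated options, dump, pass
    awk -v m="$mnt" '$2 == m && $3 == "btrfs" {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) if (opts[i] == "skip_balance") found = 1
    } END { exit (found ? 0 : 1) }' "$mounts"
}
```

For example, after one boot with `rootflags=skip_balance`, `has_skip_balance / && btrfs balance cancel /` would cancel the paused balance only when the option is in effect.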

Last edited by Spider.007 (2017-09-23 12:10:20)

Arch Linux