btrfs snapshots and a video from Oracle's Avi Miller

graysky · 2012-01-29 13:34:32

Here is a 48 min talk on btrfs by Avi Miller of Oracle. Btrfs will be the default in the next release of OL and I think Fedora. Anyway, watch the presentation. Looks pretty cool. Copy on write is a very cool concept to me. Some questions I have for peeps running it now:

1) A snapshot every 30 sec is the default. How does this affect free space? Surely, given enough changes, the drive would eventually fill up, with snapshots of files, no?
2) Is there a GUI based solution to managing these snapshots?

EDIT: See my post #15 (2012-01-31 20:52:15) for how to change the interval asked about in question 1.

Last edited by graysky (2013-10-10 09:52:50)

teekay · 2012-01-29 22:09:42

A snapshot every 30 seconds is not the default. Maybe in OL they make it default, but that doesn't mean you have to do it too. Every 30 seconds sounds like overkill, might be useful for large scale data stores or something though. The space needed is the diff of changed files, yes.

I read that Fedora is going to add hooks to yum, so that a snapshot gets taken prior to every system update. That sounds like a cool feature.

A snapshot is actually a subfolder in the root tree of the FS, and you could just "rm" them. So a GUI for managing snapshots might be just a file manager. Or maybe something like http://en.opensuse.org/Portal:Snapper which also provides rollback features. I'm not yet aware of any app like Yast2-snapper that runs standalone (i.e. is not a YaST module)

Last edited by teekay (2012-01-29 22:18:42)

fukawi2 · 2012-01-29 22:52:31

graysky wrote:

Here is a 48 min talk on btrfs by Avi Miller of Oracle.

Hey... That's me down the front-left

graysky wrote:

1) A snapshot every 30 sec is the default. How does this affect free space? Surely, given enough changes, the drive would eventually fill up, with snapshots of files, no?

My understanding from the presentation and talking to Avi at other times during LCA is that 30 seconds *is* the default, but it only keeps a rolling 6 (IIRC) snapshots (ie, 3 minutes worth)... The scenario being power failure, on boot it can revert to the last snapshot if the FS is unrepairable, or the one before that, or before that... The chances of the FS being unrepairable, and *6* snapshots being damaged means you killed too many hobbits and deserve to lose your data

Of course, it's not good for laptops... Disk I/O every 30 secs means hard drive can never sleep. My battery wasn't lasting a single presentation (< 1 hour) before I saw Avi's presentation and adjusted.

Last edited by fukawi2 (2012-01-29 22:54:27)

mikesd · 2012-01-30 07:43:24

fukawi2 wrote:

Of course, it's not good for laptops... Disk I/O every 30 secs means hard drive can never sleep. My battery wasn't lasting a single presentation (< 1 hour) before I saw Avi's presentation and adjusted.

Hmm, Avi mentioned it was adjustable at run time but didn't say exactly how. Is it a mount option or something else?

Avi Miller · 2012-01-30 20:30:31

graysky wrote:

1) A snapshot every 30 sec is the default. How does this affect free space? Surely, given enough changes, the drive would eventually fill up, with snapshots of files, no?

There is a bit of a misunderstanding here, which is probably due to poor delivery on my part: we are not snapshotting the file system every 30 seconds. What you're seeing is the rolling checkpoints of the start of each btree on the filesystem. We checkpoint every 30 seconds via copy-on-write of the btree metadata. The data is all copy-on-write as well. This means that btrfs can walk backwards through the checkpoints to find a tree that is consistent and mountable. As the data is also copy-on-write, the data from that checkpoint is still in place as well.

Having said that, snapshots themselves consume no space at all when they are created: again, the entire filesystem is copy-on-write. So, what happens is that the filesystem only stores the blocks that change after the snapshot is taken. So, at the moment the snapshot is taken, there is zero space consumed. As the files that have been snapshotted are changed, however, we start to store those changes on disk. This can go on indefinitely. There is no performance hit for multiple snapshots either.
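
To make the space accounting concrete, here is a minimal toy model in Python. It is only a sketch of the copy-on-write idea described above, not how btrfs is implemented: the `CowVolume` class, its methods, and the one-block-per-file simplification are all assumptions made up for illustration.

```python
class CowVolume:
    """Toy copy-on-write volume: each file is one block, snapshots share blocks."""

    def __init__(self):
        self.live = {}        # file name -> block id the live tree refers to
        self.snapshots = {}   # snapshot name -> frozen {file name: block id}
        self._next_block = 0

    def write(self, name):
        # Copy-on-write: a write always lands in a fresh block.
        self.live[name] = self._next_block
        self._next_block += 1

    def snapshot(self, snap_name):
        # Taking a snapshot copies references only, never data blocks.
        self.snapshots[snap_name] = dict(self.live)

    def delete_snapshot(self, snap_name):
        del self.snapshots[snap_name]

    def blocks_in_use(self):
        # A block stays allocated while the live tree or any snapshot refers to it.
        used = set(self.live.values())
        for frozen in self.snapshots.values():
            used.update(frozen.values())
        return len(used)


vol = CowVolume()
for name in ("a", "b", "c"):
    vol.write(name)

before = vol.blocks_in_use()          # three files, three blocks
vol.snapshot("pre-upgrade")
after_snapshot = vol.blocks_in_use()  # unchanged: the snapshot shares every block
vol.write("a")
after_change = vol.blocks_in_use()    # the old block of "a" is pinned by the snapshot
vol.delete_snapshot("pre-upgrade")
after_delete = vol.blocks_in_use()    # the pinned block is released again

print(before, after_snapshot, after_change, after_delete)  # 3 3 4 3
```

In this model the cost of a snapshot is exactly the set of old blocks it keeps alive after the live files change, which is why deleting the snapshot gives the space back.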

2) Is there a GUI based solution to managing these snapshots?

Again, what you saw was not snapshotting, but the rolling tree checkpoints, so there is no GUI for that. There is also no GUI for btrfs that I'm aware of. Keep in mind that real btrfs snapshots appear as directories in the filesystem, so they're very simple to manage in practice.

fukawi2 · 2012-01-30 22:41:46

Welcome Avi... Does the darkside know you're here? j/k

Avi Miller wrote:

.....we are not snapshotting the file system every 30 seconds. What you're seeing is the rolling checkpoints of the start of each btree on the filesystem.

That's right, my bad, I forgot the difference. You never did show us how to disable that (eg, on laptops)?

Avi Miller wrote:

Keep in mind that real btrfs snapshots appear as directories in the filesystem, so they're very simple to manage in practice.

I still love the way you can mount a specific directory from a snapshot/checkpoint, not just the whole snapshot/checkpoint

karol · 2012-01-30 22:44:29

fukawi2 wrote:

Avi Miller wrote:

Keep in mind that real btrfs snapshots appear as directories in the filesystem, so they're very simple to manage in practice.

I still love the way you can mount a specific directory from a snapshot/checkpoint, not just the whole snapshot/checkpoint

Really? That's just yummy!

Now the only thing missing is the fsck tool ;P

Avi Miller · 2012-01-30 22:51:08

fukawi2 wrote:

Welcome Avi... Does the darkside know you're here? j/k

Oh, I get around.

That's right, my bad, I forgot the difference. You never did show us how to disable that (eg, on laptops)?

I believe it's the "notreelog" mount option, but I'm double-checking with Chris to confirm.

I still love the way you can mount a specific directory from a snapshot/checkpoint, not just the whole snapshot/checkpoint

You can't mount a specific directory, but you can mount snapshots and subvolumes all at the same time.

mikesd · 2012-01-31 04:44:20

Avi Miller wrote:

I believe it's the "notreelog" mount option, but I'm double-checking with Chris to confirm.

Ahh, I did see that mentioned at https://btrfs.wiki.kernel.org/articles/ … arted.html under mount options. I wasn't sure if that was a current list though as it doesn't mention the "recover" option you demo'd.

crabman · 2012-01-31 17:03:21

This snapper looks like a very promising tool for managing snapshots... Has anyone gotten it to compile on Arch? If so, I'd be happy to make a PKGBUILD, but I keep ending up with

 ../snapper/XmlFile.h:27:25: fatal error: libxml/tree.h: No such file or directory

teekay · 2012-01-31 20:05:16

Sorry for my first post, actually watched the video after I posted (d'oh), and as Avi pointed out, graysky was talking about the internal tree backups, while I was thinking of real snapshots.

Here is a PKGBUILD for snapper: https://aur.archlinux.org/packages.php?ID=56271
Of course: no YaST, no GUI. But still useful imho.

Unfortunately the path where it stores the snapshots is hardcoded to <subvol>/.snapshots
See snapper.install for a quickstart.

Last edited by teekay (2012-01-31 21:53:22)

Avi Miller · 2012-01-31 20:38:12

Avi Miller wrote:

I believe it's the "notreelog" mount option, but I'm double-checking with Chris to confirm.

I was wrong: there is currently no way to disable the checkpointing that occurs automatically with btrfs. I apologise for the confusion.

Azeem · 2012-01-31 20:43:55

What's the point of disabling it though? What are the downsides to having this enabled?

Upsides:
1 The whole system is COW anyway, so it shouldn't be weird to have old copies lying around.
2 It offers a safe way to recover a filesystem to a known good state
3 Avi mentioned in the video that this makes fsync faster

Avi Miller · 2012-01-31 20:46:10

Azeem wrote:

What's the point of disabling it though? What are the downsides to having this enabled?

The point is to lower disk/battery usage on laptop computers, by not requiring a write to the filesystem every 30 seconds. That's the only downside to having it enabled. There are lots and lots of downsides to having it disabled, which is probably why disabling it is not currently an option.

graysky · 2012-01-31 20:52:15

@Avi - is there a way to select a longer interval? Perhaps for mobile devices, 120 times / hr is too much resolution, but maybe 4 times / hr is just fine, etc.

EDIT: 10-Oct-2013 - For others interested in tweaking this: once kernel v3.12 is released, the interval will be adjustable with a mount option.[1] For those wanting to change it now (3.11.x series), see the patch below, in which I set the commit interval to 300 sec.

--- a/fs/btrfs/disk-io.c	2013-09-02 16:46:10.000000000 -0400
+++ b/fs/btrfs/disk-io.c	2013-09-03 16:36:00.083873021 -0400
@@ -1737,7 +1737,7 @@
 
 	do {
 		cannot_commit = false;
-		delay = HZ * 30;
+		delay = HZ * 300;
 		mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
 		spin_lock(&root->fs_info->trans_lock);
@@ -1749,7 +1749,7 @@
 
 		now = get_seconds();
 		if (cur->state < TRANS_STATE_BLOCKED &&
-		    (now < cur->start_time || now - cur->start_time < 30)) {
+		    (now < cur->start_time || now - cur->start_time < 300)) {
 			spin_unlock(&root->fs_info->trans_lock);
 			delay = HZ * 5;
 			goto sleep;

1. https://github.com/torvalds/linux/commi … bbfd169df4
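
As a rough illustration of what different intervals mean in practice, the Python sketch below works out how many commits per hour a given interval produces and formats the matching mount option. The helper names are made up for this example. The `commit=<seconds>` spelling is what I understand the 3.12 mount option to be; treat it as an assumption and check the documentation for your kernel before relying on it.

```python
def commits_per_hour(interval_seconds):
    """How often the filesystem would commit in one hour at this interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return 3600 // interval_seconds


def commit_option(interval_seconds):
    """Format the interval as a mount option string (assumed spelling: commit=N)."""
    if interval_seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    return f"commit={interval_seconds}"


for seconds in (30, 300, 900):
    print(commit_option(seconds), "->", commits_per_hour(seconds), "commits/hour")
# commit=30 -> 120 commits/hour
# commit=300 -> 12 commits/hour
# commit=900 -> 4 commits/hour
```

The 30 second default gives the 120 commits per hour mentioned earlier in this post, the 300 seconds used in the patch gives 12, and 900 seconds gives the 4 per hour suggested for mobile devices.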

Last edited by graysky (2013-10-10 09:51:54)

Azeem · 2012-01-31 20:55:24

Thanks for registering just to answer our questions Avi.

Avi Miller wrote:

This can go on indefinitely. There is no performance hit for multiple snapshots either.

Another question: I noticed that the subvolids increment continuously despite having deleted previous subvolumes. This makes me wonder: Is there a limited number of snapshots/subvolumes one can make in a btrfs filesystem?

Hypothetically, if I were able to maintain the same btrfs volume for years, snapshotting and deleting snapshots every 5 minutes, would I eventually exceed a limit and lose my ability to create subvolumes?

Avi Miller · 2012-01-31 21:03:43

Azeem wrote:

Thanks for registering just to answer our questions Avi.

You're welcome.

Azeem wrote:

Another question: I noticed that the subvolids increment continuously despite having deleted previous subvolumes. This makes me wonder: Is there a limited number of snapshots/subvolumes one can make in a btrfs filesystem?

The limit is 2^64 or 1.84467441 × 10^19 snapshots on a 64-bit system. 32-bit systems don't like 64-bit inode numbers, so strange things happen after 2^32 snapshots. We recommend 64-bit for this reason with btrfs.
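
For the hypothetical of one snapshot every 5 minutes, the arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes a 365-day year and that every snapshot ever created uses up one id, whether or not it is deleted later.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
INTERVAL_SECONDS = 5 * 60  # one snapshot every 5 minutes

snapshots_per_year = SECONDS_PER_YEAR // INTERVAL_SECONDS


def years_until(limit):
    """Whole years of snapshotting before `limit` ids have been handed out."""
    return limit // snapshots_per_year


years_32 = years_until(2**32)
years_64 = years_until(2**64)

print(snapshots_per_year)   # 105120
print(years_32)             # 40857
print(f"{years_64:.2e}")    # on the order of 10^14 years
```

So even the 2^32 point, where 32-bit systems start to misbehave, is roughly 40,000 years away at that rate.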

graysky · 2012-01-31 21:24:15

graysky wrote:

@Avi - is there a way to select a longer interval? Perhaps for mobile devices, 120 times / hr is too much resolution, but maybe 4 times / hr is just fine, etc.

No love for me, Avi

Avi Miller · 2012-01-31 21:25:46

graysky wrote:

@Avi - is there a way to select a longer interval? Perhaps for mobile devices, 120 times / hr is too much resolution, but maybe 4 times / hr is just fine, etc.

Sorry, I missed this question: there is currently no way to change the interval: Chris says, "It wouldn't be hard to add a sysfs entry to control how often btrfs commits, but we don't have that now." Patches welcome.

Note that Chris also mentioned that Jens Axboe did a lot of work finding out why laptop drives spin and did lots to help avoid it. In practice, you get a lot more benefit from turning down the brightness of your screen. Also, if you're running SSDs, be sure to enable SATA Link Power Management as well.

Last edited by Avi Miller (2012-01-31 21:37:52)

graysky · 2012-01-31 22:15:19

@Avi - Very nice, thanks for the info. Maybe someone [a Principal Program Manager] can twist Chris' arm into adding support for this feature

Avi Miller · 2012-01-31 22:17:58

graysky wrote:

@Avi - Very nice, thanks for the info. Maybe someone [a Principal Program Manager] can twist Chris' arm into adding support for this feature

Nah, I don't have that kind of influence. A customer would be way more likely to twist arms.

graysky · 2012-01-31 22:22:56

I'm no customer but if you wanna PM me an email addy or the like, I would be glad to make the request on behalf of the Arch laptop users

Avi Miller · 2012-01-31 22:24:51

graysky wrote:

I'm no customer but if you wanna PM me an email addy or the like, I would be glad to make the request on behalf of the Arch laptop users

Well, Chris himself uses Arch Linux, so you may be in luck. My email address is my firstname dot lastname at oracle dot com.

bruce · 2012-02-02 02:19:13

Does the rigidity of the snapshotting (ie that it's every 30 seconds and 6 are kept) mean that if enough changes occur within that timeframe, then you'll hit issues with running out of space?
As an example (because I can't think of how to explain what I mean :-p ), I'm using nilfs now which does snapshotting as well, and if I do a pacman -Su that changes enough data, then the delta between the last snapshot and the current one is enough to fill the filesystem and hence crash pacman as it's run out of space.

Eg I have a 1.3 gig root, with ~300 meg free, and if I do an upgrade which is about 600 meg, then the difference between the snapshot that was made before I started the upgrade and the end of the upgrade is enough to mean I run out of space before it's completed. (This is slightly different with nilfs because of the cleaner daemon, but I think the concept still carries across)

Based on how I'm reading what you're saying, this type of scenario could/would occur with the btrfs automatic snapshotting, is that right?
And if it would, could something like starting a "transaction" that stops the auto snapshotting while a package update is going on help avert this? (Committing it once the update is done would close the transaction and return to the normal state of auto-snapshotting.)

I just feel like it could be worth considering as a possible issue with the "rigid" approach... And package updates are something that does often result in a lot of changes in a very short time. They're also often the worst time to have things "go wrong" :-p

Avi Miller · 2012-02-02 02:23:49

bruce wrote:

Does the rigidity of the snapshotting (ie that it's every 30 seconds and 6 are kept) mean that if enough changes occur within that timeframe, then you'll hit issues with running out of space?

As I said in a previous post, there was a misunderstanding about what I said in my presentation: we do not snapshot every 30 seconds. We run a rolling checkpoint of the btree roots every 30 seconds. These are tiny, and are done in a circular fashion, with 4 copies kept (2 minutes worth) at any time.
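
The circular checkpointing can be pictured with a small Python model. This is a sketch of the idea only, not the on-disk format: the `RootRing` class, the `consistent` flag, and the generation numbers are assumptions invented for the example. The only values taken from this thread are the 4 slots and the 30 second interval.

```python
from collections import deque

SLOTS = 4              # checkpoints kept at any time
INTERVAL_SECONDS = 30  # time between checkpoints


class RootRing:
    """Fixed-size ring of tree-root checkpoints; the oldest is overwritten."""

    def __init__(self, slots=SLOTS):
        self.ring = deque(maxlen=slots)

    def checkpoint(self, generation, consistent=True):
        # Appending to a full deque silently drops the oldest entry.
        self.ring.append({"generation": generation, "consistent": consistent})

    def newest_usable(self):
        # Walk backwards from the newest checkpoint until one is consistent.
        for root in reversed(self.ring):
            if root["consistent"]:
                return root["generation"]
        return None


ring = RootRing()
for generation in range(1, 7):
    # Pretend the two most recent checkpoints were torn by a power failure.
    ring.checkpoint(generation, consistent=generation < 5)

kept = [root["generation"] for root in ring.ring]
usable = ring.newest_usable()
window = SLOTS * INTERVAL_SECONDS

print(kept, usable, window)  # [3, 4, 5, 6] 4 120
```

Because each entry is only a pointer to a tree root, the ring itself stays tiny no matter how much data changes between checkpoints.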

As an example (because I can't think of how to explain what I mean :-p ), I'm using nilfs now which does snapshotting as well, and if I do a pacman -Su that changes enough data, then the delta between the last snapshot and the current one is enough to fill the filesystem and hence crash pacman as it's run out of space.

This same scenario may occur with btrfs, but it's unlikely: on Oracle Linux with yum-plugin-fs-snapshot, we create a snapshot of the root filesystem before we do a yum upgrade. The changes are then copy-on-write for each snapshot. Also, btrfs includes on-disk compression which can make more efficient use of free disk space. For this and other reasons, -ENOSPC on btrfs is very, very tricky.
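
For Arch users who want the same snapshot-before-upgrade behaviour, a wrapper along these lines would be one way to sketch it in Python. This is not how yum-plugin-fs-snapshot works internally, and everything specific here is an assumption: the `/.snapshots` destination, the `pre-upgrade-` naming scheme, and the use of `pacman -Syu` as the upgrade step. It relies only on the `btrfs subvolume snapshot <source> <dest>` command, and it defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from datetime import datetime


def snapshot_command(source="/", dest_dir="/.snapshots", now=None):
    """Build the btrfs command that snapshots `source` under `dest_dir`."""
    now = now or datetime.now()
    name = now.strftime("pre-upgrade-%Y%m%d-%H%M%S")
    return ["btrfs", "subvolume", "snapshot", source, f"{dest_dir}/{name}"]


def snapshot_then_upgrade(upgrade_cmd=("pacman", "-Syu"), dry_run=True):
    """Take a snapshot first, then run the upgrade; stop if the snapshot fails."""
    steps = [snapshot_command(), list(upgrade_cmd)]
    for step in steps:
        if dry_run:
            print(shlex.join(step))
        else:
            # check=True aborts before the upgrade if the snapshot step fails.
            subprocess.run(step, check=True)
    return steps


planned = snapshot_then_upgrade(dry_run=True)
```

Running it for real needs root, a btrfs root filesystem, and an existing destination directory on that same filesystem. Keep in mind that the snapshot pins the pre-upgrade blocks, so the space they occupy is only released once the snapshot is deleted.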

Arch Linux
