One more question - as I understand, ZFS is also a volume manager. If I give the whole disk to ZFS and create a pool on it, can I later carve out a part of that disk and give it to a vdev (visible under /dev/) formatted with another filesystem?
Yes, that would be a ZVOL.
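As a rough sketch (the pool and volume names here are made up for illustration), creating and formatting a zvol might look like this:

```shell
# Create a 10 GiB zvol inside an existing pool 'tank';
# on Linux it appears as a block device under /dev/zvol/.
zfs create -V 10G tank/extvol

# Format it with another filesystem, e.g. ext4, and mount it.
mkfs.ext4 /dev/zvol/tank/extvol
mount /dev/zvol/tank/extvol /mnt/extvol
```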
And are the overheads in such a case greater than when doing it with LVM?
I don't have any first hand experience, but I wouldn't be surprised if they are, and not just marginally. Don't forget, however, that you get the benefits of data integrity, transparent compression, sparse reservation and, depending on the setup, redundancy. ZFS snapshots also might be more convenient, but I've never done snapshots on LVM, so I don't know.
Well, I was using root BTRFS where the filesystem itself seemed to corrupt data. Hence putting it on Raid1 and scrubbing regularly seemed like a good idea. I still do that on some systems and have had no problems since. Hence I thought it might be a good idea with ZFS, too. [...] Right. But this is precisely what Raid1 could help with. The disk is a 2 TiB HDD with 2 or 3 platters. If platter 1 is damaged, the copy of the data is still on another platter.
I'm afraid you might be oversimplifying the problem. Bad things can happen to data in many different ways and in many different places. We could talk about the need for ECC memory, because bit rot might actually be more common or more dangerous there, we could go on about the need for checksumming the data that goes through different buses and so on, and so on. If you are serious about it, there's plenty of information on the Internet, but I'm afraid it's exactly the sheer volume of it that's a problem on its own. And we certainly cannot hope to cover it in a forum thread -- not to mention that I'm certainly not the best person to talk about such matters.
Just a brief example: long gone are the days of the ST-506 interface, where you could be pretty sure that you know the correct physical geometry of the hard drive. Nowadays, what you guess about how one of your partitions is on one platter and the other is on another is just a shot in the dark. While there certainly is some logic behind it, you can never know it for sure (except, perhaps, if you disassemble the drive and do some very complex tests with it), and that kinda defeats the whole idea of "platter" pseudo-redundancy. And I'm not even sure why you think that the possibility for a damage entirely constrained to one platter is that high. In the head crash example, it's not just the direct physical damage to the magnetic surface that matters: more problems could actually arise from the debris that gets thrown all around and the last thing you want in a hard drive are such pieces flying all around. Again, you'd better talk to experts in the area if you want to know more about how hard drives can and are likely to fail.
This is exactly the problem. My server is Zotac AQ01 with only one SATA port.
This is a very interesting thing you mentioned here, but I'm not sure I understood it.
Does it mean ZFS can work as if in Raid1 for only selected datasets (subvolumes? folders?), while keeping the rest of the partition as a single drive? And then during a scrub, if data is corrupted elsewhere, ZFS will inform me, but if it is corrupted in those selected datasets, it will seamlessly recover it from its "raid1" copy while serving me the correct data?
A dataset in ZFS is more or less what Btrfs calls a subvolume; typically you create a dataset as part of the filesystem hierarchy which holds a certain type of data: there are many examples around.
What I would do in your case is give the whole disk to ZFS and set the 'copies' property to either '2' or '3' on the datasets that will keep the important data (I'm almost sure you have to do this before you actually copy the data there). In the event that some part of the disk holding this data turns out to be defective, ZFS will notice the corruption and supply the data from the other copies -- provided, of course, that they haven't been damaged either (it's still one drive). So, yes, broadly speaking, that would be your 2-way or even 3-way mirror. A better analogy might be manually keeping a second or third copy of your data on the same disk yourself, except that ZFS does it transparently, with the added benefit of automagically knowing when data becomes corrupt.
Here's what man zfs says about it:
copies=1 | 2 | 3
    Controls the number of copies of data stored for this dataset. These copies are in addition to any redundancy provided by the pool, for example, mirroring or RAID-Z. The copies are stored on different disks, if possible. The space used by multiple copies is charged to the associated file and dataset, changing the used property and counting against quotas and reservations.

    Changing this property only affects newly-written data. Therefore, set this property at file system creation time by using the -o copies=N option.
So, yes, last paragraph confirms what I just wrote about having to set the property before writing the data.
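As a small sketch (the pool and dataset names are hypothetical), setting the property at creation time could look like:

```shell
# Create a dataset that keeps two copies of every data block;
# this must be in place before the data is written.
zfs create -o copies=2 tank/important

# Verify that the property took effect.
zfs get copies tank/important
```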
Other than that, I will have the most important 500GiB of this drive backed up to the Copy service. So that will help with off-site backup. But in general, this 2TiB disk is to serve as off-site backup for my other machines.
Seems reasonable. The only thing that I could possibly add is that there is no such thing as over-backing up.
Thanks for this exhaustive clarification.
You're most welcome, but keep in mind that I'm just an ordinary system administrator, so take my words with a grain of salt.
What is your opinion on Raid1 mirror made on one physical disk with two partitions? Does it make sense?
My limited understanding is that it keeps 100% read speed, loses 50% write speed and in exchange protects against bit rot. But the last time I discussed it, I found the idea universally criticised, yet the above positive was neither addressed nor refuted.
Well, I guess you could say this is a poor man's solution to bit rot, indeed. But it's just that: a poor man's solution that's most likely much more trouble than it's worth...
On the other hand, if bit rot is not a real-life problem, then there is no point to it on today's drives, which already contain protection against such issues.
...Well, my impression is that you're overestimating the importance of bit rot, and underestimating the importance of drive failures. While I don't have any statistical information at hand, common sense tells me that you're much more likely to suffer catastrophic damage to your data due to drive problems (not just complete drive failure, but also head crashes that produce massive areas of damaged magnetic surface, etc).
If you still insist on using only one disk (sometimes, that's simply not a matter of choice after all), I strongly suggest that you use a single partition (or the whole disk) dedicated to ZFS and set the 'copies' property to '2' or even '3' for those datasets that are important to you.
But perhaps more important is another thing: what's the purpose of that data? If you consider it as an online backup, then I'd say that it's best to simply have an offline backup as well, preferably even more than one in different physical locations and perhaps even on different types of media (some small, but especially crucial things I even keep printed in hex on paper -- laugh at me, if you'd like). While you can use the built-in data integrity facilities of ZFS and Btrfs, you could achieve basically the same results with the various available tools that do file checksumming. In this case, you don't really need redundancy, although it won't hurt either (except your pocket, to an extent, of course). If, on the other hand, you will be serving the data, then redundancy obviously is an advantage, because it can protect you from downtimes. But still, having the same type of backup as above is the best thing to do.
I suppose this answer is rather vague and trivial common sense, but it's hard to be more specific given the information known.
I understand ZFS is not about performance, however the 50% of BTRFS performance still surprised me.
Filesystem testing is quite a complicated matter: many different use cases, many possible load patterns. As you can imagine, having a relational database puts a very different stress on the filesystem compared to a video streaming service. But even this is a very simplified picture: for example, there are significant enough differences between the various SQL implementations, streaming of live events differs from streaming pre-recorded video, etc, etc. Therefore, for myself and the requirements of my work, I've given up on speed benchmarking. The question that I ask myself is rather: is this filesystem fast enough for my needs, will I feel restricted by its speed or not?
I can't really say if those 50% that you've seen are surprising or even if they are meaningful at all. Even from a statistical point of view, the data are simply not enough to make a reasonable judgement. The only somewhat more comprehensive tests that I recall having seen are those at Phoronix, and they seemed rather inconclusive. There were large variations in the measurements, but one could probably guess that, on average, ZFS does tend to lag somewhat behind the other filesystems in speed. Keep in mind as well that the last such test is from almost two years ago, and ZoL has made much progress since then.
While I'm not a filesystems expert, if speed is really important for you, my advice would be to make tests with different filesystems as close to what the real-life use would be as possible.
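For instance, a workload-shaped test could be sketched with fio; the job parameters below are only placeholders to adapt to your actual access pattern, and the target directory is assumed to be the mounted filesystem under test:

```shell
# Random 4K writes with a periodic fsync, closer to a database-like
# load than a single sequential dd; adjust rw/bs/fsync to match the
# real workload you expect.
fio --name=dbsim --directory=/mnt/hdd --size=1G \
    --rw=randwrite --bs=4k --ioengine=psync --fsync=32 \
    --runtime=60 --time_based --group_reporting
```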
You noted that for a single disk ZFS might not be the best choice. Why so? This is a server disk, mostly for storage of important data. Would ZFS non-mirrored not provide better data protection than XFS or ext4?
Perhaps you're mistaking the data integrity checksums with ECC. Both Btrfs and ZFS use checksums only for validation of data integrity, i.e. that the data blocks haven't changed on disk after they had been written. But when a damage is detected, those checksums cannot be used to reconstruct the original piece of information. The only way to do this is through data redundancy.
Technically, ZFS allows you to set a property on a dataset called 'copies': this would instruct ZFS to keep each data block of the dataset in up to three different locations on a vdev -- in your case, the vdev would be the physical disk (or partition). So, if sectors holding one of these blocks become damaged, ZFS could read the data from one of the other copies. But, of course, disks quite often fail completely, and the 'copies' property cannot save you from such failure (I think Btrfs has a similar mechanism, but it's just as restricted in terms of data recovery).
The only reasonable way of protection against total disk failures is, unsurprisingly, disk-level redundancy: mirror, raid{5,6}, etc. So, in comparison to XFS or ext4, when used on a single disk, ZFS and Btrfs offer only the benefit of knowing that your data has gone bad.
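In practice, that detection happens when data is read or when you run a scrub; roughly (the pool name is hypothetical):

```shell
# Walk the whole pool and verify every checksum.
zpool scrub tank

# Inspect the result; damaged files, if any, are listed here.
zpool status -v tank
```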
I changed my setup by eliminating LUKS and LVM, and now I have simple MBR->ZFS. The pool gets mounted automatically so it appears the problem was LVM or LUKS+LVM.
Pretty sure the problem has been LVM. As @demizer also noted, that's an unusual setup and I'm not even sure if we would've been able to find an all-fitting solution, considering that the opposite scenario, LVM over ZFS (to be more precise, over zvol), might be more prevalent. And then we'll probably face the chicken & egg problem: should we first init the LVM volumes or the ZFS pools.
Anyway, glad to see that you've settled on a stable solution.
However, taking the opportunity, I did some benchmarks to compare BTRFS and ZFS performance on the same (single, non-mirrored) partition, and the results are surprising.
I wouldn't actually be surprised by the results. ZFS has been designed not so much for speed (which is not to say that it can't be reasonably fast) as for flexibility, feature set and, most importantly, reliability. One must also always keep in mind that ZFS was designed initially for Solaris, and porting to other operating systems may introduce some inefficiency. Most of the speed and efficiency problems with Linux I believe have already been ironed out, but if you are after speed, ZFS might indeed not be your best choice. And if you will use it on a single disk, I'd say that both ZFS and Btrfs would be suboptimal choices.
So it appears as if ZFS was 2x slower in writes, and marginally faster at reads, but only every other time. Also surprising is the low hdparm score. I would think this remains unaffected by the filesystem, because it tests the raw device (sda) rather than the partition (sda1).
It should be unaffected, even if the partition were to be tested (it's still below the filesystem layer), and my guess is that there must've been some unaccounted for disk activity at the time, which has likely also affected the filesystem test. But again, even without such possible interference, I wouldn't be surprised to see ZFS being somewhat slower than the other filesystems. For me, even on the desktop, that's an insignificant issue, considering how ZFS never failed me despite some really unpleasant situations it has been put through. Btrfs, if we take it as comparison, is going to be a fine filesystem, I believe, but a filesystem usually matures in about a decade, so it's still not stable enough for me. I'll probably repeat myself here, but in the end it's really a question of identifying your priorities.
However, taking the opportunity, I did some benchmarks to compare BTRFS and ZFS performance on the same (single, non-mirrored) partition, and the results are surprising.
The test commands are:
hdparm -Tt /dev/sda
cd /mnt/hdd
dd if=/dev/zero of=tempfile bs=1M count=1024 conv=fdatasync,notrunc
echo 3 > /proc/sys/vm/drop_caches
dd if=tempfile of=/dev/null bs=1M count=1024
dd if=tempfile of=/dev/null bs=1M count=1024
rm tempfile
The average, consistent results for BTRFS (relatime, no compression):
hdparm: 130.47 MB/sec
Write: 129 MB/s
Read: 131 MB/s
Buffer cache: 1.5 GB/s
The results for ZFS were inconsistent and kept varying (noatime, no compression):
hdparm: 80.47 MB/sec
Write: 74.7 MB/s - 80 MB/s - 98 MB/s - 77.2 MB/s
Read: 74.7 MB/s - 166 MB/s - 122 MB/s - 156 MB/s
Buffer cache: 116 MB/s - 117 MB/s - 205 MB/s - 164 MB/s
So it appears as if ZFS was 2x slower in writes, and marginally faster at reads, but only every other time. Also surprising is the low hdparm score. I would think this remains unaffected by the filesystem, because it tests the raw device (sda) rather than the partition (sda1).
For comparison, XFS results on the same partition:
hdparm: 130.15 MB/sec
Write: 126 MB/s
Read: 136 MB/s
Buffer cache: 1.5 GB/s
Can someone comment on those results? If ZFS is that much slower on writes then I feel I should reconsider whether I should use it.
I understand ZFS is also a volume manager, but I am using LVM because:
1 - when I was setting this up, I was unfamiliar with ZFS volume management. In fact, I still am.
2 - I need to be able to have the option to convert this raid1 ZFS setup into, for example, single ZFS + single XFS or BTRFS without the necessity of copying the data to any other disk, if in the future I decide to abandon ZFS. This is a remote system so I need to be able to do this when I am thousands of kilometres away.
This is a wild setup. So I am uncertain if this setup worked before, and stopped working when 0.6.4 was released? If that is the case, then there is likely a problem somewhere. However, if it never worked at all, it is probably because you are using ZFS in a way that is not widely used/tested.
From your first post it seems you can import it manually. Have you tried using "zfs_force=1" on the kernel command line?
See:
mkinitcpio -H zfs
I need to be able to have the option to convert this raid1 ZFS setup into, for example, single ZFS + single XFS or BTRFS without the necessity of copying the data to any other disk, if in the future I decide to abandon ZFS. This is a remote system so I need to be able to do this when I am thousands of kilometres away.
Yes, I see your point here. I'm wondering if putting a GPT or even an MBR wouldn't have provided a simpler and thus more robust solution. LVM certainly adds some undesirable complexity, especially for filesystems like ZFS and Btrfs, and will likely affect performance too.
Anyway, let's see if we can find a solution for your current construct -- hopefully one that will also be universal, i.e. not affecting people who don't use LVM.
I did no extra configuration of this system with regards to lvm other than what comes with stock arch iso. Here is the info I hope will answer your questions
Yes, you've read my mind, thanks!
Would it be convenient for you to do another test? Could you create the file /etc/systemd/system/zfs-import-cache.service.d/start-after-lvm.conf (the directory zfs-import-cache.service.d most certainly will need to be created as well) with the following contents:
[Unit]
After=lvm2-lvmetad.socket
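For clarity, the steps to put that drop-in in place would be something like this (run as root):

```shell
# Create the drop-in directory and the override file,
# then make systemd pick up the change.
mkdir -p /etc/systemd/system/zfs-import-cache.service.d
cat > /etc/systemd/system/zfs-import-cache.service.d/start-after-lvm.conf <<'EOF'
[Unit]
After=lvm2-lvmetad.socket
EOF
systemctl daemon-reload
```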
I'm not really sure if that will delay the start of zfs-import-cache.service enough, because, I think, the LVM devices will appear for sure only after the lvm2-pvscan@{device} service runs (which actually simply does a pvscan on a device). But that's the best that comes to my mind right now. I'll try to make a test setup similar to yours meanwhile, but unfortunately can't promise on any specific time frame.
P.S. I just did a quick check on one server where I have LVM (but not beneath ZFS), and if the above doesn't work, you might give it another try by substituting lvm2-lvmetad.service for lvm2-lvmetad.socket in the After= section.
I did no extra configuration of this system with regards to lvm other than what comes with stock arch iso. Here is the info I hope will answer your questions:
# systemctl status lvm2-monitor
● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; disabled; vendor preset: disabled)
Active: inactive (dead)
# cat /etc/mkinitcpio.conf
MODULES="btrfs radeon mmc_core mmc_block sdhci sdhci-pci lz4 lz4_compress"
BINARIES=""
FILES=""
HOOKS="base udev autodetect modconf block filesystems keyboard fsck"
This is my setup: one dm-crypt (LUKS) container -> LVM (2 partitions on one dm-crypt container) -> ZFS mirror
OK, I see now what you had in mind explaining about RAID1. Using LVM between ZFS and dm-crypt seems rather unusual to me (ZFS is a volume manager on its own, after all) and I wasn't expecting it. But it could, in the end, be the culprit of the problem, because it's very possible that zfs-import-cache.service gets started after the dm-crypt devices get available, but before the LVM ones do.
As I seldom use LVM outside of root device contexts, could you please explain how you have set it up: is it enabled in the initramfs and/or via the lvm2-monitor systemd service?
# zpool get cachefile bigdata
NAME PROPERTY VALUE SOURCE
bigdata cachefile - default
@kerberizer, you are correct, I am using dm-crypt on that drive.
This is my setup: one dm-crypt (LUKS) container -> LVM (2 partitions on one dm-crypt container) -> ZFS mirror