Just for the record, Bcache For The Linux Kernel Might Finally Be Ready
@Tomato
If my RAID 5 or 10 setup fails, that would be "bad"; however, adding an extra point of failure with tiered storage would not be wise, I think.
That's why you should use RAID1, 5 or 10 as the backing store always. So have at least two SSDs and two HDDs in simplest "safe" configuration. But it isn't something I have any plans to enforce. Any more complex configuration is a risk-benefit trade.
The cost of a plain SATA disk of 3 or 4 TB is low enough to provide plenty of storage; backing up this amount of data is not. Deduping is only useful if you have the "same data", which would primarily be virtual machine(s) and applications.
If you need only a few TB, or don't have a problem with having it in a few pieces, then sure, go with such a setup. This software is for different workloads.
I don't know why you're going on about dedupe; lvmts has nothing to do with deduping...
Maybe some music or video data (my guess: 5% if you're lucky on media). However, in my setup a dedupe of the data would mean that my whole system would run 50 to 75% more efficiently, and cost less in $$$ because of the power usage.
Would or does? Any kind of dedupe needs lots of processing power and large to huge amounts of RAM. Even then it doesn't provide the IOPS or bandwidth the raw disks are capable of.
If you're running lots of similar VMs then sure, dedupe may be a better solution. But consider running 20 VMs, each one with a different OS, different applications and different data sets, where all of them must be available but only a few of them need to be fast. Think VMs for software testing or development. You simply have no way to plan ahead which VM will see lots of use.
You may simply not have the problem I'm trying to solve.
However, plain disks are tried and proven, while SSDs are currently known to fail, especially if you put them in a RAID setup (they are themselves internally RAID-like), so putting 2 SSDs in a stripe (RAID 0) setup would increase the chance of failure because of the extra writes they would endure.
That being said, a backup of the SSD on a plain SATA disk (just in case) would, in my opinion, be wise. And if you made the SSD "cache" invisible (for example, a virtual disk on the plain SATA RAID array that becomes active whenever the SSD(s) are offline), it would cost only some space on the RAID array. That is what I personally would activate and use.
Let's say I install a RAID 10 array of 4x 4TB disks, install Linux, then add TIER.
I tell TIER I have a second storage device, a 128GB SSD. TIER would ask me if I would like to cache the SSD on the RAID 10 array. Yes, I answer.
Immediately I lose 128 GB of usable space on the RAID 10 array.
Now, when TIER starts collecting its data, the 128 GB cache starts getting used. This cache can be written to much more than an SSD could endure, thus saving the lifespan of the SSD.
Writing this, another option comes to mind: add a second tier of 32 or 64 GB of SSD for "dirty" data writes like swap and temp data (downloads, caching, that kind of data). If this is not possible at the block level, then at the OS level (swap/tmpfs, or "certain files" like virtual disks for ESX/Xen/VirtualBox etc.)
one could make separate "disks" and point all that data there. The "dirty" disk would work opposite to the caching disk: the less the data changes, the more it needs to be saved to the disk cache on the RAID 10 array. So you would lose maybe 128 GB + 32 GB, or maybe 500 GB, but the RAID 10 or RAID 5 setup is still the only thing that needs to stay alive. If an SSD fails, a reboot would activate the failed SSD's cache on the RAID array.
Either you have the same data on SSDs and HDDs, so writes can only be as fast as the HDDs allow and you treat the SSDs as disposable, or you allow some data to reside only on SSDs and get faster storage, but then you have to RAID the SSDs for the data to be secure. I don't support the first use case because, from what I've seen, just adding a stick or two of RAM to the server works better, is simpler, and is more often than not cheaper.
Also remember that lvmts works below the level of the file system; I have no way of knowing which files are active or not. I can only see which sectors are active or not. I also have no way to ensure that file system metadata (the stuff that is essential for a file system to work) is placed on one or the other storage.
As for failure rates of SSDs. First: I've been managing computers with close to 20 SSDs, bought 2 years ago, in desktop and light server workloads. Not a single SSD has failed yet. Second: SSD failures are much more predictable; they are directly related to the amount of data written to them. Any half-decent SSD has SMART attributes that tell you how "used up" the SSD is.
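Those SMART attributes can be read with smartctl; the attribute names vary by vendor, so the ones below are examples, not universal:

```shell
# Dump all SMART attributes and look for the vendor's wear indicator,
# e.g. Media_Wearout_Indicator (Intel) or Wear_Leveling_Count (Samsung).
smartctl -A /dev/sda

# Total_LBAs_Written (where present) times the sector size gives the
# total amount of data ever written to the drive.
smartctl -A /dev/sda | grep -i -e wear -e lbas
```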
also I have a 40 gb "virtual" disk that contains my primairy "backup" data, on block level this would be very easy to copy directly to a SD/USB device, the only thing that needed to be done is that this "device" is hidden from the (users), maybe you want to change on regular intervals this device to store in a safe @the-bank or @another-place. and whenever its plugged in to an other computer it is just a plain USB device with data (encryption or diskFS is up to the users configuration)
IMHO tiered FS data could go many ways ..
Impossible without file system cooperation.
@Tomato
Nice project(s), Tier & lessfs. I have a few issues, though.
RAID 0? So if either the SSD or a normal SATA disk fails, the whole setup fails? If this is the case, I really would like the option to "recover".
No, no chance of recovery. If you want resilience, you need to build the lvmts array using RAIDs with redundancy.
To explain: I am running 1 hardware system with 2 desktops & 4 "servers". Currently I have 8 disks using about 80 watts of power.
My main goal is to reduce power consumption, so I could just buy 4x 3TB disks in RAID 10 (6TB remaining for usage) + 1 or 2 SSDs. If it's possible to use tiered storage, I could increase overall system performance by 200% (faster boot time etc.)
Yes and yes.
Using lvmts you can effectively create a MAID ( http://en.wikipedia.org/wiki/MAID ) and speed up the performance of the storage.
Not everything is roses, though. In my tests I saw some bottlenecks in LVM that made it impossible to achieve more than 300% gains with a mixed setup (HDD+SSD), while a pure SSD setup was over 10 times faster. If you know exactly which volumes will see high activity and when, creating scripts to migrate the volumes yourself between an SSD-backed PV and an HDD-backed PV will give much better results.
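That manual-scripting alternative needs nothing beyond stock LVM; a sketch, with the VG, LV and PV names as placeholders:

```shell
# Move all extents of the "database" LV to the SSD-backed PV before the
# busy period; pvmove works online, so the LV stays usable throughout.
pvmove -n database /dev/hdd_pv /dev/ssd_pv

# ...and move them back to the HDD-backed PV afterwards.
pvmove -n database /dev/ssd_pv /dev/hdd_pv
```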
By using lessfs (assuming dedupe at the block level), and assuming this works transparently enough, my virtual disks would be deduped (2x Win7 and 4x Win2k8).
Then I would be able to spin down the second tier (the SATA disk RAID array).
I didn't see good performance from lessfs; I used it with relatively small files, though...
I would not mind if the SSD contents were "mirrored" on the SATA array (for example, losing 250GB of "usable" space on the RAID array). If one or both SSDs failed, I could use a live CD to recover via a "lessfs --restore-dedupe-SSD1" command.
Not possible, and it's unlikely I'll ever do something like this. My main objective was to get a system that's as non-intrusive as possible (see the answer to your last question). The data can be here or there only; it resides on both disks only during migration. Otherwise you'd get a system that's as fast as the slowest disk; only reads could be sped up in such a configuration.
Of course, if the RAID array fails I am very unlucky.
Sorry but I have to quote somebody here, don't know who, though: "RAID is not backup" :>
Also, if I understand correctly, this is not a bootable solution, so I need a separate "boot" device?
No, it's fully bootable. It's a userspace application, it uses LVM to provide the live migration functionality. LVMTS'ed volume is just as bootable as any LVM volume.
The main goal of this software was to be portable, easy to install and requiring minimal maintenance (a SOHO solution, not a big enterprise where you have people being paid to grok stuff). All system-level changes are performed using standard tools (lvmts is just a smart wrapper around blktrace and pvmove). This makes it fairly immune to any and all changes to the internal LVM implementation, possible bugs in lvmts itself and so on. What's more, if it stops working completely (for one reason or another), you can clean up the mess manually (that is, if you don't want to have a single LV spread over a few PVs any more). That makes it possible for it to do basically anything LVM can do, including booting from an lvmts'ed volume.
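Because of that design, standard LVM tooling always shows (and can undo) whatever lvmts has done; placeholder names again:

```shell
# See which PVs each segment of the LV currently occupies.
lvs -o +devices,seg_pe_ranges vg0/volume0

# Manual cleanup: consolidate the LV back onto a single PV.
pvmove -n volume0 /dev/ssd_pv /dev/hdd_pv
```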
I've got an alpha version working!
Requirements:
* LVM setup with at least two physical volumes (PVs), SSD and HDD combo recommended, fast and slow HDDs (like RAID10 with RAID5) will work too, but won't have the same impact
* root access
How to use
First, make sure you have dependencies installed:
pacman -S lvm2 blktrace confuse base-devel git
Checkout and compile:
git clone https://github.com/tomato42/lvmts.git
cd lvmts
make
Switch to root.
Mount debugfs (this has to be done on every restart):
mount -t debugfs debugfs /sys/kernel/debug
Start the collector daemon (best to do it in the directory where you compiled the application):
./lvmtscd -f volumeGroupName-logicalVolumeName.lvmts -l /dev/volumeGroupName/logicalVolumeName
Now, while the stats are collecting, let's edit the doc/sample.conf config file.
Don't rename it, don't move it; its path is hard-coded in the application (for now). Just edit it.
You need to change the title of the "volume" section from "stacja-dane" to the name of the file (the -f parameter of lvmtscd, sans ".lvmts").
Then set "LogicalVolume" and "VolumeGroup" to the names of the actual LVM entities. You may want to increase pvmoveWait to 5m and checkWait to 10m or more.
Then the most important part: defining tiers. The highest (fastest) tier has the lowest "tier" number. "path" is the name of the physical volume in the volume group containing the LogicalVolume.
pinningScore is how much better extents in lower tiers must be to cause a swap to occur.
maxUsedSpace limits how much space may be used by this LV (there are always 5 extents left free to allow swaps between tiers).
Tiers have to have different tier numbers and non-increasing pinningScore.
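Putting the pieces together, an edited file might look roughly like this. This is only a sketch based on the options described above; the exact syntax is in doc/sample.conf, and the names, times and sizes here are made up:

```conf
# Section title = the -f parameter of lvmtscd, sans ".lvmts"
volume "vg0-root" {
    LogicalVolume = "root"
    VolumeGroup = "vg0"
    pvmoveWait = "5m"
    checkWait = "10m"

    # Highest (fastest) tier: lowest tier number, highest pinningScore
    tier "ssd" {
        tier = 0
        path = "ssd_pv"      # PV in vg0 holding the fast storage
        pinningScore = 100
        maxUsedSpace = "100G"
    }

    # Lower tier: higher tier number, non-increasing pinningScore
    tier "hdd" {
        tier = 1
        path = "hdd_pv"
        pinningScore = 50
        maxUsedSpace = "2T"
    }
}
```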
When you have edited the file, just start lvmtsd:
./lvmtsd
Ctrl+C will stop it gracefully. It works on lvmtscd started in foreground (with -d switch) too.
The application should be safe to use, but as with all software, I can't guarantee it. The only two kinds of "write" access it does are writing the stats file (the -f parameter to lvmtscd) and running the pvmove command (one extent at a time). So if you don't put anything weird in the file name and pvmove doesn't have any latent bugs, it should be completely safe.
<stuff about using lvmts as a means to get last read and write times of blocks>
It took a while, but I finally got the functionality needed for this working. lvmtscd now writes updates to its stats file every few minutes and dumps all collected data when it's Ctrl+C killed (or, in general, receives SIGTERM). The other piece of the puzzle is the lvmtscat command, which outputs the last time a block was read and written, plus "hotness" scores for reads and writes. Both are pushed to github.
Now to write the application that will move the extents between PVs...
flashcache (requires kernel module, untested on anything older than 2.6.32)
True, but flashcache is written by Facebook and some other companies are testing it out too, so I guess it's getting some traction.
https://github.com/facebook/flashcache
http://highscalability.com/blog/2012/1/ … cache.html
That said, this solution is still cool since it's only in userspace and (I guess) is less invasive (it uses existing DM technology).
Some ideas for a config file format (staying close to lvm.conf, as this might go mainstream with lvm one day):
lvmts {
    vg {
        name = "group0";
        threshold = 10; // default
        lv {
            name = "volume0";
            read_multiplier = 0.5;
            write_multiplier = 2.5;
        }
        lv {
            name = "volume1";
            read_multiplier = 1.0; // default
            write_multiplier = 1.0; // default
        }
        pv {
            uuid = "ABCD-1001";
            performance = 200;
            threshold = 10;
        }
        pv {
            uuid = "EFAD-1002";
            performance = 100; // default
        }
    }
}
This would put physical volumes in relation and add multipliers to the logical volumes.
volume0 is a database, so it will cache read data pretty well but choke on writes: multiply all write numbers collected from the stats daemon by 2.5.
pv 1001 is an SSD, so set its performance to a higher value than 1002's.
Threshold: a higher value means less movement on the storage. 10% means a move candidate from HDD must score at least 11,000 IO points if the lowest score on SSD is 10,000 IO points. This could be defined per volume group and/or per physical volume. Definition per physical volume would be great for multi-tier; for example, moving from RAID-6 to RAID-10 requires only a 10% score difference, moving from RAID-10 to SSD 30%.
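The multiplier-and-threshold scheme can be sketched in a few lines of Python (the function names and integer scoring are mine, not lvmts code):

```python
def score(reads, writes, read_mult=1.0, write_mult=1.0):
    """Weighted IO score of an extent after the per-LV multipliers."""
    return reads * read_mult + writes * write_mult

def should_promote(candidate, lowest_ssd, threshold_pct=10):
    """Move an extent up a tier only if it scores at least threshold_pct
    percent more than the weakest extent already on the faster PV.
    Integer math sidesteps float rounding right at the boundary."""
    return candidate * 100 >= lowest_ssd * (100 + threshold_pct)

# The example from the text: with a 10% threshold, a candidate from HDD
# needs at least 11,000 IO points when the lowest SSD score is 10,000.
assert should_promote(11000, 10000)
assert not should_promote(10999, 10000)

# volume0 is a write-choked database, so its writes are weighted 2.5x:
assert score(1000, 2000, read_mult=0.5, write_mult=2.5) == 5500.0
```

A higher threshold means fewer pvmove operations, at the cost of reacting more slowly to shifting hot spots; per-PV thresholds would let a cheap RAID-6 to RAID-10 move use 10% while the scarcer SSD tier demands 30%.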
Besides, the whole idea isn't to give higher throughput but to give a higher IOPS rating for the same storage.
Other than seek time, sequential reads would improve with this.
My question is: how do you keep your LVM volume from becoming extremely fragmented?
I am excited about the idea. I have not been able to find a way to optimize KVM storage without knowing something about the usage pattern.
This seems like the best idea I have seen so far.
If the backup solution remembers/copies the block access times at the point of taking the backup, then yes, it's definitely possible (I will be keeping write and read statistics separately anyway). There's one "but" though: you can't ensure that blktrace is running at all times.
Yes, the overall setup must ensure blktrace is running any time a write may occur, of course. Possibly a trigger invalidating the whole bitmap when the LV activation time is newer than the last statistics timestamp.
I think a generation counter is required for a backup solution where blktrace is running without gaps. Something like a SIGHUP to the statistics-gathering daemon to start a new blktrace, save the results from the previous one, kill the old blktrace and increment a counter by 1. A backup could take an LVM snapshot, signal the stats daemon, copy according to the saved stats, and remove the snapshot.
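A backup cycle along those lines could look roughly like this; lvcreate/lvremove are stock LVM, while the SIGHUP rotation and the dirty-block copy step are hypothetical, exactly as proposed above:

```shell
# 1. Freeze a consistent image of the volume.
lvcreate -s -n backup_snap -L 1G /dev/vg0/data

# 2. Ask the stats daemon (hypothetical) to rotate its blktrace run:
#    save the old statistics and bump the generation counter.
kill -HUP "$(cat /run/lvmtscd.pid)"

# 3. Copy only the blocks the saved statistics mark as written since
#    the previous generation (the tooling here is the open question).

# 4. Drop the snapshot.
lvremove -f /dev/vg0/backup_snap
```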
But if the system was restarted, then you can be quite sure that the lvmtsd data is incomplete.
If you want to save rootfs, that's true. But consider a virtual machine environment or otherwise userspace managed LVs. Easy to start a daemon before any writes happen.
Creating a sha1 checksum for every PE and comparing it with the past value will be more effective, IMHO.
Unfortunately, that means reading the whole data from the disk again on each backup. Observing changes is far superior. I always assumed a new device mapper target would be required to keep that information (and for a complete, rootfs-capable solution that's still the case, I think), but using blktrace is much easier to get done. A checksum-based sync is more of a fallback in case the bitmap is invalidated.
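The sha1-per-PE fallback is easy to sketch in Python; a minimal sketch where the 4 MiB extent size is an assumption (LVM's default), not something read from the VG metadata:

```python
import hashlib

# Default LVM physical extent size, 4 MiB (an assumption here).
PE_SIZE = 4 * 1024 * 1024

def extent_hashes(dev_path, pe_size=PE_SIZE):
    """Return one sha1 hex digest per physical-extent-sized chunk."""
    hashes = []
    with open(dev_path, "rb") as dev:
        while True:
            chunk = dev.read(pe_size)
            if not chunk:
                break
            hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes

def changed_extents(old, new):
    """Indices of extents whose checksum differs from the last backup.
    Extents beyond the end of the old list count as changed."""
    return [i for i, digest in enumerate(new)
            if i >= len(old) or old[i] != digest]
```

This shows exactly the downside pointed out above: extent_hashes() re-reads the whole volume on every run, while the blktrace bitmap only records what actually changed.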
I'm linking you to Matt Palmer, who just started a remote sync solution for LVM. He's exactly the guy who needs to interface with your statistics module :-)
I'm going slightly off-topic here, but do you have any idea how complete the blktrace information is? You're essentially creating a pool of statistics that is interesting for other purposes, too. If all writes can be mapped precisely to their LV, a dirty-state bitmap could be used for incremental block-level backups, which is another gap on Linux and very useful for virtualization. I'd like to launch the daemon before any writes to the LVs happen, let it watch which blocks are being written to, take an LV snapshot, and transfer only modified blocks off-site. I know that's like saying "while you're at it, write a backup solution, too", so I don't expect anything, but maybe you can incorporate this in the modularization you mentioned.