<stuff about using lvmts as a means to get last read and write times of blocks>
It took a while, but I finally got the functionality needed for this working.
lvmtscd now writes updates to a file every few minutes and dumps all collected data to a file when it's killed with Ctrl+C (or, in general, when it receives SIGTERM). The other piece of the puzzle is the lvmtscat command, which outputs the last time a block has been read or written and "hotness" scores for reads and writes. Both are pushed to GitHub.
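If lvmtscd is running in the background, the same dump can be triggered by hand with a plain SIGTERM; a minimal sketch, assuming a single lvmtscd instance is running:

# ask the backgrounded lvmtscd to dump its collected stats and exit
kill -TERM "$(pidof lvmtscd)"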
Now to write the application that will move the extents between PVs...
sorry for my miserable english, it's my third language ; )
Good news everyone!
I've got an alpha version working!
Requirements:
* an LVM setup with at least two physical volumes (PVs); an SSD and HDD combo is recommended, fast and slow HDD arrays (like RAID 10 paired with RAID 5) will work too, but won't have the same impact
* root access
How to use
First, make sure you have dependencies installed:
pacman -S lvm2 blktrace confuse base-devel git
Checkout and compile:
git clone https://github.com/tomato42/lvmts.git
cd lvmts
make
Switch to root.
Mount debugfs (this has to be done on every restart):
mount -t debugfs debugfs /sys/kernel/debug
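To avoid redoing that mount after every reboot, a standard fstab entry works too (this is plain debugfs, nothing lvmts-specific; adjust to taste):

# /etc/fstab
debugfs  /sys/kernel/debug  debugfs  defaults  0  0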
Start the collector daemon (best done from the directory where you compiled the application):
./lvmtscd -f volumeGroupName-logicalVolumeName.lvmts -l /dev/volumeGroupName/logicalVolumeName
Now, while the stats are collecting, let's edit the doc/sample.conf config file.
Don't rename it and don't move it; its path is hard-coded in the application (for now). Just edit it in place.
You need to change the title of the "volume" section from "stacja-dane" to the name of the stats file (the -f parameter of lvmtscd, sans ".lvmts").
Then set "LogicalVolume" and "VolumeGroup" to the names of the actual LVM entities. You may want to increase pvmoveWait to 5m and checkWait to 10m or more.
Then the most important part: defining tiers. The highest (fastest) tier has the lowest "tier" number. "path" is the name of the physical volume in the volume group containing the LogicalVolume.
pinningScore is how much better extents in a lower tier have to score before a swap occurs.
maxUsedSpace limits how much space this LV may use on that PV (5 extents are always left free to allow swaps between tiers).
Tiers have to have different tier numbers and non-increasing pinningScore.
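To tie the options together, here is a rough sketch of what an edited config could look like. It is only an illustration based on the description above, not a verbatim copy of doc/sample.conf; the nested section name ("pv"), the device paths, the sizes and the scores are all made up, so follow the actual sample file for the exact syntax:

volume "myvg-mylv" {          # matches "-f myvg-mylv.lvmts" given to lvmtscd
    VolumeGroup   = "myvg"
    LogicalVolume = "mylv"
    pvmoveWait = 5m           # wait after each pvmove
    checkWait  = 10m          # how often to re-evaluate extent placement

    pv {                      # fastest tier: SSD-backed PV
        tier = 0
        path = "/dev/sdb1"
        pinningScore = 100
        maxUsedSpace = 40G
    }

    pv {                      # slower tier: HDD-backed PV
        tier = 1
        path = "/dev/sdc1"
        pinningScore = 50
        maxUsedSpace = 2T
    }
}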
When you have edited the file, just start lvmtsd:
./lvmtsd
Ctrl+C will stop it gracefully. The same works for lvmtscd when it's started in the foreground (with the -d switch).
The application should be safe to use, but as with all software, I can't guarantee it. The only two kinds of "write" access it performs are writing the stats file (the -f parameter to lvmtscd) and running the pvmove command (one extent at a time). So as long as you don't put anything weird in the file name and pvmove doesn't have any latent bugs, it should be completely safe.
sorry for my miserable english, it's my third language ; )
@Tomato
Nice project(s), Tier & lessfs. I have a few issues, though.
RAID 0? So if either the SSD or a normal SATA disk fails, the whole setup fails? If that's the case I'd really like an option to "recover".
To explain: I'm running one hardware system with 2 desktops & 4 "servers"; currently I have 8 disks using about 80 watts of power.
My main goal is to reduce power consumption, so I could just buy 4x 3 TB disks in RAID 10 (6 TB left for usage) plus 1 or 2 SSDs.
If it's possible to use tiered storage I could increase overall system performance by 200% (faster boot times etc.).
By using lessfs (assuming dedupe on the block level), and assuming it works transparently enough, my virtual disks would be deduped (2x Win7 and 4x Win2k8).
Then I would be able to spin down the 2nd tier (the SATA disk RAID array).
I would not mind if the SSD contents were "mirrored" on the SATA array (for example losing 250 GB of "usable" space on the RAID array). If one or both SSDs failed, I could use a live CD to recover via a "lessfs --restore-dedupe-SSD1" command.
Of course if the RAID array fails I am very unlucky.
Also, if I understand correctly, this is not a bootable solution, so I need a separate "boot" device?
@Tomato
Nice project(s), Tier & lessfs. I have a few issues, though.
RAID 0? So if either the SSD or a normal SATA disk fails, the whole setup fails? If that's the case I'd really like an option to "recover".
No, no chance of recovery. If you want resilience, you need to build the lvmts array using RAID levels with redundancy.
To explain: I'm running one hardware system with 2 desktops & 4 "servers"; currently I have 8 disks using about 80 watts of power.
My main goal is to reduce power consumption, so I could just buy 4x 3 TB disks in RAID 10 (6 TB left for usage) plus 1 or 2 SSDs.
If it's possible to use tiered storage I could increase overall system performance by 200% (faster boot times etc.).
Yes and yes.
Using lvmts you can effectively create MAID ( http://en.wikipedia.org/wiki/MAID ) and speed up the performance of the storage.
Not everything is roses, though. In my tests I saw some bottlenecks in LVM that made it impossible to achieve more than a 300% gain with a mixed setup (HDD+SSD), while a pure SSD setup was over 10 times faster. If you know exactly which volumes will see high activity and when, writing scripts that migrate the volumes yourself between the SSD-backed PV and the HDD-backed PV will give much better results.
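Such a script boils down to the same standard pvmove calls lvmts issues itself; a minimal sketch (the volume group, LV and PV names are made up for the example):

# evening: push the extents of LV "fastdata" from the HDD-backed PV to the SSD-backed PV
pvmove -n fastdata /dev/sdc1 /dev/sdb1
# morning: move them back to the HDD-backed PV
pvmove -n fastdata /dev/sdb1 /dev/sdc1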
By using lessfs (assuming dedupe on the block level), and assuming it works transparently enough, my virtual disks would be deduped (2x Win7 and 4x Win2k8).
Then I would be able to spin down the 2nd tier (the SATA disk RAID array).
I didn't see good performance from LessFS, though I used it with relatively small files...
I would not mind if the SSD contents were "mirrored" on the SATA array (for example losing 250 GB of "usable" space on the RAID array). If one or both SSDs failed, I could use a live CD to recover via a "lessfs --restore-dedupe-SSD1" command.
Not possible, and it's unlikely I'll ever do something like this. My main objective was to get a system that's as non-intrusive as possible (see the answer to your last question). The data can be here or there only; it resides on both disks only during a migration. Otherwise you'd get a system that's only as fast as the slowest disk, and only reads could be sped up in such a configuration.
Of course if the RAID array fails I am very unlucky.
Sorry, but I have to quote somebody here (I don't know who, though): "RAID is not backup" :>
Also, if I understand correctly, this is not a bootable solution, so I need a separate "boot" device?
No, it's fully bootable. It's a userspace application; it uses LVM to provide the live migration functionality. An LVMTS'ed volume is just as bootable as any LVM volume.
The main goal of this software was to be portable, easy to install and requiring minimal maintenance (a SOHO solution, not big enterprise where you have people paid to grok this stuff). All system-level changes are performed using standard tools (lvmts is just a smart wrapper around blktrace and pvmove). This makes it fairly immune to any and all changes to the internal LVM implementation, possible bugs in lvmts itself, and so on. What's more, if it stops working completely (for one reason or another) you can clean up the mess manually (that is, if you don't want a single LV spread over a few PVs any more). That makes it possible for it to do basically anything LVM can do, including booting from an lvmts'ed volume.
sorry for my miserable english, it's my third language ; )
@Tomato
If my RAID 5 or 10 setup fails that would be "bad"; however, adding an extra point of failure with tiered storage would not be wise, I think.
The cost of a plain 3 or 4 TB SATA disk is low enough to provide plenty of storage; backing up that amount of data is not. Deduping is only useful if you have the "same data", which would primarily be virtual machine(s) and applications.
Maybe some music or video data (my guess: 5% if you're lucky on media). However, in my setup a dedupe of the data would mean that my whole system runs 50 to 75% more efficiently and costs less in $$$ because of the power usage.
However, plain disks are tried and proven, while SSDs are currently known to fail, especially if you put them in a RAID setup (they are themselves internally RAID), so putting 2 SSDs in a stripe (RAID 0) would increase the chance of failure because of the extra writes they would endure.
That being said, a backup of the SSD on a plain SATA disk (just in case) would in my opinion be wise. And if you made the SSD "cache" invisible, for example a virtual disk on the plain SATA RAID array that becomes active whenever the SSD(s) are offline, it would only cost some space on the RAID array. That is what I personally would activate and use.
Let's say I install a RAID 10 array of 4x 4 TB disks, install Linux, then add TIER.
I tell TIER I have a 2nd storage device, a 128 GB SSD, and TIER asks me if I would like to cache the SSD on the RAID 10 array. Yes, I answer.
Immediately I lose 128 GB of usable space on the RAID 10 array.
Now when TIER starts collecting its data, the 128 GB cache starts getting used; this cache can take far more writes than an SSD could, thus "saving" the lifespan of the SSD.
Writing this, another option would be to add a 2nd tier, 32 or 64 GB of SSD, for "dirty" data writes like swap and temp data (downloads, caching, that kind of data). If this is not possible at the block level, then at the OS level one could make separate "disks" for swap/tmpfs or "certain files" like virtual disks (ESX/Xen/VirtualBox etc.) and point all that data there. The "dirty" disk would work opposite to the caching disk: the less the data changes, the more it needs to be "saved" in the disk cache on the RAID 10 array.
So you would lose maybe 128 GB + 32 GB, or maybe 500 GB, but the RAID 10 or RAID 5 setup is still the only thing that needs to stay alive. If an SSD fails, a reboot would "activate" the failed SSD's cache on the RAID array.
Also, I have a 40 GB "virtual" disk that contains my primary "backup" data. At the block level this would be very easy to copy directly to an SD/USB device; the only thing needed is that this "device" is hidden from the users. Maybe you want to swap this device out at regular intervals to store it in a safe @the-bank or @another-place, and whenever it's plugged into another computer it's just a plain USB device with data (encryption or the disk FS is up to the user's configuration).
IMHO tiered FS data could go many ways...
@Tomato
If my RAID 5 or 10 setup fails that would be "bad"; however, adding an extra point of failure with tiered storage would not be wise, I think.
That's why you should always use RAID 1, 5 or 10 as the backing store, so at least two SSDs and two HDDs in the simplest "safe" configuration. But it isn't something I have any plans to enforce; any more complex configuration is a risk-benefit trade-off.
The cost of a plain 3 or 4 TB SATA disk is low enough to provide plenty of storage; backing up that amount of data is not. Deduping is only useful if you have the "same data", which would primarily be virtual machine(s) and applications.
If you need only a few TB, or don't have a problem with having it in a few pieces, then sure, go with such a setup. This software is for different workloads.
I don't know why you keep going on about dedupe; lvmts has nothing to do with deduping...
Maybe some music or video data (my guess: 5% if you're lucky on media). However, in my setup a dedupe of the data would mean that my whole system runs 50 to 75% more efficiently and costs less in $$$ because of the power usage.
Would or is? Any kind of dedupe needs lots of processing power and either a lot or huge amounts of RAM, and then it still doesn't provide the IOPS or bandwidth the raw disks are capable of.
If you're running lots of similar VMs then sure, dedupe may be a better solution. But imagine you're running 20 VMs, each with a different OS, different applications and different data sets, and all of them must be available while only a few of them need to be fast. Think VMs for software testing or development. You simply have no way to plan ahead which VM will see lots of use.
You may simply not have the problem I'm trying to solve.
However, plain disks are tried and proven, while SSDs are currently known to fail, especially if you put them in a RAID setup (they are themselves internally RAID), so putting 2 SSDs in a stripe (RAID 0) would increase the chance of failure because of the extra writes they would endure.
That being said, a backup of the SSD on a plain SATA disk (just in case) would in my opinion be wise. And if you made the SSD "cache" invisible, for example a virtual disk on the plain SATA RAID array that becomes active whenever the SSD(s) are offline, it would only cost some space on the RAID array. That is what I personally would activate and use.
Let's say I install a RAID 10 array of 4x 4 TB disks, install Linux, then add TIER.
I tell TIER I have a 2nd storage device, a 128 GB SSD, and TIER asks me if I would like to cache the SSD on the RAID 10 array. Yes, I answer.
Immediately I lose 128 GB of usable space on the RAID 10 array.
Now when TIER starts collecting its data, the 128 GB cache starts getting used; this cache can take far more writes than an SSD could, thus "saving" the lifespan of the SSD.
Writing this, another option would be to add a 2nd tier, 32 or 64 GB of SSD, for "dirty" data writes like swap and temp data (downloads, caching, that kind of data). If this is not possible at the block level, then at the OS level one could make separate "disks" for swap/tmpfs or "certain files" like virtual disks (ESX/Xen/VirtualBox etc.) and point all that data there. The "dirty" disk would work opposite to the caching disk: the less the data changes, the more it needs to be "saved" in the disk cache on the RAID 10 array.
So you would lose maybe 128 GB + 32 GB, or maybe 500 GB, but the RAID 10 or RAID 5 setup is still the only thing that needs to stay alive. If an SSD fails, a reboot would "activate" the failed SSD's cache on the RAID array.
Either you have the same data on the SSDs and the HDDs, writes can only be as fast as the HDDs allow and you treat the SSDs as disposable; or you allow some data to reside only on the SSDs and get faster storage, but then you have to RAID the SSDs for the data to be safe. I don't support the first use-case because, from what I've seen, just adding a stick or two of RAM to the server works better, is simpler and, more often than not, is cheaper.
Also remember that lvmts works below the level of the file system: I have no way of knowing which files are active, I can only see which sectors are active. I also have no way to ensure that file system metadata (the stuff that is essential for a file system to work) is placed on one or the other storage.
As for failure rates of SSDs. First: I've been managing computers with close to 20 SSDs, bought 2 years ago, in desktop and light server workloads. Not a single SSD has failed yet. Second: SSD failures are much more predictable, as they are directly related to the amount of data written to the drive. Any half-decent SSD has SMART attributes that tell you how "used up" it is.
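For instance, smartmontools will show those counters; the attribute names differ between vendors ("Wear_Leveling_Count", "Media_Wearout_Indicator" and "Total_LBAs_Written" are common ones), so take this only as an illustration:

# dump the SMART attribute table of the SSD and look for the wear indicators
smartctl -A /dev/sda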
Also, I have a 40 GB "virtual" disk that contains my primary "backup" data. At the block level this would be very easy to copy directly to an SD/USB device; the only thing needed is that this "device" is hidden from the users. Maybe you want to swap this device out at regular intervals to store it in a safe @the-bank or @another-place, and whenever it's plugged into another computer it's just a plain USB device with data (encryption or the disk FS is up to the user's configuration).
IMHO tiered FS data could go many ways...
Impossible without file system cooperation.
sorry for my miserable english, it's my third language ; )
*Wow* for building + sharing the project! Also for the very comprehensive explanation written in post #11 (I think).
Just for the record, Bcache For The Linux Kernel Might Finally Be Ready
Seeded last month: Arch 50 gig, derivatives 1 gig
Desktop @3.3GHz 8 gig RAM, linux-ck
laptop #1 Atom 2 gig RAM, Arch linux stock i686 (6H w/ 6yrs old battery ) #2: ARM Tegra K1, 4 gig RAM, ChrOS
Atom Z520 2 gig RAM, OMV (Debian 7) kernel 3.16 bpo on SDHC | PGP Key: 0xFF0157D9