LVM Tiered Storage, aka LVMTS, is a set of daemons that use SSDs together with HDDs to create fast "hybrid" storage, letting any file system take advantage of fast flash disks without sacrificing the capacity of regular HDDs.
Configuration and usage of the first alpha version.
Old first post:
As some of you probably know, one of the nice features of ZFS is the ability to use two-tier storage. This way you can have a few small, fast SSDs speeding up a multi-terabyte array. Very nice, very cost efficient and... unavailable on Linux.
The only file system that promises such functionality on Linux is btrfs. Unfortunately, not only is the feature unimplemented, the whole FS is still (rightfully) marked EXPERIMENTAL.
But then I remembered that LVM can move single physical extents between devices. So I tried finding which blocks are accessed during a simple find:
Hardly fast.
This is the root partition of an old Arch install with lots of junk installed (about 60 packages from AUR): 12GiB total, 11GiB used, rather fragmented.
So I noted all the extents that were read during the command's run (slightly under 3.5GiB) and pvmoved them to a crappy class 6 SDHC card. The result was rather nice:
Over a 5-fold reduction in find run time!
So, using a standard kernel, I'm able to speed up the most used parts of any file system.
Quick HOWTO:
Download and install blktrace from AUR
Drop VM cache:
echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
run blktrace on LVM device (/dev/mapper/vg-lv)
run typical load
parse blktrace output:
blkparse <trace output file> | awk ' { print $8 } ' | awk ' !/\[/ { print $1 / (2*4*1024) } ' | cut -f1 -d. - | sort -n | uniq -c
use pvmove to move the most often read extents (the second column is the extent number on the device, the first the number of accesses)
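For example, a hedged sketch (/dev/md0 as the source HDD PV and /dev/md1 as the SSD PV are placeholders; the extent numbers would come from the blkparse output above, and since they are counted from the start of the LV you need to add the LV's first physical extent on the PV if it doesn't start at 0):

# move physical extents 15, 16 and 230 through 240 from the HDD PV to the SSD PV
pvmove /dev/md0:15:16:230-240 /dev/md1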
I've started work on a program that will monitor the block layer, rank all extents and move unused/rarely used extents to slow storage (a RAID5/6 array), extents used from time to time to relatively fast storage (a RAID1/10 array) and often used extents to very fast storage (an SSD device or SSD-based RAID), effectively creating tiered storage with any file system available on Linux.
Unfortunately I don't have a real SSD to check whether this will work well for big volumes, so I'm looking for people who are interested in speeding up their disk storage and have both big LVM volumes and SSDs to test.
Last edited by tomato (2012-08-02 21:56:22)
sorry for my miserable english, it's my third language ; )
Offline
*very*
*cool*
< Daenyth> and he works prolifically
4 8 15 16 23 42
Offline
nice, creative :-)
i wish i had an SSD, i'd def try this. you could maybe get similar results to the above with cheaper flash, like USB sticks maybe? i have a server running several VMs using LVM as a backing... as you say "extent", I'm guessing this technique would need a bona fide FS to look at, not just the LVM pool itself?
anyways, cool approach. i wonder if there is some kind of DM tool/daemon to do something like this... and if not... why not? :-)
C Anthony
what am i but an extension of you?
Offline
nice, creative :-)
i wish i had an SSD, i'd def try this. you could maybe get similar results to the above with cheaper flash, like USB sticks maybe?
The most you can get out of USB is 20-25MiB/s; IMO it's too slow for any real usage. I used the SD card just to see if it would have any positive impact. Writes to those blocks would be rather slow, I guess.
But it got me thinking: if I have 4 or 5 big drives (>1TB), I could set aside a small (~10GB) partition on each of them, RAID10 the small partitions, and RAID5 the big ones. Effectively "losing" about 20GB of space and gaining RAID10 performance for metadata and write-intensive stuff (e.g. the file system journal) while still having lots of storage for big files.
I just like to see all the space of my drives combined in one big pool; not only do I (usually) get better performance, but also redundancy.
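A minimal sketch of that layout, purely illustrative (the drive names, partition numbers and use of 4 drives are assumptions):

# small first partitions become a fast RAID10, big second partitions a large RAID5
mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[abcd]1
mdadm --create /dev/md2 --level=5 --raid-devices=4 /dev/sd[abcd]2
# put both arrays into one volume group so extents can be pvmoved between them
pvcreate /dev/md1 /dev/md2
vgcreate vg0 /dev/md1 /dev/md2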
i have a server running several VMs using LVM as a backing... as you say "extent", I'm guessing this technique would need a bona fide FS to look at, not just the LVM pool itself?
I was talking about LVM extents, not file system extents. This will work even with raw logical volumes and some custom code writing to them.
If you use lvdisplay or pvdisplay it will show the size of volumes in LEs and PEs; those are Logical Extents and Physical Extents (usually 4MiB in size).
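For example, to check the extent size and how many extents each PV has (vg0 is just a placeholder name):

vgdisplay vg0 | grep 'PE Size'
pvdisplay | grep -E 'PV Name|Total PE|Free PE'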
anyways, cool approach. i wonder if there is some kind of DM tool/daemon to do something like this... and if not... why not? :-)
C Anthony
It needs to be a daemon, monitoring the LVM(s) in real time.
The only things I could find that come close to this are bcache (unmaintained) and flashcache (requires a kernel module, untested on anything older than 2.6.32).
sorry for my miserable english, it's my third language ; )
Offline
*very*
*cool*
Seconded! and how!
As far as the OP's request goes: I have been contemplating upgrading my media server's main drive to an SSD to get things a little speedier. Right now it's running on a 250GB laptop hard drive (energy savings over full size), and my media storage is a software RAID level 5 with three drives.
I may just have to contemplate implementing this, if only to have something to tinker with.
Offline
*very*
*cool*
Thirded. Archers come up with some awesome stuff.
My home and work desktops are both running a 1TB RAID-1 with a 60GB SSD. Are there any risks with this?
Are you familiar with our Forum Rules, and How To Ask Questions The Smart Way?
BlueHackers // fscanary // resticctl
Offline
My home and work desktops are both running a 1TB RAID-1 with a 60GB SSD. Are there any risks with this?
The same as using RAID-0: if one of the drives fails, you lose all your data.
Considering that SSD failures are rather predictable (if the drive has SMART), I'd say that running a RAID-1 array alongside a single SSD is rather safe, but I'd use RAID-1 or RAID-10 for the SSDs too in a production environment.
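A quick way to keep an eye on SSD health (hedged: the exact attribute names and thresholds differ between vendors):

# dump all SMART attributes and look for wear / reallocated-sector counters
smartctl -A /dev/sdX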
sorry for my miserable english, it's my third language ; )
Offline
Good news!
I've started the project on github.
For now it's just a very raw prototype but it makes it a bit easier to experiment.
usage:
1. compile
git clone git://github.com/tomato42/lvmts.git
cd lvmts
gcc -std=gnu99 -Wall lvmtsd.c -o lvmtsd -Os
2. install blktrace
3. clean cache:
sync
echo 1 > /proc/sys/vm/drop_caches
echo 2 > /proc/sys/vm/drop_caches
4. run
btrace -a complete /dev/volume-group/lv-name | ./lvmtsd <first physical extent of LV> <number of free extents on SSD>
5. Run your workload (the longer the better)
6. In another terminal do
killall blktrace
Don't Ctrl+C lvmtsd or you'll lose all the collected data.
7. lvmtsd will print the extents to move; it's best to move them in small groups (it groups them in tens by default) using pvmove (see the verification sketch after the note below). For example:
pvmove /dev/md0:0:21:213:1024:1025:1026 /dev/md1
where /dev/md0 is the HDD RAID and /dev/md1 is the SSD RAID
Note: it can't handle non-contiguous LVs yet. It can handle extending the LV while it's running, though.
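To verify where the extents ended up after the pvmove in step 7, something like this should work (vg0/lv0 and the PV names are placeholders):

# list which PVs each LV segment sits on
lvs -o +devices vg0/lv0
# show the allocated extent ranges on the SSD PV
pvdisplay -m /dev/md1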
sorry for my miserable english, it's my third language ; )
Offline
This is _way_ cool!
The older I get the less time I have.
Offline
i still do not get it. what is this all about? since many people said how awesome it is, i also would like to understand it.
could anyone please clear my mind?
"They say just hold onto your hope but you know if you swallow your pride you will choke"
Alexisonfire - Midnight Regulations
Offline
Finally got my hands on a real SSD (OCZ Vertex 2; the thing's fast, over 100MiB/s of random access fast), so you can expect updates in the not so distant future.
i still do not get it. what is this all about? since many people said how awesome it is, i also would like to understand it.
Think of it like this: you want a 4TB array. An SSD array of that size will set you back at least $10000, while an HDD RAID of similar size will cost you less than $500, even with RAID10 redundancy.
The problem is that
find /
(or
updatedb
) will easily take more than 10 minutes on HDDs if you have large files (in the 100MB range), or a few dozen minutes if you have smaller files, but at most a minute with an SSD.
Now, the only thing that needs to be read for such a scan is metadata, which in itself is small: less than 1% of the volume.
If you could put only the metadata on an SSD, you'd get an array just as fast for such workloads as the $10000 one, for a fraction of the price.
In most small to medium businesses and home media servers you don't need thousands of IOPS (which big database and large web servers do need; that's why SSD-only storage has its place), but at the same time you want access to your most used files (Windows roaming profiles, DE configuration files) to be fast, without having to move them manually between fast and slow storage.
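If you want to see the difference on your own machine, a rough and purely illustrative way to measure it (run as root; numbers will vary wildly):

sync
echo 3 > /proc/sys/vm/drop_caches    # drops both the page cache and dentries/inodes in one go
time find / -xdev > /dev/null        # pure metadata workload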
Hope that clears up some of the confusion.
sorry for my miserable english, it's my third language ; )
Offline
I'd like to join in and bump this topic, as it is really hot. I've had the same idea today and was glad to see someone is already on to it.
There are lots and lots of scenarios where moving "hot" extents strategically will boost performance significantly.
My case is a medium-sized array of 12 HDDs combined with 4 SSDs (plus spares). Btw, the exact same OCZ Vertex 2 series ;-) All disks are paired RAID-1 (mdadm). There will be lots of virtual machines (KVM) on this array, so at setup time I can choose between striping across 6 HDD mirror pairs or 2 SSD mirror pairs. The LVM volume group spans all disk pairs, so afterwards I can pvmove between SSD and HDD (almost) at will.
Having a daemon that monitors I/O traffic, maps it to LV extents and pvmoves hot extents according to a configurable strategy would be really great. Imagine the benefits: all journals of the virtual machine filesystems would automagically be selected for SSD, so *all write latency* would improve notably. Any loaded database (with lots of writes in the same place) would be moved to SSD in the background, without downtime.
Or turn it all upside down: every LV is placed on SSD first. After setup, everything cools down and "cold" extents get moved away to cheap HDD space.
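LVM already supports that placement policy at creation time, so the "SSD first" approach needs no extra tooling; a hedged example (vg0, the size and the PV path are made up):

# allocate the new VM volume on the SSD PV only; cold extents can be pvmoved to HDD later
lvcreate -n vm01 -L 20G vg0 /dev/md_ssd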
One thing hard to get right is configurable strategies. Though this is currently only an experiment and far away from a product, I'd like to post some early ideas:
Setups without SSD can also benefit from this: HDDs can be short-stroked to keep seek times as low as possible. In that case, re-ordering extents on the same disk is an option.
Reads vs. writes: Servers usually only have to worry about writes, while desktops / laptops will be glad about read performance, too. A ratio knob comes to mind.
Measurements: Difficult to time right. Possibly split up into "extents hot this month", "extents hot this week" and "extents hot today" or a similar scheme.
Presets: Adjust the extent layout for a specific task. Create a measurement tagged "working with archive database", save the hot extents from that workload. Next time someone needs that archive, load the measurement by tag and move extents accordingly. More sophisticated on top of that: predicting recurring events, like a list of extents that gets hot only every Monday. Load and unload that list automatically.
But for starters, get measurements done without performance impact and *not* lose all measurement data on ctrl-c ;-)
Last edited by korkman (2011-10-22 11:52:05)
Offline
Setups without SSD can also benefit from this: HDDs can be short-stroked to keep seek times as low as possible. In that case, re-ordering extents on the same disk is an option.
An HDD-only setup can't be sped up much; the dominant part of seek latency is rotational latency: waiting for the data to show up under the head.
It could measurably speed things up only if you have vastly different HDDs: 5.6k and 10k, or 7.2k and 15k. Still, the difference between them isn't even an order of magnitude, unlike SSDs, where it's a few orders of magnitude...
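To put rough numbers on that (average rotational latency is half a revolution, i.e. 30/rpm seconds; the SSD figure is a ballpark):

5400 rpm ~ 5.6 ms, 7200 rpm ~ 4.2 ms, 10k rpm ~ 3.0 ms, 15k rpm ~ 2.0 ms
typical SSD random access ~ 0.1 ms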
sorry for my miserable english, it's my third language ; )
Offline
Yes, rotational latency is always there, but it's amplified by missing the target and having to wait another revolution. This is especially true for high queue depths. NCQ will try to catch all queued requests in one spin, but if the targets are just too far apart, they will have to wait another round.
Check out these benchmarks:
http://www.tomshardware.com/reviews/sho … 157-8.html
They used a quite extreme corridor, down to 5% of capacity, but it gave them a 40-60% IOPS boost on an ordinary 7,200rpm HDD. So if you have only 5% of your data in a "hot" state and re-order it, even on the same drive, you will gain that boost. And today, 5% is 150 GiB.
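Short-stroking itself is easy to reproduce (a hedged sketch; the device name is a placeholder and 5% is just the figure from the article):

# create a partition confined to the first 5% of the disk (the fastest, outermost tracks)
parted -s /dev/sdX mklabel msdos mkpart primary 0% 5%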
But you made another point: there may be more than two tiers. There's a huge difference between PCIe-based SSDs and SATA-based SSDs. SATA/SAS-based SSDs are throttled quite a bit by SAS expanders / controllers that can't cope with the huge IOPS on a single link. So there may be a three-tier setup like this:
- PCIe SSD with 500k-1m IOPS
- SATA/SAS SSD with 10-50k IOPS
- SATA/SAS HDD with 2-4k IOPS (typical 12 disk array)
Not *that* likely, because the price points of SATA/SAS SSDs and PCIe SSDs don't differ much in enterprise products (at least I'd expect so, as flash memory is the biggest part of the price). But people might come up with multi-tier storage of some kind, and it should be possible to express this in the configuration options.
Offline
What I see more likely is using different RAID levels together with different speed disks, so you have 3 tiers:
SSDs (either PCIe or SATA/SAS) in RAID 1 or RAID 10
10k+ HDDs in RAID1 or RAID10
7.2k or 5.6k "green" HDDs in RAID6
If you really want, you could use the first 5% of each of the RAIDed drives to create a RAID10 and the rest to create a RAID6, and lvmtsd would relocate the data between those PVs. I don't think I'll make relocations inside a single PV possible (i.e. sorting the data on a single PV by "hotness"); even though those are 4MiB blocks, fragmentation is still fragmentation, and it could kill big-file access performance.
Earlier you mentioned "get measurements done without performance impact". What do you mean? Running blktrace should have a negligible performance impact, and lvmtsd won't write much data to disk even when it's "production quality"...
sorry for my miserable english, it's my third language ; )
Offline
get measurements done without performance impact
Nothing implied, I didn't test lvmtsd yet. I simply assumed blktrace must have some performance impact.
I see you're committing to the github repo; anything I can help test?
Offline
Not currently. For now I've just cleaned up the old repo (a few things I forgot to commit before) and started rewriting it from scratch to make it more modular. The current code in git is just a prototype.
I'll post in this thread when there's a working version available for testing that doesn't require hacks to work.
sorry for my miserable english, it's my third language ; )
Offline
I'm going slightly off-topic here, but do you have any idea how complete the blktrace information is? You're essentially creating a pool of statistics that is interesting for other purposes, too. If all writes can be mapped precisely to their LV, a dirty-state bitmap could be used for incremental block-level backups, which is another gap on Linux and very useful for virtualization. I'd like to launch the daemon before any writes to the LVs happen, let it watch what blocks are being written to, take an LV snapshot, and transfer only the modified blocks off-site. I know that's like saying "while you're at it, write a backup solution, too", so I don't expect anything, but maybe you can incorporate this in the modularization you mentioned.
Offline
I'm going slightly off-topic here, but do you have any idea how complete the blktrace information is? You're essentially creating a pool of statistics that is interesting for other purposes, too. If all writes can be mapped precisely to their LV, a dirty-state bitmap could be used for incremental block-level backups, which is another gap on Linux and very useful for virtualization. I'd like to launch the daemon before any writes to the LVs happen, let it watch what blocks are being written to, take an LV snapshot, and transfer only the modified blocks off-site. I know that's like saying "while you're at it, write a backup solution, too", so I don't expect anything, but maybe you can incorporate this in the modularization you mentioned.
If the backup solution remembers/copies the block access times at the point of taking the backup, then yes, it's definitely possible (I will be keeping write and read statistics separately anyway). There's one "but" though: you can't ensure that blktrace is running at all times. Sure, if it has been up and running since the previous image was taken there shouldn't be a problem, but if the system was restarted you can be quite sure that the lvmtsd data is incomplete. Creating a SHA-1 checksum for every PE and comparing it with the past value would be more effective, IMHO.
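A rough sketch of the per-PE checksum idea (assuming the default 4MiB extents; the LV path is a placeholder and any partial tail extent is ignored):

# record a SHA-1 per 4MiB extent of the LV
extents=$(( $(blockdev --getsize64 /dev/vg0/lv0) / (4 * 1024 * 1024) ))
for i in $(seq 0 $(( extents - 1 ))); do
    echo "$i $(dd if=/dev/vg0/lv0 bs=4M skip=$i count=1 2>/dev/null | sha1sum | cut -d' ' -f1)"
done > pe-checksums.txt

Comparing two such lists shows which extents changed between backups without any tracing.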
sorry for my miserable english, it's my third language ; )
Offline
If the backup solution remembers/copies the block access times at the point of taking the backup, then yes, it's definitely possible (I will be keeping write and read statistics separately anyway). There's one "but" though: you can't ensure that blktrace is running at all times.
Yes, the overall setup must ensure blktrace is running any time a write may occur, of course. Possibly a trigger invalidating the whole bitmap when the LV activation time is newer than the last statistics timestamp.
I think a generation counter is required for a backup solution where blktrace runs without gaps. Like a SIGHUP to the statistics-gathering daemon to start a new blktrace, save the results from the previous one, kill the old blktrace and increment a counter by 1. A backup could then take an LVM snapshot, signal the stats daemon, copy according to the saved stats, and remove the snapshot.
but if the system was restarted then you can be quite sure that the lvmtsd data is incomplete.
If you want to back up the rootfs, that's true. But consider a virtual machine environment or otherwise userspace-managed LVs: it's easy to start a daemon before any writes happen.
Creating sha1 checksum for every PE and comparing it with past value will be more effective, IMHO.
Unfortunately that means reading all the data from disk again on each backup. Observing changes is far superior. I always assumed a new device-mapper target would be required to keep that information (and for a complete, rootfs-capable solution that's still the case, I think), but using blktrace is much easier to get done. Checksum-based sync is more of a fallback in case the bitmap is invalidated.
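The generation-counter scheme could look roughly like this in shell (purely a sketch; the device path and file names are made up):

#!/bin/bash
# keep btrace running and rotate its output into a new "generation" on SIGHUP
gen=0
btrace -a complete /dev/vg0/lv0 > trace.$gen & pid=$!
rotate() {
    kill "$pid"; wait "$pid" 2>/dev/null
    gen=$(( gen + 1 ))
    btrace -a complete /dev/vg0/lv0 > trace.$gen & pid=$!
}
trap rotate HUP
while kill -0 "$pid" 2>/dev/null; do wait "$pid"; done

A backup job would take the LVM snapshot, send SIGHUP to this script, and then copy the blocks listed in the just-closed trace file.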
I'm linking you to Matt Palmer, who just started a remote sync solution for LVM. He's exactly the guy who needs to interface with your statistics module :-)
Offline
don't forget that the outer tracks of a drive (near the start of the partition table) read at almost 10x the speed of the inner tracks.
Other than seek time, sequential reads would improve with this.
My question is: how do you keep your LVM volume from becoming extremely fragmented?
I am excited about the idea. I have not been able to find a way to optimize KVM storage without knowing something about the usage pattern.
This seems like the best idea I have seen so far.
Offline
Fragmentation isn't a problem: the chunks it relocates are 4MiB in size by default and can be made larger. Even if you need to wait a full revolution before reading a 4MiB extent, you'd still get about 80% of the sequential throughput (reading 4MiB at 110MiB/s takes about 35ms, rotational latency is about 8ms; all at the level of a consumer-grade 1TB 7200rpm HDD). And if you relocate to an SSD it's not a problem at all, while readahead still kicks in for the data left on rotating media.
Besides, the whole idea isn't to give higher throughput but a higher IOPS rating for the same storage.
Last edited by tomato (2011-11-06 12:44:40)
sorry for my miserable english, it's my third language ; )
Offline
I wonder how well tested fragmented LVM is. E.g. lvdisplay --maps will show every fragment - or explode trying :-)
Some ideas for a config file format (staying close to lvm.conf, as this might go mainstream with LVM one day):
lvmts {
    vg {
        name = "group0";
        threshold = 10; // default
        lv {
            name = "volume0";
            read_multiplier = 0.5;
            write_multiplier = 2.5;
        }
        lv {
            name = "volume1";
            read_multiplier = 1.0; // default
            write_multiplier = 1.0; // default
        }
        pv {
            uuid = "ABCD-1001";
            performance = 200;
            threshold = 10;
        }
        pv {
            uuid = "EFAD-1002";
            performance = 100; // default
        }
    }
}
This would put the physical volumes in relation to each other and add multipliers to the logical volumes.
volume0 is a database, so it will cache read data pretty well but choke on writes: multiply all write numbers collected from the stats daemon by 2.5.
pv 1001 is an SSD, so its performance is set to a higher value than 1002's.
Threshold: a higher value means less movement on the storage. 10% means a move candidate from HDD must score at least 11,000 IO points if the lowest score on the SSD is 10,000 IO points. This could be defined per volume group and/or per physical volume. Definition per physical volume would be great for multi-tier setups, for example: moving from RAID-6 to RAID-10 requires only a 10% score difference, moving from RAID-10 to SSD 30%.
Offline
flashcache (requires kernel module, untested on anything older than 2.6.32)
true but flashcache is written by facebook and some other companies are testing it out too, so I guess it's getting some traction.
https://github.com/facebook/flashcache
http://highscalability.com/blog/2012/1/ … cache.html
that said, this solution is still cool since it's userspace-only and (I guess) less invasive (it uses existing DM technology)
< Daenyth> and he works prolifically
4 8 15 16 23 42
Offline
Exactly, using a kernel module is not always possible.
sorry for my miserable english, it's my third language ; )
Offline