NEW - read algorithm for raid1

rspadim · 2011-02-02 19:21:44

hi guys, i have a new idea for faster read on raid1
i was looking raid1 (mdadm) source code, and i found that it´s based on closest head
since closest head != faster disk acess
could we implement a new read balance algorithm?

my idea...
a queue for devices (a list of i/o not completed "/sys/block/mdxxx/raid_io_list")
some numbers for calculations (parameters that can be changed via "/sys/block/mdxxx" filesystem, or via "/etc/mdadm_times" file)
a calculation algorithm (inside raid array)
some modifications on real (device) read/write functions before send read/write command to device

calculation algorithm:

for each real device (disk) read/write:
calculate the time to execute that i/o (read or write), put the time to execute, and start time on raid_io_list, after execute, remove it .
on raid startup, resync, or disk fail, remove all information for failed or started device
if list get full sum all i/o and make a single big i/o (maybe always? that´s the time to think about it)

a example of /sys/block/mdxxx/raid_io_list
/dev/sda1:2010-11-11 01:01:01.00000001:12
started at 2010-11-11 01:01:01.00000001, will cost 12 seconds (calculated) to end

on /etc/mdadm_times we will have some times per device, something like this:
/dev/sda1:10:103000000:10000000

10 = time to head move (per sector, per byte or per cilinder, this for a rotational device (hard disk), for ssd is time to acces information maybe <0.1ms not based on current head position)
103000000 = sequential read bytes / second (try dd if=/dev/sda of=/dev/zero bs=4096, and get this information)
1000000 = sequential write bytes / second (try dd of=/dev/sda if=/dev/zero bs=4096, and get this information)

ok? now the algorithm
raid1 received a raid read comand:
check what mirror (devices) could prove the information
if number of devices=1, return the only device
if number of devices>1, do this:

1)calculate time to move head from current position to next position (use second parameter from /etc/mdadm_times)
2)calculate time to read total byte size (use third parameter from /etc/mdadm_times)
3)calculate total time to end io_list (use /sys/block/mdxxx/raid_io_list)
per device: example
/dev/sda1:2010-11-11 01:01:01.00000001:12

now = 2010-11-11 01:01:11.00000001
time to end io=
2010-11-11 01:01:11.00000001-(2010-11-11 01:01:01.00000001+12)=
2010-11-11 01:01:11.00000001-(2010-11-11 01:01:13.00000001)=
2 seconds to end io_list

sum 1+2+3, per device
select the device with minimal time

put that time calculated on raid_io_list

full raid_io_list? sum list and put a single i/o, start time=now, end time = sum of time to end

end.
can anyone help me? i know that we will have problem with disk scheduler (deadline, noop, cfq) but we can tweak with more time...

Last edited by rspadim (2011-02-02 19:45:42)

Dieter@be · 2011-02-02 21:02:07

rspadim wrote:

since closest head != faster disk acess

you sure? can you explain?
do you know when the mdadm scheduling happens? I never used softraid. I guess it comes after the normal io scheduler?
I guess one would use the noop scheduler then, in combination with your algorithm?

rspadim wrote:

some modifications on real (device) read/write functions before send read/write command to device

did you mean modifications on the scheduling/distribution of reads/writes? I don't get why you would change the actual read/write implementation.

rspadim · 2011-02-02 21:29:47

>rspadim wrote:
> since closest head != faster disk acess
>you sure? can you explain?
>do you know when the mdadm scheduling happens? I never used softraid. I guess it comes after the normal io scheduler?
>I guess one would use the noop scheduler then, in combination with your algorithm?

closest head is a mirror select algorithm, noop is a device i/o scheduler algorithm
round robin is a mirror select algorithm
if you have a 7000rpm disk and a 5000rpm disk on same raid1 array, with the have head position, what´s the fast?
>>>>>since closest head != faster disk acess
faster disk acess = min(time spent to read/write)
this algorithm optimize time
current (head closer) algorithm optimize device utilization, but not time if you use diferent disks (ssd+harddisk(with diferent speeds and diferent caches and diferente NCQ algorithms))

>rspadim wrote:
> some modifications on real (device) read/write functions before send read/write command to device
>did you mean modifications on the scheduling/distribution of reads/writes? I don't get why you would change the actual read/write implementation.

no no, just change raid1.c source code.....
for example:

write 'a' to device /dev/md0

raid1 receive this information and send:

write 'a' to device /dev/sda
write 'a' to device /dev/sdb
write 'a' to device /dev/sdc

at this time, we need to update 'raid_io_list' information with the new i/o being executed at /dev/sda, /dev/sdb, /dev/sdc
i don´t know if this table exists, with the i/o list to be processed and the time need to end this i/o list (that´s a time information to select which device to use)

the real write function is the function at raid1 send write to each mirror (i didn´t found how raid1.c execute real device write, i don´t understand very well the raid1.c code yet) the real write function is the mdadm code not at device scheduler level (maybe could be implemented in future, but for the first version no)

how could we implement? there´s a patch for raid1 to use round robin, with a /sys/block/md0/some_file to select round_robin ou default, we could make it, round_robin,default,time_based

i will find the patch again and post here

Last edited by rspadim (2011-02-02 21:58:30)

rspadim · 2011-02-02 21:30:42

sorry pour english, it´s not closest head, it´s nearest head

rspadim · 2011-02-02 21:32:31

i never tested, but it´s a implementation, maybe the original coder use it
http://www.spinics.net/lists/raid/msg30003.html

he call nearest head as 'biased read balancing'

another example using time based....
think about write-mostly today implementation
what it do?

it accept writes, and just execute read if all others mirros fail
when use write-mostly?
1)a device with a very 'poor' read rate (1byte/second for example, just to make it write-mostly), but a very good write rate (200MB/s for example)
2)a device with a very poor head positioning speed (a problem with drive latency, that the linux user know about it, and don´t want that this drive slow down reads/writes)
3)a device with a busy i/o list (think about a NBD over a network with a tcp/ip stack very busy, or a device that can´t handle write and read very well)
4)a device that we want to be in sync but we want to preserve it to have a bigger time to fail
5)a test device
that´s when we use write-mostly right?
what´s write-mostly to time based code? put a very small read rate (1byte/second), easy not? since a small read rate will give very high time values, it will never (maybe a day) be used
all others parameters can continue with the real device parameters

what´s the devices parameters?
time to end i/o list
time to position the head
based on a point in disk to another point, time to move head (cilinder? sector? i don´t know how to measure it with the right unit)
for ssd this time is about 0.1ms, it´s the time to ssd controller know what ROM to send request, and time to request start the READ process of the selected ROM chip
time to write/read X bytes
without head position time, just the read/write time (sequential not random read, random = sequential + head position time)
others times that we want to add to device model (maybe network model should have more, maybe disk cache, and others caches, but we need a close to real device model of read/write times)

all parameters are saved in
"/etc/mdadm_times" file, or another file in sysfs filesystem
it´s a file for one computer (linux kernel), since mdadm only run on a single computer

Last edited by rspadim (2011-02-02 21:56:48)

rspadim · 2011-02-02 22:16:10

now check this information:
time based = nearest head !!!!!!!!
=======================
why? just because it is, check:

disk1=disk2=hard disk with 7000rpm sata2 same scheduler same i/o, same time to align head, same time to read data

disk 1: head position = 1
disk 2: head position = 1

read 10bytes from raid1

what disk we will use?
time to change head position: disk1=disk2
time to read data: disk1=disk2
time to end i/o list: no i/o disk1=disk

what disk you want to use? any one

=======================
now think about a 7000rpm and a 5000rpm, head align time is the same (i will put 0seconds, real world it´s about 10ms), bytes/second is faster on 7000(for 10 bytes=10seconds) and slower on 5000 (for 10 bytes = 20 seconds)

disk 1: head position = 1
disk 2: head position = 1
=======================
read 10bytes from raid1

what disk we will use?
time to change head position: disk1=disk2 (0s)
time to read data: disk1<disk2 (10s for disk1, 20s for disk2)
time to end i/o list: no i/o disk1=disk (0s)

what disk?
disk1: 0s + 10s + 0s = 10s
disk2: 0s + 20s + 0s = 20s
disk1, 10 seconds to execut, let´s use it
=======================
read more 10bytes from raid1
disk 1 didn´t start head move (how head position is changed? it´s a internal variable or hardware based?) since my example head moviment = 0 second it doesn´t matter but could matter on real world...

what disk we will use?
time to change head position: disk1=disk2 (0s)
time to read data: disk1<disk2 (10s for disk1, 20s for disk2)
time to end i/o list: no i/o disk1>disk2 (10s for disk1, 0s for disk2)

what disk to use?
disk1: 0s + 10s + 10s = 20s
disk2: 0s + 20s + 0s = 20s
any disk again, let´s use disk1? ok disk1...

=======================
read more 10bytes from raid1
disk 1 didn´t start head move yet (how head position is changed? it´s a internal variable or hardware based?) since my example head moviment = 0 second it doesn´t matter but could matter on real world...

what disk we will use?
time to change head position: disk1=disk2 (0 seconds)
time to read data: disk1<disk2 (10 seconds for disk1, 20 for disk2)
time to end i/o list: no i/o disk1>disk2 (20 seconds for disk1, 0 seconds for disk2)

what disk to use?
disk1: 0s + 10s + 20s = 30s
disk2: 0s + 20s + 0s = 20s
disk2 is better now!!!

if head position don´t know anything about i/o list time and head position time

that´s the point, we need: time to read, time to position head, time to end i/o list
with these information we know what´s the best disk!

Last edited by rspadim (2011-02-02 22:21:55)

JGC · 2011-02-02 23:52:08

AFAIK you can't know that information because the disk doesn't expose that information, you'll have to keep track of that on a higher level. At that higher level, you don't know if the disk firmware decided to do other optimizations (NCQ) or even tried to put the disk into sleep.

In your example there's something wrong btw:

now think about a 7000rpm and a 5000rpm, head align time is the same (i will put 0seconds, real world it´s about 10ms), bytes/second is faster on 7000(for 10 bytes=10seconds) and slower on 5000 (for 10 bytes = 20 seconds)

Now let's assume disk1 is a Hitachi 7K2000 7200RPM 5-platter 2TB drive and disk2 is a 2TB WD Green drive with 4 platters spinning at 5400RPM. You specify that "head alignment time" (which I think is access time) is the same and that sequential read is twice as fast as the 5400RPM drive. This is wrong, as the access time of a 5400RPM drive is way higher than that of a 7200RPM drive, and due to the increased data density on the 5400RPM drive, that drive is hardly slower in sequential reads.

rspadim · 2011-02-03 00:42:39

>AFAIK you can't know that information because the disk doesn't expose that information, you'll have to keep track of that on a higher level. At that higher level, you don't know if the disk firmware decided to do other optimizations (NCQ) or even tried to put the disk into sleep.
that's right, if you have information about NCQ and sleep or not (maybe current rpm of disk), you can make the model of your device more real

>In your example there's something wrong btw:
> now think about a 7000rpm and a 5000rpm, head align time is the same (i will put 0seconds, real world it´s about 10ms), bytes/second is faster on 7000(for 10 bytes=10seconds) and slower on 5000 (for 10 bytes = 20 seconds)

>Now let's assume disk1 is a Hitachi 7K2000 7200RPM 5-platter 2TB drive and disk2 is a 2TB WD Green drive with 4 platters spinning at 5400RPM. You specify that "head alignment time" (which I think is access time) is the same and that sequential read is twice as fast as the 5400RPM drive. This is wrong, as the access time of a 5400RPM drive is way higher than that of a 7200RPM drive, and due to the increased data density on the 5400RPM drive, that drive is hardly slower in sequential reads.

that's right again, but since it's your disk you can indentify the model and change align time variable, read time variable and many others variables to make a better and better model of you hard drive acess time

===========================================

let's make algorithm more explicit...
===========================================
1)time to change head position: disk1=disk2 (0s)
===========================================

this could be something like:

float time_to_align_head(string device,unsigned long long current_head_position,unsigned long long new_head_position);

function return number of seconds (maybe jitter or another time unit)
device = "/dev/sda"
current_head_position = a number? i don't know how it's used today, but it's being used since today algorithm is nearest head
new_head_position = a number, may be calculated? i think yes
we have /etc/mdadm_times to help us how to calculate this time

===========================================
2)time to read data: disk1<disk2 (10s for disk1, 20s for disk2)
===========================================

float time_to_read_write(string device, unsigned long long start_position, unsigned long long bytes_to_read_write, unsigned char read_write)

function return number of seconds (maybe jitter or another time unit)
device = "/dev/sda"
start_position = a number, maybe couldn't be used, but since we have a per device i/o list we can make this function more real world
bytes_to_read_write = total number of bytes to read/write in this i/o
read_write = 1=write, 0=read

===========================================
3)time to end i/o list: no i/o disk1>disk2 (10s for disk1, 0s for disk2)
===========================================

float time_to_end_io_list(string device)
function return number of seconds (maybe jitter or another time unit)
device = "/dev/sda"
this function sum all i/o list for device

check that it's not a fixed number!!! you can change per device numbers using /etc/mdadm_times

===========================================
now check that:
first: we must identify these numbers to make our model more real (get more close to real number) for example time_to_read_write, time_to_align_head, time_to_end_io_list, with a error of 1%

since it´s a file "/etc/mdadm_times" we can online update this values...
with a cron userland program we can measure numbers and update it again

a idea for (1)
1)time to change head position:

alloc all memory that we will use
a1=0;a2=0;a3=0;
for (i=0;i<1000;i++){
read first bit of device
wait bit be read
a1=current time
read last bit of device
wait bit be read
a2= current time
a3=a3 + (a2-a1)
}
a4= a3/1000 //// mean time to move head from start to end
total number of sectors / cilinders / pages (ssd) = X
time to move head= (sectors / a4) or (cilinders / a4) or (pages /a4)

that's the first value that we must have
with this:
1)
time_to_align_head(string device,unsigned long long current_head_position,unsigned long long new_head_position){
get values for device;
return (abs(current_head - new_head_position) * time to move head)
}

===========================================
2)time to read data: disk1<disk2 (10s for disk1, 20s for disk2)
===========================================

make some read...

a1=0;a2=0;a3=0;
for (i=0;i<1000;i++){
read first bit of device
wait bit be read
a1=current time
read X bytes
wait bytes be read
a2= current time
a3=a3 + (a2-a1)
}
a4= a3/1000 //// mean time to move head from start to end
bytes / second sequential = Xbytes / a4

with this we can do:

float time_to_read_write(string device, unsigned long long start_position, unsigned long long bytes_to_read_write, unsigned char read_write){
read_write = 0.... if write we must use write on benchmark..
start_position??? maybe in future a per disk position algorithm....
get values for device;
return( bytes_to_read_write * byte/second)
}

===========================================
3)time to end i/o list: no i/o disk1>disk2 (10s for disk1, 0s for disk2)
===========================================
return sum of all i/o time to end

it's a model... better then nearest head..... got the point? we can make benchmark, and update our model (per device), ok it's not near to real world, but more and more we can improve it, more and more we can optimize.
since variables could be in memory (bytes/second, head time/bytes/sector/another unit, i/o list) it's very fast... and low cpu use... just *, and + (function 1,2,3)
maybe a update to /etc/mdadm_time, could be update to kernel memory using a: cat /etc/mdadm_time > /proc/mdadm_update_times_from_string

got? it's better than any "non model based" algorithm....

today mirror select algorithm is based on: hard disk (rotational=1), same head align time, same read/write time, no i/o queue, check what's the minimal number between current head position and next head position (ok it's a simple model, but not optimized for real world... just if you have same disks model on array and with same date of production and same many things..)

using round robin: doesn't matter what type of disk, doesn't matter speed, doesn't matter i/o queue, just send commands in a sequential order starting from first to last device
(it's a simple model, but not optimized for real world... just for ssd that have a constant time to access and read/write)

today algorithm is not very well optimized... we can improve it... i'm right?

Last edited by rspadim (2011-02-03 00:49:19)

rspadim · 2011-02-03 00:54:38

>AFAIK you can't know that information because the disk doesn't expose that information, you'll have to keep track of that on a higher level. At that higher level, you don't know if the disk firmware decided to do other optimizations (NCQ) or even tried to put the disk into sleep.

we always can benchmark some values and assume that they are close to a model...
let's think about network?
a 1000mbit network doesn't work always at 1000mbits... but, for some time based information we think that's always 1000mbits (we adopt it as a real number), and use it

for example get a ssd revo-drive 2
ocz give some information:
block size=4kbyts
random read = 270MB/s
random write = 270MB/s
latency < 0.1ms

and some harddisk tell us:
block size=??? (maybe number of heads)
random read = ??? normaly sequencial reads... about 130MB/s for a good SAS, 30MB/s for a poor sata over USB
random write = ??? normaly sequencial writes... about 130MB/s for a good SAS, 30MB/s for a poor sata over USB
max latency = 10ms (time to head align)

these informations we can get with just a google at internet, we don't need benchmarks, but we could use...
got?

rspadim · 2011-02-03 00:59:30

READ this:
linux kernel implementation (bond module) for network cards... (network devices)

search for Adaptive transmit load balancing at:
http://en.wikipedia.org/wiki/Link_aggregation

Adaptive transmit load balancing
channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.
Adaptive load balancing
includes balance-tlb plus receive load balancing (rlb) for IPV4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server.

check iproute2 software:
http://lartc.org/howto/lartc.rpdb.multiple-links.html

there's more algorithms and howto for load balance of a dual internet link
i used it with a radio 1mbit/s and a adsl 256kbit/s network, i could use two networks at full speed each, i was just making download... just a little of upload (http header request)

download of a internet url = read of disks....

================================================================================================
we can have some device informations for hard disks and solid state disks...
for example:
hdparm -tT /dev/sda

there's many software that can help us today! we can optimize raid1 select mirror algorithm, believe me!

if you don't want it, just change:
echo "near head" > /sys/block/md0/raid_mirror_io_scheduler
or
echo "time based" > /sys/block/md0/raid_mirror_io_scheduler
or
echo "round robin" > /sys/block/md0/raid_mirror_io_scheduler
or
echo "raid0 stripe" > /sys/block/md0/raid_mirror_io_scheduler (!!!!!!!!!!!!!!!!!!!! this could use stripe read algorithm !!!!!!!!!!!!!!!!!!!)

Last edited by rspadim (2011-02-03 02:02:22)

rspadim · 2011-02-03 01:56:37

check that this question (algorithms for read balance) is old:
Chris Worley [ Fr, 16 Oktober 2009 21:07 ] [ ID #2019215 ]
at
http://www.issociate.de/board/post/4994 … mance.html

rspadim · 2011-02-06 00:48:01

a need some help to make sysfs work per mirror can anyone help?
files are kenerl 2.6.37
raid1.c and raid1.h is here:

http://www.spadim.com.br/raid1/

the problem:
i need to interface 3 variables to sysfs per mirror inside array:
head_distance_rate
read_sectors_rate
write_sectors_rate

files must be:
/sys/block/mdX/md/dev-XXX/head_distance_rate
/sys/block/mdX/md/dev-XXX/read_sectors_rate
/sys/block/mdX/md/dev-XXX/write_sectors_rate

thanks guys any help is welcome

Arch Linux

#1 2011-02-02 19:21:44

NEW - read algorithm for raid1

#2 2011-02-02 21:02:07

Re: NEW - read algorithm for raid1

#3 2011-02-02 21:29:47

Re: NEW - read algorithm for raid1

#4 2011-02-02 21:30:42

Re: NEW - read algorithm for raid1

#5 2011-02-02 21:32:31

Re: NEW - read algorithm for raid1

#6 2011-02-02 22:16:10

Re: NEW - read algorithm for raid1

#7 2011-02-02 23:52:08

Re: NEW - read algorithm for raid1

#8 2011-02-03 00:42:39

Re: NEW - read algorithm for raid1

#9 2011-02-03 00:54:38

Re: NEW - read algorithm for raid1

#10 2011-02-03 00:59:30

Re: NEW - read algorithm for raid1

#11 2011-02-03 01:56:37

Re: NEW - read algorithm for raid1

#12 2011-02-06 00:48:01

Re: NEW - read algorithm for raid1

Board footer