
#1 2004-02-17 23:42:36

dp
Member
From: Zürich, Switzerland
Registered: 2003-05-27
Posts: 3,378
Website

ideas for bandwidth optimization

here are some of my thoughts on this matter:

currently pacman does this (sketched in rough Python below):

1) read /etc/pacman.conf
2) read repositories
3) read the first "Server" in the repository list
    3b) fallback: next "Server" entry
4) try to download its $REPO.db.tar.gz
    4b) fallback: next "Server" entry
5) sync it with the local one
6) make a list of "to download" pkgs
7) try to download them from the same "Server" (-> 3)
8) install/update the downloaded pkgs
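
In rough Python (a sketch of the behaviour described above, not pacman's actual code, which is C; the server names are made up):

import urllib.request

REPO = "current"
SERVERS = ["ftp://mirror1.example/arch", "ftp://mirror2.example/arch"]

def fetch(path):
    # try every "Server" entry in list order; the first one that works wins
    for server in SERVERS:
        try:
            with urllib.request.urlopen(f"{server}/{path}") as resp:
                return resp.read()
        except OSError:
            continue  # fallback: next "Server" entry
    raise RuntimeError(f"no server could deliver {path}")

db = fetch(f"{REPO}.db.tar.gz")        # step 4: ~200 kB
# ... sync with the local db and build the "to download" list (steps 5-6),
# then step 7 calls fetch() once per pkg -- the expensive part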

some conclusions:

=> step 4) costs "only" about 200 kB of bandwidth

=> the huge bandwidth consumption is in step 7) (as pkgs are often much larger than 200 kB)

here are my optimization ideas:

A) Ping all "Server" entries and use the fastest responder instead of the first one in the list (like the Perl script does, only automatically, on each pacman run)
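
A sketch of A) (not pacman code; the hostnames are made up and the ping flags are the Linux ones):

import re, subprocess

def rtt_ms(host, timeout_s=1):
    # one ICMP round trip; returns the time in ms, or None if the host
    # doesn't answer within the timeout (-> that server gets excluded)
    try:
        out = subprocess.run(["ping", "-c", "1", "-W", str(timeout_s), host],
                             capture_output=True, text=True,
                             timeout=timeout_s + 1).stdout
        m = re.search(r"time=([\d.]+) ms", out)
        return float(m.group(1)) if m else None
    except subprocess.TimeoutExpired:
        return None

mirrors = ["ftp.archlinux.org", "mirror1.example", "mirror2.example"]
ranked = sorted((t, h) for h in mirrors if (t := rtt_ms(h)) is not None)
# try ranked[0] first, fall back to ranked[1], and so on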

B) Checksums (see the sketch below):
    - have checksums of all db.tar.gz files on the mirrors
    - compare the checksum of the db.tar.gz file on the server with the local one ... if they are the same, do not download it, use the local copy (a few kB saved each time, and pacman reacts faster)
    - when -Suy'ing, just download the checksums from archlinux.org and the mirrors, then download the db (if needed -> compare with the local checksum) from a random (or ping-optimized) mirror whose checksum matches archlinux.org's (as that means it has the same pkgs to offer)
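
A minimal sketch of B); it assumes each mirror publishes an .md5 file next to its db, which is an invented convention here (the URL and local path are made up too):

import hashlib, urllib.request

def local_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def remote_md5(url):
    # fetches e.g. ftp://mirror1.example/current/current.db.tar.gz.md5
    with urllib.request.urlopen(url + ".md5") as resp:
        return resp.read().split()[0].decode()

db_url = "ftp://mirror1.example/current/current.db.tar.gz"
if remote_md5(db_url) == local_md5("/var/lib/pacman/current.db.tar.gz"):
    print("db unchanged -> skip the download, use the local copy")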

C) FTP offers the possibility to receive only part of a file, e.g. from byte 213598 to byte 853952 ... originally this feature exists for resuming an interrupted download
-> why not receive the pkg in pieces from different servers and then glue them together? this would mean you can download the pkg simultaneously from different servers, each at the speed it allows: archlinux.org can stay at 20 kB/s, and if you have a faster connection, you receive the other parts of the pkg from other servers, so you use your full bandwidth and collect the data from all servers carrying this pkg (see the sketch below)
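
A sketch of C), written with HTTP range requests for brevity (FTP's REST verb gives the same "start at byte N" ability, see later in this thread); the URLs are made up and the servers are assumed to honor ranges:

import threading, urllib.request

MIRRORS = [  # the same pkg on several servers
    "http://archlinux.example/current/kernel26-2.6.3-1.pkg.tar.gz",
    "http://mirror2.example/current/kernel26-2.6.3-1.pkg.tar.gz",
]

def total_size(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return int(resp.headers["Content-Length"])

def fetch_piece(url, start, end, parts, i):
    # ask this particular server for bytes start..end only
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        parts[i] = resp.read()

size = total_size(MIRRORS[0])
step = -(-size // len(MIRRORS))   # ceiling division: one piece per mirror
parts = [b""] * len(MIRRORS)
threads = [threading.Thread(target=fetch_piece,
                            args=(url, i * step, min((i + 1) * step, size) - 1,
                                  parts, i))
           for i, url in enumerate(MIRRORS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open("kernel26-2.6.3-1.pkg.tar.gz", "wb") as f:
    f.write(b"".join(parts))      # glue the pieces back together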

=> ideally, A), B) and C) can be combined

what do you think?


The impossible missions are the only ones which succeed.

Offline

#2 2004-02-18 00:36:19

nifan
Member
Registered: 2003-04-10
Posts: 102

Re: ideas for bandwidth optimization

well, option A) is rather nonsense, since there's a difference between latency and actual bandwidth, so the fastest one to answer may not effectively be the fastest in terms of bandwidth.

as for B), that's a great idea, it should have been thought of before :D

C) would be very, very good, but it could lead to some corruption issues; to minimize them, C) combined with B) would be nice, i mean, checksum every file from whichever server it came from.

sorry if i don't make my point very well, but i'm f*ck*n tired ;)


______
"Ignorance, the root and the stem of every evil." - Plato

Offline

#3 2004-02-18 00:50:12

Xentac
Forum Fellow
From: Victoria, BC
Registered: 2003-01-17
Posts: 1,797
Website

Re: ideas for bandwidth optimization

Most of the mirrors aren't as up to date as the archlinux server. Right now we have no way of pushing packages to them, so they update every night (or every 6 hours or something). This means that one mirror's db probably isn't the same as another's.


I have discovered that all of man's unhappiness derives from only one source: not being able to sit quietly in a room
- Blaise Pascal

Offline

#4 2004-02-18 00:56:06

nifan
Member
Registered: 2003-04-10
Posts: 102

Re: ideas for bandwidth optimization

that's the point i wanted to make when i said that comparing checksums between the several mirrors and the main arch server would be necessary.


______
"Ignorance, the root and the stem of every evil." - Plato

Offline

#5 2004-02-18 01:04:35

andy
Member
From: Germany
Registered: 2002-10-11
Posts: 374

Re: ideas for bandwidth optimization

Xentac is right. I usually use a German mirror. I very rarely use the master, and at times the two databases were out of sync, with funky results ...

Here is my idea on that topic:
- Either:
the ordering should happen in the install script of pacman, i.e. the install script included in the pacman package can try to order the mirror list automatically.
- Or:
pacman itself can ping all mirrors every time it syncs the package database and select the fastest responder (and put this fastest responder at the top of the list, so that subsequent accesses are consistent). That even takes care of temporary network problems. I do realize that this takes coding effort ;-) !

Also: as nifan said, ping is not the same as transfer speed.

Offline

#6 2004-02-18 01:25:19

dp
Member
From: Zürich, Switzerland
Registered: 2003-05-27
Posts: 3,378
Website

Re: ideas for bandwidth optimization

nifan wrote:

well, option A) is rather nonsense, since there's a difference between latency and actual bandwidth, so the fastest one to answer may not effectively be the fastest in terms of bandwidth.

that was also my thinking, but this "nonsense" can exclude servers that e.g. do not respond within 200 ms ... then you won't even try to use a server that can't answer within 1/5 s, which is a really fast way of excluding

i'm aware that ping != speed, but if you have a very bad response time, your data travels over a lot of servers, or over overloaded/slow ones, which is not optimal for the connection ... and ping results come back fast if you set the timeout you want

nifan wrote:

as for B), that's a great idea, it should have been thought of before :D

C) would be very, very good, but it could lead to some corruption issues; to minimize them, C) combined with B) would be nice, i mean, checksum every file from whichever server it came from.

yea! B) with C) would be really great ... with something like this we can also manage to be the best and most popular distro around :-) -> see http://bbs.archlinux.org/viewtopic.php?t=2672


The impossible missions are the only ones which succeed.

Offline

#7 2004-02-18 01:43:30

dp
Member
From: Zürich, Switzerland
Registered: 2003-05-27
Posts: 3,378
Website

Re: ideas for bandwidth optimization

about FTP transfers of only one piece of a file, have a look at http://cr.yp.to/ftp/retr.html, the chapter "The REST verb"


The impossible missions are the only ones which succeed.

Offline

#8 2004-02-18 02:59:57

i3839
Member
Registered: 2004-02-04
Posts: 1,185

Re: ideas for bandwidth optimization

If option C) is implemented, then pacman can also resume previous download attempts, so that you don't need to redownload the whole file after losing the connection, which is always good.

In other posts there was talk about binary diffs; is there any new info about that? Has anyone tried it? Because if the diffs really are much smaller, then that is a much bigger bandwidth saver than a lot of other things.

Another way to really save bandwidth is a more distributed file-sharing approach, like letting clients upload as much as they download (that should theoretically fix all server bandwidth problems ;)). Probably there already exists something that does this (BitTorrent?), but the question is more if than how to implement the idea.

A possibility for solving the problem of keeping everyone's packages up to date all the time is using multicast to send them to everyone at once. When a new package is made, it is immediately multicast. Clients interested in the package join the multicast group and receive it. This would dramatically decrease the bandwidth usage of the server. Unfortunately there may be ISPs that don't support multicasting, and I'm also afraid that those cheap hardware routers a lot of people use don't support it either.

What dp proposed isn't bad, but it doesn't really solve the core of the problem.

Offline

#9 2004-02-18 05:56:08

Xentac
Forum Fellow
From: Victoria, BC
Registered: 2003-01-17
Posts: 1,797
Website

Re: ideas for bandwidth optimization

Pacman already resumes downloads...


I have discovered that all of man's unhappiness derives from only one source: not being able to sit quietly in a room
- Blaise Pascal

Offline

#10 2004-02-18 15:57:01

dp
Member
From: Zürich, Switzerland
Registered: 2003-05-27
Posts: 3,378
Website

Re: ideas for bandwidth optimization

i think C) needs to be explained more clearly; sorry for my english if i didn't manage it the first time

- it's not about continuing a broken download .. but it uses the same FTP command that is already used, and therefore downloads are already continued in pacman

C) is about this:

:: Retrieving packages from current...
kernel26-2.6.3-1:
archlinux.org           [XXXX##--XXXXXX] | 20kb/s
otherserver.some        [#---XXXXXXXXXX] | 30kb/s
other2.someelse         [XXXXXXXX###-XX] | 90kb/s
other3.someelse         [XXXXXXXXXXXX--] |  4kb/s

or just overall:

kernel26-2.6.3-1   [#---##--###---] | 144kb/s

so, overall, your download is doing what your connection can do, because you get parts of the pkg from different servers at the same time


The impossible missions are the only ones which succeed.

Offline

#11 2004-02-18 21:42:24

gumbo
Member
Registered: 2003-10-18
Posts: 24

Re: ideas for bandwidth optimization

i3839 wrote:

In other posts there was talk about binary diffs; is there any new info about that? Has anyone tried it? Because if the diffs really are much smaller, then that is a much bigger bandwidth saver than a lot of other things.

I have some other projects to work on, but in a week or so I might be able to write something that trades RAM usage for disk space (temporarily).

Offline

#12 2004-02-19 17:06:51

rayzorblayde
Member
Registered: 2004-01-04
Posts: 67

Re: ideas for bandwidth optimization

dp wrote:

i think C) needs to be explained more clearly; sorry for my english if i didn't manage it the first time

- it's not about continuing a broken download .. but it uses the same FTP command that is already used, and therefore downloads are already continued in pacman

C) is about this:

:: Retrieving packages from current...
kernel26-2.6.3-1:
archlinux.org           [XXXX##--XXXXXX] | 20kb/s
otherserver.some        [#---XXXXXXXXXX] | 30kb/s
other2.someelse         [XXXXXXXX###-XX] | 90kb/s
other3.someelse         [XXXXXXXXXXXX--] |  4kb/s

or just overall:

kernel26-2.6.3-1   [#---##--###---] | 144kb/s

so, overall, your download is doing what your connection can do, because you get parts of the pkg from different servers at the same time

so you basically want it to work like BitTorrent? I think that's a good idea myself. it just has to make sure that the checksums are the same on each server it is downloading from.


loading.... please wait....

Offline

#13 2004-02-19 21:16:20

nifan
Member
Registered: 2003-04-10
Posts: 102

Re: ideas for bandwidth optimization

i don't think that's what dp is trying to say; what he's trying to make us see is: for the same file, download different parts of that file from multiple servers at the same time and then concatenate them all back into the original file.
at least that's what i think dp was trying to explain the first time ;P

p.s. my english is bad as well dp, i think it's getting worse btw :|


______
"Ignorance, the root and the stem of every evil." - Plato

Offline

#14 2004-02-19 22:36:33

gumbo
Member
Registered: 2003-10-18
Posts: 24

Re: ideas for bandwidth optimization

This is what most download managers do. It's a relatively well-proven concept: all you have to do is issue another command that tells the server where to start sending bytes, and then close the connection when that section is finished.
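
For example (a sketch with a made-up host and path), with Python's ftplib you issue REST via the rest argument and simply close the data connection once the section is complete:

from ftplib import FTP

def fetch_ftp_slice(host, path, start, length):
    buf = bytearray()
    ftp = FTP(host)
    ftp.login()                  # anonymous
    ftp.voidcmd("TYPE I")        # binary mode
    # REST <start> is issued before RETR, so bytes arrive from 'start' on
    conn = ftp.transfercmd(f"RETR {path}", rest=start)
    while len(buf) < length:
        chunk = conn.recv(min(8192, length - len(buf)))
        if not chunk:
            break
        buf += chunk
    conn.close()                 # done with this section
    ftp.close()                  # drop the control connection too
    return bytes(buf)

piece = fetch_ftp_slice("ftp.archlinux.example", "current/pkg.tar.gz",
                        213598, 640354)   # bytes 213598..853951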

Offline

#15 2004-02-20 00:56:22

dp
Member
From: Zürich, Switzerland
Registered: 2003-05-27
Posts: 3,378
Website

Re: ideas for bandwidth optimization

nifan wrote:

i don't think that's what dp is trying to say; what he's trying to make us see is: for the same file, download different parts of that file from multiple servers at the same time and then concatenate them all back into the original file.
at least that's what i think dp was trying to explain the first time ;P

thanx very much ... that's written really clearly --- exactly


about "BitTorrent", which i didn't know about ... now i searched and found: http://bitconjurer.org/BitTorrent/introduction.html
the idea is great, but users often have much lower upload bandwidth, so it is not that useful for archlinux in my thinking --- nifan summarized my idea best (see the quote above)


The impossible missions are the only ones which succeed.

Offline

#16 2004-02-20 02:37:25

i3839
Member
Registered: 2004-02-04
Posts: 1,185

Re: ideas for bandwidth optimization

I was thinking about a daemon which does the uploading in the background. That way it doesn't matter that the upload is slow, because it will be spread out much more. To control the bandwidth usage, it should have options like how much bandwidth to use each second, each month, and in total (up/down ratio). Good defaults may be 2/3 of the upload speed and, in total, as much as you download. The goal is to reduce the bandwidth cost of the Arch servers dramatically. The daemon will connect to a central server so that clients can find each other. If there are not enough people available to get a decent download, the client will also download from the Arch server and/or mirrors. The more downloads there are, the less the servers will be loaded, because there are more clients which have the needed package. The advantage for the server is clear; the benefit for clients is that they can support Arch in an easy way, perhaps have faster downloads too if there are not enough mirrors, and at last do something useful with their upload.
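
The per-second cap could be a simple token bucket; a sketch (the numbers are only examples, and each chunk passed in is assumed to be at most one second's worth of bytes):

import time

class Throttle:
    def __init__(self, bytes_per_sec):
        self.rate = float(bytes_per_sec)
        self.tokens = 0.0
        self.last = time.monotonic()

    def wait_for(self, nbytes):
        # block until nbytes may be sent without exceeding the rate
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

throttle = Throttle(16 * 1024)   # e.g. 2/3 of a 24 kB/s uplink
# before each sock.send(chunk): throttle.wait_for(len(chunk))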

Of course it isn't worth making a scalable package distribution system if bandwidth is not really a problem, e.g. if enough people make a donation and there are enough mirrors, but it isn't too hard to make either. It will look a bit like BitTorrent (all distributed p2p systems do), only making such a system specifically for packages is much simpler and can be optimized more. Although it may not be needed right now, it could help to save a lot of bandwidth in the future. I'm willing to make it if there is enough demand for it.

Offline

#17 2004-02-20 09:17:47

Bobonov
Member
From: Roma - Italy
Registered: 2003-05-07
Posts: 295

Re: ideas for bandwidth optimization

So you are suggesting a kind of p2p arch....
The proposal sounds interesting, but I have some considerations:
1) I never, ever run any kind of p2p (even pacmanP2P) on any of my servers, for security reasons.
2) maybe I can run it on my home computer.
3) I do not want pacman doing anything automatically; too often there are packages with some conflict, or that require manual work to get working (like when php.ini changed position). If pacman upgrades automatically and something then stops working, it can be a pain in the ass to find out why.

I think that to solve the bandwidth problem it already helps that everybody switches away from the arch linux server as their primary server.
And everybody should try to look around for some free mirror hosting service.
Many of us are surely at university and can try to check there.

Offline

#18 2004-02-20 10:54:49

dp
Member
From: Zürich, Switzerland
Registered: 2003-05-27
Posts: 3,378
Website

Re: ideas for bandwidth optimization

i3839 wrote:

I was thinking about a daemon which does the uploading in the background. That way it doesn't matter that the upload is slow, because it will be spread out much more. To control the bandwidth usage, it should have options like how much bandwidth to use each second, each month, and in total (up/down ratio). Good defaults may be 2/3 of the upload speed and, in total, as much as you download. The goal is to reduce the bandwidth cost of the Arch servers dramatically. The daemon will connect to a central server so that clients can find each other. If there are not enough people available to get a decent download, the client will also download from the Arch server and/or mirrors. The more downloads there are, the less the servers will be loaded, because there are more clients which have the needed package. The advantage for the server is clear; the benefit for clients is that they can support Arch in an easy way, perhaps have faster downloads too if there are not enough mirrors, and at last do something useful with their upload.

Of course it isn't worth making a scalable package distribution system if bandwidth is not really a problem, e.g. if enough people make a donation and there are enough mirrors, but it isn't too hard to make either. It will look a bit like BitTorrent (all distributed p2p systems do), only making such a system specifically for packages is much simpler and can be optimized more. Although it may not be needed right now, it could help to save a lot of bandwidth in the future. I'm willing to make it if there is enough demand for it.

the idea is very interesting, but in my eyes these points must be solved first:

- consistency: all users who offer uploads must be kept up to date on a regular basis

- there is a minimum number of clients that must be running for something like this to work

- security: anyone can offer their own pkgs named the same as official ones (very dangerous - checksums <-> master checksums (from archlinux.org) may be a solution)

- not all users have the same pkgs installed ... if e.g. a pkg is not that popular, it cannot be found in this network of user computers

- updates will take long, because first a certain number of users have to mirror the repos (and on a regular basis, too)

// well, my idea was downloading from multiple servers in parallel --- the p2p idea is another good one (that has to solve some points to be a really great one -> if e.g. 100'000 people are using arch, something like this may really be cool once it is running)


The impossible missions are the only ones which succeed.

Offline

#19 2004-02-20 13:50:37

i3839
Member
Registered: 2004-02-04
Posts: 1,185

Re: ideas for bandwidth optimization

The daemon will only serve files to other clients; it won't download and/or install any packages. See it as a sort of FTP server, but more advanced. The goal is to give some bandwidth back, not to do any automatic package management on the local computer. Downloading from other clients will be done by a separate program (I was thinking about making the download part of pacman more modular, so that it's easy to use something other than FTP; that way the client part of this p2p system can easily be implemented as a module for pacman, as sketched below).
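
The modular download part could be a small interface that pacman calls without caring where the bytes come from; a sketch with invented names:

from abc import ABC, abstractmethod

class Fetcher(ABC):
    @abstractmethod
    def fetch(self, pkgname: str, dest: str) -> None:
        """Download pkgname to dest, raising on failure."""

class FtpFetcher(Fetcher):
    def fetch(self, pkgname, dest):
        ...   # plain FTP transfer from the configured "Server" list

class P2pFetcher(Fetcher):
    def fetch(self, pkgname, dest):
        ...   # ask the central server for peers, then pull pieces from them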

It is possible to make an auto-upgrade daemon, but that isn't the goal, and it would be a separate program anyway. Personally, if I were to make an auto-upgrade daemon, I would make it configurable as to which packages to auto-upgrade/install (with "security upgrades only" as a special option). I wouldn't use anything more automatic than auto-downloading new packages. Although auto-install may look nice, I wouldn't trust it either.

About what dp said:

Consistency: the idea is that most clients will run the daemon; forcing everyone to be up to date all the time, even when they normally wouldn't be, would only cause extra bandwidth usage. All you need is enough daemons which have the package.

Yes, there is a minimum number of users needed, although I don't know how many. I think it mainly depends on the ratio of users to the number of packages.

The system will be made with security in mind; making sure that a package is as it should be is one of the easiest things to do, thanks to checksums. Harder is making sure that no form of abuse is possible (e.g. teasing clients by sending bad data).

As the goal is to give bandwidth back, and not to have a totally fair system, it doesn't matter that people download from you while not running a daemon themselves. Downloading from daemons helps to reduce the load on the Arch servers, so it's not like they're leeching.

If a package isn't that popular, then it also won't cause much bandwidth usage for the Arch servers, so downloading such packages directly from the Arch servers isn't bad.

Users don't have to mirror whole repos; they only serve the packages they have. This system isn't meant to be totally self-contained: updating won't take long, because the first people downloading an updated package get it from the Arch servers. It can happen that you download half the package from other clients and the other half from an Arch server. But the more clients have downloaded the new package, the better this system will work. So after a new package is released, the server will get fewer and fewer hits for that package. Basically, the more the server would normally be loaded, the better this p2p system works, thus avoiding bandwidth usage when it's needed the most.

The goal is to reduce bandwidth usage for the Arch servers, so it doesn't matter if people sometimes need to download packages from the Arch server, as long as the packages that cause 90% of all bandwidth usage are downloaded from other clients.

Downloading from multiple servers in parallel is a nice idea, but it's nice for the client; it will still cause the same amount of bandwidth usage for the Arch servers. The p2p idea is nice for the servers, but for the individual client it doesn't matter much, so it should be as easy as possible to use. The hardest part is managing all the clients so that they can find each other.

It would be nice to have some statistics about how many Arch users there are, and about packages (how often they are downloaded, in what timeframe, etc.).

Offline
