has anybody ever run any numbers to find the threshold number of users and average package usage to justify a local mirror?
we have about 7 desktop boxes using x and 6 servers.
for myself, i have installed 238 packages coming in at a svelte 428M in /var/cache/pacman/pkg after running pacman -Sc. that's probably about the average for the desktops. our servers average about 85 packages with the cache running about 100M.
the problem is that that's simply a snapshot; the actual volatility of those packages and the average size per time period is lost.
we're currently mirroring testing, current and extra.
at first we thought this was a great idea, until one of my co-workers mentioned that the folks running a debian mirror over at the local university found the benefit questionable. of course, debian is a _very_ large creature, and 3 architectures on unstable may have accounted for that.
for our part, the only wasteful things i can think we're mirroring is kde and openoffice.
what might actually help in decision making would be a package/size change list over certain time periods. like hourly or daily. anybody seen anything like this?
probably going to try ntop on this box, but thought i would solicit other experiences first.
any thoughts?
thanks,
jp
Offline
Why not mirror? Is there some downside for you as far as storage goes?
If for no other reason, it reduces the strain on the arch server, and other mirrors, so it is a good thing.
"Be conservative in what you send; be liberal in what you accept." -- Postel's Law
"tacos" -- Cactus' Law
"t̥͍͎̪̪͗a̴̻̩͈͚ͨc̠o̩̙͈ͫͅs͙͎̙͊ ͔͇̫̜t͎̳̀a̜̞̗ͩc̗͍͚o̲̯̿s̖̣̤̙͌ ̖̜̈ț̰̫͓ạ̪͖̳c̲͎͕̰̯̃̈o͉ͅs̪ͪ ̜̻̖̜͕" -- -̖͚̫̙̓-̺̠͇ͤ̃ ̜̪̜ͯZ͔̗̭̞ͪA̝͈̙͖̩L͉̠̺͓G̙̞̦͖O̳̗͍
Offline
the storage doesn't bother us at all, but there does appear to be a threshold where mirroring actually uses more bandwidth than not mirroring. this is the situation my debian 'friend' had :-).
for instance, say arch had about 50MB per day of updates across all packages, but only 3MB per system was generally of interest. with 15 systems, that's 45MB, so at that point the value of mirroring is negative.
this is just an example of where that threshold might be, i don't know yet and don't have any numbers, so i was curious if anyone else had looked at this problem.
-edit: btw, it certainly reduces our in-house bandwidth; the question is when it reduces arch's bandwidth.
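the back-of-the-envelope math above, as a quick shell sketch (all figures are the hypothetical ones from the example, not measurements):

```shell
# hypothetical figures from the example above -- not measured
DAILY_UPDATES_MB=50   # total size of one day's repo updates
PER_BOX_MB=3          # portion each system actually wants per day
BOXES=15

# without a mirror, each box only pulls what it wants from upstream
WITHOUT_MIRROR=$((PER_BOX_MB * BOXES))
echo "full mirror: ${DAILY_UPDATES_MB}MB/day  no mirror: ${WITHOUT_MIRROR}MB/day"

# mirroring only saves upstream bandwidth past roughly this many boxes
echo "break-even: about $((DAILY_UPDATES_MB / PER_BOX_MB)) boxes"
```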
Offline
hmm..i see.
well, it wouldn't be too hard to have rsync exclude the files you don't need. like (kde*, gnome*, etc.)
then you could have abs on a server, and have a script go through and remove the things that you do not have in your mirrored repo, then run gensync locally.
It would take a little while writing a script to automate it, but it shouldn't be too hard.
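A rough sketch of the exclude part, run against a throwaway local tree so it works anywhere; in practice the source would be the real rsync module (e.g. archlinux.org::current), and the package filenames below are made up:

```shell
SRC=$(mktemp -d)   # stand-in for the upstream rsync module
DEST=$(mktemp -d)  # stand-in for the local mirror dir
touch "$SRC/kdebase-3.4-1.pkg.tar.gz" \
      "$SRC/bash-3.0-1.pkg.tar.gz" \
      "$SRC/openoffice-base-2.0-1.pkg.tar.gz"

# mirror everything except the big package families
rsync -a --exclude='kde*' --exclude='openoffice*' "$SRC/" "$DEST/"

ls "$DEST"   # only bash-3.0-1.pkg.tar.gz makes it across
```

gensync would then be run against the trimmed tree, so the local db only advertises what is actually there.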
Offline
yeah, we'd talked about that, but with 8 or so different users, trying to restrict the packages is _very_ tricky. i know one user in particular uses gnome, but has from time to time used k3b, which has some kde deps that we probably wouldn't have brought down otherwise.
it would be significantly easier to simply keep the whole mirror rsynced, but we won't do it if in the end it is wasteful.
we could probably even do some rsync tricks to ignore certain files, if pacman knew enough to fail over to the next mirror when a declared package isn't present locally.
in that case, we could use rsync filters to skip uncommon things like openoffice or kde, and if someone did go for k3b, which would still be listed in the package database, pacman could see we didn't have it and try the next mirror.
actually, that might be a nice feature, i've had that situation once or twice when i've gone to a mirror and a file that was advertised was not there.
Offline
I understand your situation seems more office or work related, but I have found it invaluable to rsync the Arch repos. I could never justify the bandwidth use for myself alone, but I host lan parties and do work for people (repair, IT, and other such stuff), and I have converted many over to Arch, at least on a trial-run setup.
The trick is, I use DVD-RWs; this way I can put the repos on dvd and hand them out to the people I have converted or set up for trial runs. That way, if a problem with a package crops up, I know about it ahead of time. I wait till a fix has been implemented, and then once a month I fetch the DVD-RWs, re-burn the repos onto them, and redistribute the dvds. Imo, this saves much bandwidth since people are not constantly hitting the arch repo servers.
I also hold lan parties that are 100% Linux-compatible game-wise. During these events I bring my repos machine and help people set up Arch if they so desire. This greatly speeds up installs and keeps people off the arch repo servers; then I offer them the distributed dvd thing I do.
Like I said, I could never justify rsync'ing the repos for my own benefit, but for what I do, it greatly helps me and everyone else out I think.
For 7 desktops and 6 servers I could see it justified in an office/business setting, because it can greatly reduce your cost and speed up delivery overall. If you feel you are wasting bandwidth, there is the donate-to-Arch thingy here to help pay for their server costs.
I should prolly take my own advice on that last line.
I should also mention that I only do an rsync once every 2-4 weeks, as most things don't need to be updated asap unless it is a major security update or something big changed, like moving to udev or xorg. Haven't had any of those recently though.
Offline
unfortunately at least 2 of us :oops: are obsessive compulsive and update at least daily. sometimes more.
i'm going to have to look into some form of lazy mirroring, where requested/downloaded packages are the only ones kept in the mirror.
Offline
I didn't read the whole thread in detail, but this is how I'd do it:
Have one main box hosting the packages.
Let all other comps mount the main box's package dir with NFS on their /var/cache/pacman/pkg/ directory.
That way if someone does a sync it will be fetched from the cache, or if it isn't there it will be added to the cache and all Arch boxes in the network can use it.
EDIT:
You'd probably want the pacman databases to be on the main server too, letting everyone use those.
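A minimal sketch of that layout, assuming a server called mainbox and a 192.168.1.0/24 LAN (both placeholders); shown as comments since the real lines go in config files:

```shell
# on mainbox, export the cache read-write so clients can reuse *and* add
# packages -- /etc/exports:
#   /var/cache/pacman/pkg  192.168.1.0/24(rw,sync,no_subtree_check)
# then reload the export table:
#   exportfs -ra

# on each client, mount it over the local cache -- /etc/fstab:
#   mainbox:/var/cache/pacman/pkg  /var/cache/pacman/pkg  nfs  defaults  0  0
```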
Offline
I am doing something similar to what you want because I only have internet access to the university websites/servers. The referenced files are at http://www.astro.umontreal.ca/~belanger/repo/ if you want to check them.
Here are the steps:
1. On the university server, I run db.sh to get the new database.
2. On the local computer, I run pacman -Syup to get a list of URLs for the new packages.
3. I paste that list into a pkg file (on the university server).
4. On the university server, I run ul.sh to download the packages from the official arch mirrors.
5. On the local computer, I run pacman -Su to download the packages from the university server.
That way only the packages I need are downloaded from the official arch mirrors. I don't know how well that can be automated for several computers. I suppose each computer could submit its list of requested packages to the local mirror server, where they can be concatenated and duplicates removed.
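The merge step in that last paragraph could be as simple as this (local temp files stand in for the lists each computer would submit, and the URLs are made up):

```shell
SPOOL=$(mktemp -d)
# pretend two boxes each submitted the URL list from `pacman -Syup`
printf '%s\n' http://mirror/a.pkg.tar.gz http://mirror/b.pkg.tar.gz > "$SPOOL/box1"
printf '%s\n' http://mirror/b.pkg.tar.gz http://mirror/c.pkg.tar.gz > "$SPOOL/box2"

# concatenate and drop duplicates
sort -u "$SPOOL"/* > "$SPOOL/pkglist"
wc -l < "$SPOOL/pkglist"   # 3 unique URLs out of the 4 submitted
# ul.sh would then do roughly:  wget -nc -i "$SPOOL/pkglist"
```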
Offline
Snowman:
that's a pretty neat way to do it. but like you said, managing that for a couple of computers might be a bit much. i'll have to look at that a bit closer.
i3839:
that would be ideal, except we all use laptops (except for the servers, of course). the servers can just nfs-mount their caches from a single location and that would be perfect. for the laptops we'd have to find a way to make that mount accessible outside our firewall, and then that would work.
Offline
Personally, I like i3839's idea. As for making it accessible outside the firewall, I don't know how much of a security hazard that is. I googled a bit, so I'll just toss some links here:
http://www.coda.cs.cmu.edu/ -same idea as NFS, advertises encryption too
http://www.math.ualberta.ca/imaging/snfs/ -I don't know how easy this is to get going, but it sounds like a very secure idea
if you put up a firewall on the machine, the ideas in here can work: http://nfs.sourceforge.net/nfs-howto/security.html (section 6.4)
I think it's quite feasible. good luck
EDIT: actually, check out section 6.5 of the last link; it's got pretty good instructions on tunneling it through ssh, which I still think is the easiest and most secure solution.
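For reference, the tunneling trick from that section boils down to something like this (hostname and ports are placeholders, and the mount has to go over TCP for it to work); shown as comments since it needs root on both ends:

```shell
# forward a local port to nfsd's TCP port on the file server:
#   ssh -f -L 2049:localhost:2049 user@mainbox sleep 600
# then mount through the tunnel:
#   mount -t nfs -o tcp,port=2049 localhost:/var/cache/pacman/pkg /var/cache/pacman/pkg
```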
The suggestion box only accepts patches.
Offline
okay, so the feature i was looking for in pacman already exists, which i swear it didn't. but after testing a bit more, i see that i was wrong.
i thought i'd get this error whenever a package was missing, even with the mirror files set up properly. say i rename bind to xbind on our mirror and try to sync it, with my pacman.conf listing only our mirror directly:
[root@aragorn ~]# pacman -S bind
Targets: bind-9.3.1-1
Total Package Size: 1.9 MB
Proceed with upgrade? [Y/n] y
:: Retrieving packages from current...
failed downloading /archlinux.org/current/os/i686/bind-9.3.1-1.pkg.tar.gz from beethoven.3yd.com: HTTP/1.1 404 Not Found
error: failed to retrieve some files from current
but when i go back to simply using our mirror file, with bind still renamed, i get this:
[root@aragorn ~]# pacman -Syu
:: Synchronizing package databases...
testing [##############################################] 100% 30K 759.6K/s 00:00:00
current [##############################################] 100% 44K 1236K/s 00:00:00
extra [##############################################] 100% 195K 1146K/s 00:00:00
[root@aragorn ~]# pacman -S bind
Targets: bind-9.3.1-1
Total Package Size: 1.9 MB
Proceed with upgrade? [Y/n] y
:: Retrieving packages from current...
failed downloading /archlinux.org/current/os/i686/bind-9.3.1-1.pkg.tar.gz from beethoven.3yd.com: HTTP/1.1 404 Not Found
bind-9.3.1-1 [##### ] 11% 214K 76.6K/s 00:00:24
that 404 just meant the single mirror didn't have the file. with a full mirror list, pacman does appear to try the next mirror.
so it appears that my solution is simpler than i could have hoped.
i can run rsync filtering out packages (with an include filter generated from multi-machine package lists as above, thanks snowman), we set up our mirror files properly, and if someone wants something that's not mirrored locally, pacman just goes to the next mirror to get it.
thanks again everyone for the comments and good ideas. i'm still going to look into a few of the other ideas; the encrypted nfs mount is of particular interest at this point. i may still set up a noauto nfs mount of the repo cache for our laptops, so we can quickly mount the existing cache when here and just use our own when not.
jp
Offline
If I were you I'd want nonexistent packages to be downloaded onto the server when they aren't there yet, so others can use them too, thus downloading only the packages needed without keeping any lists.
I'd use the XferCommand option in pacman.conf, which calls a script that does two things:
First it executes a script on the server with ssh.
Then it uses scp to download the package from the server.
The script on the server checks if the package is in the cache, and if that isn't the case it downloads the package.
Depending on the server's connection speed, the downloading can happen in parallel after some tweaking of the scripts.
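A sketch of that two-step script; the server name (mirrorbox), the cache path, and the helper name are all made up. pacman.conf would point at it with something like XferCommand = /usr/local/bin/cachefetch %u %o; wrapped in a function here:

```shell
# client-side helper: pacman hands it the package URL and the output path
cachefetch() {
    local url=$1 out=$2
    local pkg=${url##*/}    # bare package filename

    # step 1: check-and-fetch on the server over ssh -- only download
    # from upstream if the shared cache doesn't have it yet
    ssh mirrorbox "test -f /var/cache/pacman/pkg/$pkg || \
                   wget -q -P /var/cache/pacman/pkg '$url'"

    # step 2: copy the package down from the server's cache
    scp "mirrorbox:/var/cache/pacman/pkg/$pkg" "$out"
}
```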
Offline
that's a good idea. one of our guys was going to do something like that, but from the server side instead, with apache. that would alleviate the need for anything more than using the mirror to keep things updated. for the moment, i've simply generated rsync include lists from the package lists of a couple of example users on a few systems.
but the hope is to have an apache plugin accept incoming requests and fire off a python script to retrieve missing files, download them to the system, and update the include list.
we'll keep posting any progress here.
Offline
basically our first step landed us this....
our mirror dir is /var/archlinux.org/
we have a cron script that wraps our sync script; the sync looks like this:
#!/bin/bash
MIRROR=/var/archlinux.org

go() {
    # turn each per-user package list into "name-*" include patterns,
    # then mirror only those packages; everything else under os/i686
    # is excluded
    grep -h . "$MIRROR"/includes/* | awk '{print $1"-*"}' | \
        rsync -avz --delete --delete-excluded --include-from=- \
            --exclude='os/i686/*.pkg.tar.gz' \
            archlinux.org::"$1" "$MIRROR/$2"
}

go "current" "current/"
go "extra" "extra/"
go "ftp/testing/" "testing/"
#go "ftp/unstable/" "unstable/"
the files in the includes subdir are basically generated by each interested user running:
pacman -Q | awk '{print $1}'
this creates our includes list of things to bring down and keep rsync'd.
this has brought our server space down from 2.8G to 850M. not that drive space was an issue, but that's 2G of wasted download on the server. and the problem of managing which files to bring down becomes a user problem.
we're hoping by the end of the week to have the update that lets our mirror proxy to the arch server to grab a missing file and add it to our repo and sync list. then it's zero maintenance and minimal bandwidth. if we simply relay to the arch-served package db files, we wouldn't even need to rsync; just let the first user that wants a package do the waiting.
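one way the proxy-on-miss step might look, sketched as an apache ErrorDocument handler rather than a real plugin (the upstream mirror, the paths, and the handler name are all made up; httpd.conf would get ErrorDocument 404 /cgi-bin/fetch-missing.sh):

```shell
# apache sets REDIRECT_URL to the path that 404'd; grab that file from a
# real upstream mirror, stash it in our tree, and remember it for rsync
fetch_missing() {
    local pkg=${REDIRECT_URL##*/}
    wget -q -P /var/archlinux.org/current/os/i686 \
        "http://upstream.example/current/os/i686/$pkg"
    # strip the version so the include list matches future releases too
    echo "${pkg%%-[0-9]*}" >> /var/archlinux.org/includes/auto
}
```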
Offline
I have 4 servers and a number of workstations. I use squid and frox proxies and tell pacman to use wget; wget is set up to go through the proxies. The proxies are configured to cache and keep large files (by default they don't).
That way, the first time the file is downloaded it is cached locally, and all other systems will get it locally. Eventually it will get replaced by newer files.
I toyed with the idea of mirroring the repos, but since I needed a web proxy anyway, I figured it would work better this way.
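The relevant squid.conf knobs look roughly like this (the sizes and times are guesses to tune, not recommendations):

```
# squid won't cache big objects by default -- raise the cap
maximum_object_size 131072 KB
# give the cache enough disk to hold a useful chunk of the repo
cache_dir ufs /var/spool/squid 4096 16 256
# keep fetched packages around; a released package never changes in place
refresh_pattern -i \.pkg\.tar\.gz$ 43200 100% 43200
```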
Offline
Wow, great progress. I am gonna try to get something similar set up for myself now. Lots of great tips in here to really streamline the downloading. For what I do I'll still need more packages than the other people will use, but this could really help me speed some things up.
Thanks as well for all the info from the poster on letting us know what is involved.
Offline