
#1 2009-12-20 00:46:44

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

pacproxy (or something that vaguely resembles an apt-proxy clone)

this may not be the right forum for this, but here is a little python app to proxy packages from a mirror, and (eventually) to automatically create repos from available packages in the local ABS tree.  i didn't like the suggested solution of network mounting /var/cache/pacman/pkg/, and i wanted my ABS-built packages to autoupdate without running repo-add manually all the time... and i detest cron jobs.

as stated, this started as a project to automatically create repos from any available binary packages existing within the ABS tree; i haven't quite finished that yet.  i am using the proxy part for 4 arch machines in my home and it seems to be doing pretty well.  when the ABS stuff is done it will behave like this:

/var/abs/<repo_name>/..../..../{pkg/,src/}

where any packages in directory <repo_name> will be advertised as being part of a repo with the same name (the <repo_name>.db.tar.gz file will be dynamically created and cached; this is the part i'm not done with).  it won't matter how deep the pkg file is, and the architecture will be automatically accounted for by reading the .PKGINFO file.
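the arch detection can be as simple as reading one key out of that file, something like this (untested sketch; pkginfo_arch is a hypothetical helper, not in the script yet; .PKGINFO is a plain "key = value" text file inside every package):

```python
# hypothetical helper: pull the architecture out of a .PKGINFO blob
def pkginfo_arch(pkginfo_text):
    for line in pkginfo_text.splitlines():
        key, sep, value = line.partition('=')
        if sep and key.strip() == 'arch':
            return value.strip()
    return None

pkginfo_arch('pkgname = glib2\npkgver = 2.22.3-1\narch = i686\n')
# -> 'i686'
```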

right now though, proxying from ONE mirror (i will probably add support for a mirror list like in pacman.d, but i'm not sure how this is handled exactly; any info on that would be great) seems to work pretty well, and it will proxy both architectures.  as is, it will store packages and a small "cache" file in .pacproxy/<repo_name>/<arch>/.  the cache file has the same name as the package, and simply holds the ETag or Last-Modified header from when the package was pulled from the mirror.  every time a file is requested, a HEAD request is sent to the mirror using that information; if a 304 (not modified) is returned, the cached copy is used, else the new copy is pulled and the cache file updated.  looks something like this:

[cr@extOFme-d0 ~]$ tree .pacproxy
.pacproxy                        
|-- community                    
|   |-- i686                     
|   |   |-- community.db.tar.gz  
|   |   `-- community.db.tar.gz.cache
|   `-- x86_64
|       |-- community.db.tar.gz
|       `-- community.db.tar.gz.cache
|-- core
|   |-- i686
|   |   |-- core.db.tar.gz
|   |   |-- core.db.tar.gz.cache
|   |   |-- coreutils-8.2-1-i686.pkg.tar.gz
|   |   |-- coreutils-8.2-1-i686.pkg.tar.gz.cache
|   |   |-- filesystem-2009.11-1-any.pkg.tar.gz
|   |   |-- filesystem-2009.11-1-any.pkg.tar.gz.cache
|   |   |-- glib2-2.22.3-1-i686.pkg.tar.gz
|   |   `-- glib2-2.22.3-1-i686.pkg.tar.gz.cache
|   `-- x86_64
|       |-- core.db.tar.gz
|       `-- core.db.tar.gz.cache
`-- extra
    |-- i686
    |   |-- boost-1.41.0-2-i686.pkg.tar.gz
    |   |-- boost-1.41.0-2-i686.pkg.tar.gz.cache
    |   |-- extra.db.tar.gz
    |   |-- extra.db.tar.gz.cache
    |   |-- xdg-utils-1.0.2.20091216-1-any.pkg.tar.gz
    |   |-- xdg-utils-1.0.2.20091216-1-any.pkg.tar.gz.cache
    |   |-- xf86-input-synaptics-1.2.1-1-i686.pkg.tar.gz
    |   |-- xf86-input-synaptics-1.2.1-1-i686.pkg.tar.gz.cache
    |   |-- xulrunner-1.9.1.6-1-i686.pkg.tar.gz
    |   `-- xulrunner-1.9.1.6-1-i686.pkg.tar.gz.cache
    `-- x86_64
        |-- extra.db.tar.gz
        `-- extra.db.tar.gz.cache
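the contents of a .cache file boil down to one choice: prefer the ETag, fall back to Last-Modified, else cache nothing (make_cache_headers is just an illustrative name for what the script does inline):

```python
# sketch of the conditional-header selection pacproxy does when it
# writes a .cache file next to a downloaded package
def make_cache_headers(etag, last_mod):
    if etag is not None:
        return {'If-None-Match': etag}
    if last_mod is not None:
        return {'If-Modified-Since': last_mod}
    return {}
```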

i am still relatively new to the python scene, and i know there are several optimizations and probably a lot of refactoring that will happen before i'm satisfied with it, but i think it is useful enough at this point to release to everyone here.  any ideas are very welcome, and once i finally get my server back to the datacenter, i'll host this (in git) on extof.me along with some other goodies TBA at a later date :).  see POSSIBLE CAVEATS and DEVELOPMENT for ideas as to where i'm going and some issues that are definitely present right now.

DEPENDENCIES

$ pacman -S cherrypy

HOW TO USE

...point pacman.conf to it (port 8080 by default)...
Server = http://localhost:8080/archlinux/$repo/os/x86_64
...edit pacproxy.py and change "mirrors" to an appropriate one for you (use $arch variable!)...
mirrors = {'mirrors.gigenet.com': '/archlinux/$repo/os/$arch'}
$ python pacproxy.py
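for reference, the $repo/$arch placeholders are plain string substitution, the same replace() calls the script does when building the remote URL (expand_mirror is an illustrative name, not part of the script):

```python
# expand pacman-style $repo/$arch variables in a mirror path template
def expand_mirror(template, repo, arch):
    return template.replace('$repo', repo).replace('$arch', arch)

expand_mirror('/archlinux/$repo/os/$arch', 'core', 'x86_64')
# -> '/archlinux/core/os/x86_64'
```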

POSSIBLE CAVEATS

1) multiple requests from multiple machines at the same time will probably cause some problems right now, as the cache's state will be inconsistent.  i *think* concurrent tools like powerpill will still work correctly from the same machine, since it's not pulling the same packages twice, and cherrypy will handle the threading.
2) i'm pretty sure the caching stuff is working correctly, but i'm not sure if pacman is realizing that its local copy (specifically the db.tar.gz files) is up to date.  i may need to send some additional headers.
3) if it's not obvious, this only proxies http requests, not ftp.
4) there is no cache cleaning in .pacproxy; as files become out of date, they will just stay there and take up space.  not a huge deal, i'm just not sure how to address this; maybe remove them after X days of not being accessed.
5) there are some security issues with eval'ing the .cache file (it's just a dict); i should maybe do that differently.
6) i'm sure there are many other problems and security flaws; i'll list/remove them as they show up/are fixed.
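re: number 5, a safer drop-in would be ast.literal_eval, which parses literals without executing anything (sketch; load_cache is a hypothetical name, the .cache format stays the same repr()'d dict):

```python
import ast

# hypothetical safer loader for the .cache files: literal_eval only
# accepts Python literals (dicts, strings, numbers), so a tampered
# .cache file cannot execute code the way eval() could
def load_cache(text):
    return ast.literal_eval(text)
```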

DEVELOPMENT

1) anyone looking to mess with this (please do!): you can use cherrypy.log(msg) to send stuff to the log file (stdout unless you've changed it).
2) fix some concurrency issues by stalling one thread's download of a pkg until another thread has finished writing the pkg to the cache (hopefully lockfiles + timeouts will take care of this).
3) finish the ABS autobuilder, and maybe look into pulling in packages from another machine's ABS tree using ssh/paramiko or similar (that would be cool).
4) make the code cleaner and avoid some of the duplication; move to multiple files (cherrypy supports automatically monitoring modules for changes and then reloading them), otherwise changes to the main file cause a reload of the entire server, which could interrupt downloads.  plus right now it's pretty much one huge function.
5) python/cherrypy isn't the most efficient way to deliver a large file; maybe there is a way to use python for the logic and a better server to actually read out the file to the client?
6) we probably don't need to ping the server for updates to pkg files... the name of the file *should* change when the file is updated.  this was mainly for ABS-derived packages.
7) shower me with ideas!

and now some code!

PACPROXY.PY

import os, fnmatch, httplib, cherrypy

# we use this to know what to skip in local_dbs
abs_standard = ['core','extra','community','community-testing']

# what dbs should we proxy, and which are locally derived from custom ABS directories
# /var/abs/[db_name]/ are local dbs                                                  
proxy_dbs = abs_standard                                                             
local_dbs = [p for p in os.listdir('/var/abs') if os.path.isdir('/var/abs/' + p) and p not in abs_standard]

mirrors = {'mirrors.gigenet.com': '/archlinux/$repo/os/$arch'}
valid_arch = ['i686', 'x86_64', 'any']                        

# we'll put stuff here
cache_root = os.getenv('HOME') + '/.pacproxy'

def locate_pkgs(pattern):
    top_exclude_dirs = ['core','extra','community','community-testing']
    rel_exclude_dirs = ['pkg','src']                                   
    for path, dirs, files in os.walk('/var/abs'):                      
        # no reason to look thru folders provided by ABS               
        if path=='/var/abs':                                           
            for d in [dir for dir in dirs if dir in top_exclude_dirs]: dirs.remove(d)
        # or folders created by makepkg                                              
        for d in [dir for dir in dirs if dir in rel_exclude_dirs]: dirs.remove(d)    
        for filename in fnmatch.filter(files, pattern):                              
            yield os.path.join(path, filename)                                       

def gen_proxy(remote_fd, local_fd=None):
    def read_chunks(fd, chunk=1024):    
        while True:                     
            bytes = fd.read(chunk)      
            if not bytes: break         
            yield bytes                 
    for bytes in read_chunks(remote_fd):
        if local_fd is not None:        
            local_fd.write(bytes)       
        yield bytes                     

def serve_repository(repo, future, arch, target):
    # couple sanity checks                       
    if arch not in valid_arch or repo not in proxy_dbs + local_dbs:
        raise cherrypy.HTTPError(404)                              
    is_db = fnmatch.fnmatch(target, repo + '.db.tar.gz')           
    is_pkg = fnmatch.fnmatch(target, '*.pkg.tar.gz')               
    is_proxy = repo in proxy_dbs                                   
    is_local = repo in local_dbs                                   
    if not any((is_db, is_pkg)) or not any((is_proxy, is_local)):  
        raise cherrypy.HTTPError(404)                              
    active_mirror = mirrors.iterkeys().next()                      
    remote_target = '/'.join([mirrors[active_mirror].replace('$repo', repo).replace('$arch', arch), target])
    remote_file = 'http://' + '/'.join([active_mirror, remote_target])                                      
    local_file = '/'.join([cache_root, repo, arch, target])                                                 
    cache_dir = os.path.dirname(local_file)                                                                 
    if not os.path.exists(cache_dir): os.makedirs(cache_dir)                                                
    # find out if there is a cached copy, and if its still good                                             
    if is_proxy:                                                                                            
        if os.path.exists(local_file) and os.path.exists(local_file + '.cache'):                            
            cache = eval(open(local_file + '.cache').read())                                                
            req = httplib.HTTPConnection(active_mirror)                                                     
            req.request('HEAD', remote_target, headers=cache)                                               
            res = req.getresponse()                                                                         
            if res.status==304:                                                                             
                remote_fd = open(local_file, 'rb')                                                          
                local_fd = None                                                                             
            elif res.status==200:                                                                           
                map(os.unlink, [local_file, local_file + '.cache'])                                         
                etag = res.getheader('etag')                                                                
                last_mod = res.getheader('last-modified')
                cache_dict = {}
                if etag is not None:
                    # try etag first
                    cache_dict['If-None-Match'] = etag
                elif last_mod is not None:
                    cache_dict['If-Modified-Since'] = last_mod
                if len(cache_dict)>0:
                    cache_fd = open(local_file + '.cache', 'wb')
                    cache_fd.write(repr(cache_dict))
                    cache_fd.close()
                req2 = httplib.HTTPConnection(active_mirror)
                req2.request('GET', remote_target)
                remote_fd = req2.getresponse()
                local_fd = open(local_file, 'wb')
            else:
                raise cherrypy.HTTPError(res.status)
        else:
            if os.path.exists(local_file): os.unlink(local_file)
            if os.path.exists(local_file + '.cache'): os.unlink(local_file + '.cache')
            req = httplib.HTTPConnection(active_mirror)
            req.request('GET', remote_target)
            remote_fd = req.getresponse()
            if remote_fd.status!=200:
                raise cherrypy.HTTPError(remote_fd.status)
            local_fd = open(local_file, 'wb')
            etag = remote_fd.getheader('etag')
            last_mod = remote_fd.getheader('last-modified')
            cache_dict = {}
            if etag is not None:
                # try etag first
                cache_dict['If-None-Match'] = etag
            elif last_mod is not None:
                cache_dict['If-Modified-Since'] = last_mod
            if len(cache_dict)>0:
                cache_fd = open(local_file + '.cache', 'wb')
                cache_fd.write(repr(cache_dict))
                cache_fd.close()
    cherrypy.response.headers['Content-Type'] = 'application/octet-stream'
    if repo in proxy_dbs:
        return gen_proxy(remote_fd, local_fd)
    if repo in local_dbs:
        pass
    # nothing seems valid? throw a 404
    raise cherrypy.HTTPError(404)
serve_repository.exposed = True

conf = {'server.socket_host': '0.0.0.0',
        'server.socket_port': 8080,
        'request.show_tracebacks': False}

cherrypy.config.update(conf)
cherrypy.quickstart(serve_repository,'/archlinux')

Last edited by extofme (2009-12-20 02:57:41)


what am i but an extension of you?

Offline

#2 2010-01-23 16:58:52

erick.red
Member
From: Miami, FL, USA
Registered: 2007-06-16
Posts: 41
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

Again??? So you're using this script so that when one of your machines requests a package, it gets cached; then if another machine requests the same package, there's no need to go online to get it, because it's already on the machine you designated as the 'server'?

Is that what this is for?


Carpe diem, quam minimum credula postero

Offline

#3 2010-01-23 20:17:07

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

Yes my goal is to eventually make this smart enough to point all my machines at it permanently.  I want it to automatically manage packages from my ABS tree into a usable repository.  Also, I intend to use something like this on my server with LXC containers so all containers can look to the proxy for packages (which will have custom builds for some packages)

I may want it to provide a different package set depending on who's asking (by IP address).

I may want the machine to think there are more packages in core/etc. than there really are.

I started the script so I could build custom arch live CDs with larch and do tricky repository-related things, although I haven't worked on this much lately due to other projects that I will be detailing soon regarding LXC and BTRFS.  I will come back to this soon, but it may get incorporated into a more ambitious project that I've begun working on... a distributed, peer-to-peer AUR/pacman replacement.

Interesting things to come...


what am i but an extension of you?

Offline

#4 2010-01-23 21:11:09

tomk
Forum Fellow
From: Ireland
Registered: 2004-07-21
Posts: 9,839

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

Keep up the good work. smile

Moved to Community Contributions.

Offline

#5 2010-05-19 20:53:38

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

hi,

this sounds lovely!
heres my quick test

host (where pacproxy runs) 192.168.0.35
client (vbox bridged) 192.168.0.20

- copy and pasted the code to pacproxy.py on the host
- modified pacproxy.py to have

mirrors = {'ftp.hosteurope.de' : '/mirror/ftp.archlinux.org/$repo/os/$arch'}

(hope that's right, it's not the same convention as your example)
- boot up arch iso (2010.04.05) in vbox client
- in arch setup i choose source > net > mirror > custom:

http://192.168.0.35:8080/archlinux/$repo/os/i686

(it doesn't complain after this, so i assume it's okay or doesn't care)
- make partitions etc...
- selecting packages works
- and installing packages shows many errors here and there (failed retrieving file...) and after ~10 secs it stays at downloading mailx...
- and after some time in pacproxy output i get 192.168.0.20 - - [19/May/2010:22:19:31] "GET /archlinux/core/os/i686/perl-5.10.1-5-i686.pkg.tar.gz HTTP/1.1" 200 15405894 "" "pacman/3.3.3 (Linux i686) libalpm/4.0.3" over and over again...
- but it did get some packages to ~/.pacproxy/core/i686/ on the host (and according to the install log, on the client too)

please tell me if my configuration is ok and what debug info would help you ? or any other ways i could help

cheers
.andre

Offline

#6 2010-05-19 21:01:16

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

well your config looks ok; the $repo/$arch stuff should just look the same as it does in your pacman.conf or mirrors.conf file; it will vary for each mirror.

yeah, i was getting the timeout problem too... i'm not sure what the actual problem is there.  i should have used the built-in http server instead of cherrypy for starters... in my case i just kept trying until it got all the packages it needed.

i do intend to make this script pretty nice eventually, but there really isn't any debug/etc. at the moment.


what am i but an extension of you?

Offline

#7 2010-05-19 21:05:11

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

i _believe_ the timeouts are happening due to too much time between caching and sending to the client.  what i don't understand is WHY... there is supposed to be zero time, as the cache file is written simultaneously with delivery to the client:

def gen_proxy(remote_fd, local_fd=None):
    def read_chunks(fd, chunk=1024):    
        while True:                     
            bytes = fd.read(chunk)      
            if not bytes: break         
            yield bytes                 
    for bytes in read_chunks(remote_fd):
        if local_fd is not None:        
            local_fd.write(bytes)       
        yield bytes

for each MB the proxy receives from a real mirror, the MB is delivered to the client AND written to disk... there should be no delay.
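you can see the tee behavior with in-memory files (same gen_proxy as above, fed from BytesIO instead of a socket):

```python
import io

# same gen_proxy as posted, exercised with in-memory file objects
def gen_proxy(remote_fd, local_fd=None):
    def read_chunks(fd, chunk=1024):
        while True:
            data = fd.read(chunk)
            if not data:
                break
            yield data
    for data in read_chunks(remote_fd):
        if local_fd is not None:
            local_fd.write(data)
        yield data

remote = io.BytesIO(b'x' * 3000)
local = io.BytesIO()
sent = b''.join(gen_proxy(remote, local))
# sent and local.getvalue() are both the full 3000 bytes: the client
# and the cache file receive identical data, chunk by chunk
```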


what am i but an extension of you?

Offline

#8 2010-05-20 12:44:43

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

could the new .tar.xz be the issue ?
http://www.archlinux.org/news/490/
as i only see .tar.gz in the code, and it only downloads .tar.gz files...

i did try to change the two lines in pacproxy to use .tar.xz, but then i don't get the core.db.tar.gz and can't get to package selection in the installer...

and now looking at the code more, why is it reading /var/abs ?
(i mean, does it mess up here, as i only got a README and an empty local dir there)

Offline

#9 2010-05-20 12:47:06

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

oskude wrote:

and now looking at the code more, why is it reading /var/abs ?
(i mean, does it mess up here, as i only got a README and an empty local dir there)

forget that, if i read it right, the function is just defined, but not used

Offline

#10 2010-05-20 12:55:06

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

yeah, i should have read the code more, not just search/replace .gz to .xz smile

this seems to solve the errors in the installer, but i assume it gets a timeout on the second package, tzdata; it just hangs there.  or the third, as pacproxy repeats glibc over and over

    is_db = fnmatch.fnmatch(target, repo + '.db.tar.gz')
    is_pkg = fnmatch.fnmatch(target, '*.pkg.tar.xz')

Offline

#11 2010-05-20 14:04:06

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

well, i didn't get it past downloading the first two packages before hanging (with pacproxy looping on the glibc package), even on a subsequent restart (so that pacproxy would already have those cached), but i did find an aif profile for this smile

# aif -p automatic -c pacinst.aif

SOURCE=net
SYNC_URL=http://192.168.0.35:8080/archlinux/core/os/i686
HARDWARECLOCK=localtime
TIMEZONE=Europe/Berlin
# Do you want to have additional pacman repositories or packages available at runtime (during installation)?
# RUNTIME_REPOSITORIES = array like this ('name1' 'location of repo 1' ['name2' 'location of repo2',..])
RUNTIME_REPOSITORIES=
# space separated list
RUNTIME_PACKAGES=

# packages to install
TARGET_GROUPS=base       # all packages in this group will be installed (defaults to base if no group and no packages are specified)
TARGET_PACKAGES_EXCLUDE= # Exclude these packages if they are member of one of the groups in TARGET_GROUPS.  example: 'nano reiserfsprogs' (they are in base)
TARGET_PACKAGES=  # you can also specify separate packages to install (this is empty by default)

# you can optionally also override some functions...
#worker_intro () {
    #bug ? following gives: inform command not found
    #inform "Automatic procedure running the generic-install-on-sda example config.  THIS WILL ERASE AND OVERWRITE YOUR /DEV/SDA.  IF YOU DO NOT WANT THIS PRESS CTRL+C WITHIN 10 SECONDS"
    #sleep 10
#}

worker_configure_system () {
    prefill_configs
    sed -i 's/^HOSTNAME="myhost"/HOSTNAME="arch-generic-install"/' $var_TARGET_DIR/etc/rc.conf
}


# These variables are mandatory

GRUB_DEVICE=/dev/sda
PARTITIONS='/dev/sda *:ext3'
BLOCKDATA='/dev/sda1 raw no_label ext3;yes;/;target;no_opts;no_label;no_params'

Offline

#12 2010-05-20 14:11:20

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

extofme wrote:

there is supposed to be zero time, as the cache file is written simultaneously with delivery to the client

hmm, maybe we could try not to send anything to the installer before we have the whole file cached... but maybe there's a timeout in the installer that gets hit if it waits too long for a response to a package request, so this could fail too... just blindly thinking out loud, not actually a real clue smile

and in the same breath, maybe pacproxy doesn't mirror the response 1:1 for the client (like missing response headers or something) and the client doesn't know the filesize, or something...

ps. i hang on #archlinux 13:00 ~ 23:00 GMT+2 if you want more loud thinking wink

Last edited by oskude (2010-05-20 14:15:32)

Offline

#13 2010-05-20 20:56:35

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

ah yes the xz compression would cause a problem, the fnmatch() function could be something like:

is_pkg = fnmatch.fnmatch(target, '*.pkg.tar.[gx]z')

i believe it supports character classes.  as for waiting for the cache file to be downloaded... that was how i first had it, but MANY packages timed out since the client receives nothing for a long time (a kernel download could take forever before the client gets anything).  i think i just need to drop cherrypy and use some lower level httplib stuff; cherrypy supports streaming by using a generator, which is what i was trying to do.

it probably is missing the header for filesize (the proxy is acting as a reflector, but i'm only setting the content-type header); it _shouldn't_ really matter, but maybe it is.  bleh, i thought it would work out of the box for ya, sorry guy.
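e.g., a quick check (runnable as-is):

```python
import fnmatch

# fnmatch patterns do support character classes
assert fnmatch.fnmatch('glibc-2.11.1-1-i686.pkg.tar.xz', '*.pkg.tar.[gx]z')
assert fnmatch.fnmatch('coreutils-8.2-1-i686.pkg.tar.gz', '*.pkg.tar.[gx]z')
assert not fnmatch.fnmatch('core.db.tar.gz', '*.pkg.tar.[gx]z')
```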


what am i but an extension of you?

Offline

#14 2010-05-20 21:19:46

Daenyth
Forum Fellow
From: Boston, MA
Registered: 2008-02-24
Posts: 1,244

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

It needs to be '*.pkg.tar.*', as that's what's supported.

Offline

#15 2010-05-22 00:20:18

oskude
Member
From: Germany
Registered: 2010-05-19
Posts: 15
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

good news! big_smile
pacproxy.py works in AIF! cool
http://mailman.archlinux.org/pipermail/ … 13426.html

Offline

#16 2010-05-23 16:38:08

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

Daenyth wrote:

It needs to be '*.pkg.tar.*', as that's what's supported.

$ sudo pacman -U pyjamas-svn-2416-1-i686.pkg.tar.iswearthisisapackage.butareyousure.itsapackage 
Password: 
loading package data...
checking dependencies...
(1/1) checking for file conflicts                   [##########] 100%
(1/1) upgrading pyjamas-svn                     [##########] 100%

perhaps, because pacman is guessing; when it comes to globs and regexes... Beware The Asterisk.  pacman supports 2 compression schemes, [xg]z... not infinite, which is what "*" implies.  either works in this case, but i always code toward minimum uncertainty for fewer bugs :-)

i am going to adapt this script into a GAE repository reflector... i want to get enough people running GAE repo apps that official mirrors become obsolete.  we might be able to flurry download and sync the GAE apps for incredible performance.


what am i but an extension of you?

Offline

#17 2010-05-23 19:51:39

Daenyth
Forum Fellow
From: Boston, MA
Registered: 2008-02-24
Posts: 1,244

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

Pacman supports more than gz and xz. It supports any and all formats readable by libarchive. If you wanted to be pedantic you could go by that, but * is close enough.

Offline

#18 2010-05-24 07:08:52

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: pacproxy (or something that vaguely resembles an apt-proxy clone)

Daenyth wrote:

Pacman supports more than gz and xz. It supports any and all formats readable by libarchive. If you wanted to be pedantic you could go by that, but * is close enough.

i see... you have uncovered the truth of my ignorance!

i will keep that in mind, good info thanks


what am i but an extension of you?

Offline
