You are not logged in.

#1 2008-06-10 01:06:37

sabooky
Member
Registered: 2006-11-02
Posts: 89

pacman/xdelta/compression experiment and results (statistics!)

Basically.. I havn't done a pacman -Scc in forever.. So I figured I have enough packages to do a decent experiment. Hopefully, this data will be helpful to those working on the xdelta support, or those that are wondering about it, and other solutions...

My /var/cache/pacman/pkg is 8328.42MB (this is unmodified)
here are the results of my experiment:

default      /var/cache/pacman/pkg  = 8328.42MB
VERTI        /var/cache/pacman/pkg  = 6415.27MB 77% 23%
xdelta       /var/cache/pacman/pkg  = 5796.53MB 70% 30%
7z           /var/cache/pacman/pkg  = 5251.38MB 63% 37%
xdelta+gzip  /var/cache/pacman/pkg  = 3671.21MB 44% 56%
xdelta+7z    /var/cache/pacman/pkg  = 3113.89MB 37% 63%

default: is untouched repo.
Verti: is me using plan9port using vac with a verti server. This was a pain, it involved extracting each package to a tmp dir, and uploading the files to the verti server. Took ~3 hours (maybe a bit more), it only achieved 23% compression.
xdelta: This is using the xdelta -0 command, the reason I did -0 is because....well.. I was lazy..(made it easier to compress for other tests). I assume xdelta -9 will be close to gzip -9. uncompressed delta files outperformed verti (and took much less time). The xdeltas are chained togather.. keeping the most recent pkg as the reference, everything else is an incremental delta.
7z: This was me running gunzip on all the packages, and 7z the tarballs back up. This performed pretty well with 37% reduction. This has the advantage of being able to unzip/install right away, rather then follow a chain of delta patches.
xdelta+gzip: This is gzipping the above deltas (using the default.. which i think is -6) 56% reduction.. this is actually pretty good, 8328.42MB->3671.21MB is a pretty significant reduction.
xdelta+7z: This is me trying to see how far I can take this, and 63% reduction.. not bad.. not bad at all. (xdelta + 7z took about 1 1/2 hours)

The good thing about this test is it was completly random files from 2 years of using arch, so it was a completly random test, and wasn't inteded to favor any one app over the other. The ratio of unique pkg names/total pkgs is 1011/3402. So for every 1 .tar.gz there was 3-4 deltas.


Why xdelta and not other delta generators?
I toyed around with other delta software: bdelta, xdelta3 and bsdiff. I did some limited testing (firefox/kernel/openoffice) with them before deciding to use xdelta as my delta generator.
bdelta: Pretty good, but took almost 3x as long as xdelta. xdelta -0 + 7z compressed to smaller files.
xdelta3: Even though it's supposetly faster, and more efficient.. my tests showed otherwise. It was slightly slower then xdelta and a bit worse compression. *shrug*
bsdiff: bsdiff is.. well hands down the best compressing delta generator of all of them. It would beat 7zipped deltas. The only problem is it requires INSANE amounts of memory and time. I think the kernel was the biggest file I could handle. around 80MB (unzipped tars) maxed out my ram (2Gigs). Well I only gave it 1.6Gigs (virtual machine)... but yea.. the requirements was a bit insane. I couldn't do openoffice with this.

files snippet (full list is available here):

$ du -h firefox-2*
2.6M    firefox-2.0.0.10-1--2.0.0.9-1-i686.xdelta.gz
92K     firefox-2.0.0.10-3--2.0.0.10-1-i686.xdelta.gz
620K    firefox-2.0.0.11-1--2.0.0.10-3-i686.xdelta.gz
2.6M    firefox-2.0.0.12-1--2.0.0.11-1-i686.xdelta.gz
1008K   firefox-2.0.0.12-2--2.0.0.12-1-i686.xdelta.gz
5.7M    firefox-2.0.0.13-1--2.0.0.12-2-i686.xdelta.gz
1.3M    firefox-2.0.0.14-1--2.0.0.13-1-i686.xdelta.gz
9.5M    firefox-2.0.0.14-1-i686.pkg.tar.gz
1.2M    firefox-2.0.0.3-1--2.0.0.2-1-i686.xdelta.gz
1.2M    firefox-2.0.0.3-2--2.0.0.3-1-i686.xdelta.gz
84K     firefox-2.0.0.3-3--2.0.0.3-2-i686.xdelta.gz
5.6M    firefox-2.0.0.4-1--2.0.0.3-3-i686.xdelta.gz
5.7M    firefox-2.0.0.5-1--2.0.0.4-1-i686.xdelta.gz
784K    firefox-2.0.0.6-1--2.0.0.5-1-i686.xdelta.gz
3.4M    firefox-2.0.0.7-1--2.0.0.6-1-i686.xdelta.gz
2.2M    firefox-2.0.0.8-2--2.0.0.7-1-i686.xdelta.gz
1.1M    firefox-2.0.0.9-1--2.0.0.8-2-i686.xdelta.gz

mmm.. 1MB firefox version upgrade in there... and gotta love the 84k firefox upgrade! smile
firefox was one of the better ones.
Here's openoffice-base (to see how a big package fared):

$ du -h openoffice-base-2.*
95M     openoffice-base-2.2.0-3--2.0.4-1-i686.xdelta.gz
7.2M    openoffice-base-2.2.0-4--2.2.0-3-i686.xdelta.gz
54M     openoffice-base-2.2.1-2--2.2.0-4-i686.xdelta.gz
30M     openoffice-base-2.2.1-3--2.2.1-2-i686.xdelta.gz
59M     openoffice-base-2.3.0-1--2.2.1-3-i686.xdelta.gz
7.3M    openoffice-base-2.3.0-2--2.3.0-1-i686.xdelta.gz
38M     openoffice-base-2.3.0-4--2.3.0-2-i686.xdelta.gz
31M     openoffice-base-2.3.1-1--2.3.0-4-i686.xdelta.gz
8.6M    openoffice-base-2.3.1-2--2.3.1-1-i686.xdelta.gz
69M     openoffice-base-2.4.0-1--2.3.1-2-i686.xdelta.gz
9.2M    openoffice-base-2.4.0-2--2.4.0-1-i686.xdelta.gz
112M    openoffice-base-2.4.0-2-i686.pkg.tar.gz

PS: I did this using a quick python script, I didn't post the python script because it was a pretty quick (and messy) hack. I'll post it if someone requests/wants it.

Last edited by sabooky (2008-06-10 01:10:31)

Offline

#2 2008-09-24 13:32:21

ngaba
Pacman Developer
Registered: 2008-05-13
Posts: 16

Re: pacman/xdelta/compression experiment and results (statistics!)

Well, theoretically pacman should work with deltas, but unfortunately our repos don't provide deltas atm. I am a bit unsure about this: the delta implementation may be completely broken. (See pacman-dev ML for details)

A bit offtopic here:
Personally, I don't like the complexity introduced by delta implementation in pacman source code. So if someone knows a working implementation of some general "client-server" delta stuff (efficient "binary-rsync"), which should be ~usable with XferCommand, please let me know... That would be a much cleaner solution.

Offline

#3 2008-09-24 16:28:29

shining
Pacman Developer
Registered: 2006-05-10
Posts: 2,043

Re: pacman/xdelta/compression experiment and results (statistics!)


pacman roulette : pacman -S $(pacman -Slq | LANG=C sort -R | head -n $((RANDOM % 10)))

Offline

#4 2008-09-25 12:11:44

ngaba
Pacman Developer
Registered: 2008-05-13
Posts: 16

Re: pacman/xdelta/compression experiment and results (statistics!)

First, I can see one drawback of my preferred out-of-alpm delta stuff: download size cannot be computed easily.

"wget-xdelta.sh"
Yes. It is a good starting point. (I didn't know about that.) However, it assumes that the repo has yourversion->newversion delta (Otherwise it will download the new package.)

I thought something more general, independent from alpm/pacman. Something similar to our implemented delta code (Maybe a talent programmer could "detach" it from alpm code. I tried, but I could not: I have zero experience in client-server network programming):
1. Client wants to download delta://delta.archlinux.org/.../foo.tar.gz
2. Server sends the list of "delta-graph" vertices (filenames with md5sums).
3. Client searches for matching files in his file container (~package cache) and sends matching filenames.
4. Server sends some delta patches (if the sum of patch-sizes is lower than target file-size, otherwise the whole target-file)

This is just an ambitious idea, but I am pretty sure that the whole open-source community could profit from such tool. (I was surprised, when I realized that there is no powerful client-server stuff atm, I suspect that someone will create such project soon)

Offline

#5 2008-10-14 08:02:22

sabooky
Member
Registered: 2006-11-02
Posts: 89

Re: pacman/xdelta/compression experiment and results (statistics!)

mazinger wrote:

Hello Sabooky

Can you send me or post the example code.

I´m using xdelta for extract binary file difference and my delta files are so big.



Thanks.

Here's the script that I used to generate these files. It barely has any comments.. wasn't intended to be anything more than a quick hack. I didn't bother actually looking at the code, I did run it really quick to test.. and it seems to be working..  seems you just copy it to an empty dir, then run it with no arguments.


#!/usr/bin/env python

import os
import sys
import shutil
import subprocess


# create a dictionary of pkg name->pkgfile.
def get_lookup(dir):
    # initial variables
    pkg_lookup = {}
    # read the directory and parse the files
    dir = [x for x in os.listdir(dir) if x.endswith('.pkg.tar.gz')]
    for pkg in dir:
        name = pkg.replace('-i686.pkg.tar.gz', '')
        name = name.rsplit('-',2)[0]
        pkg_lookup.setdefault(name,[])
        pkg_lookup[name].append(pkg)
    # sort the pkgs
    for name, files in pkg_lookup.iteritems():
        files.sort(key=get_ver,
                   cmp=lambda x,y: cmp(x[:-1], y[:-1]) or x[-1]-y[-1],
                   reverse=True)

    return pkg_lookup

def get_ver(fname, str=False):
    ver = fname.replace('.pkg.tar.gz', '')
    ver = ver.replace('-i686', '')
    ver = ver.rsplit('-',2)[1:]
    if str:
        return '-'.join(ver)
    ver = ver.pop(0).split('.') + ver
    ver = [int(x) for x in ver if x.isdigit()]
    return ver

if __name__ == '__main__':
    arch = 'i686'
    dir = '/var/cache/pacman/pkg'
    pkg_lookup = get_lookup(dir)
    i = 0
    for pkg, files in pkg_lookup.iteritems():
        last = None
        for file in files:
            if not last:
                last = file
                print 'copying %s' % file
                shutil.copy('%s/%s' % (dir,file), '.')
                continue
            ver_last = get_ver(last,1)
            ver_file = get_ver(file,1)
            delta_file =  '%s-%s--%s-%s.xdelta' % (pkg, ver_last, ver_file, arch)
            print 'creating:', delta_file
            full_last = '%s/%s' % (dir,last)
            full_file = '%s/%s' % (dir,file)
            #os.system('xdelta delta -0 %s %s %s' % (full_last, full_file, delta_file))
            subprocess.call(['xdelta', 'delta','-0', full_last, full_file, delta_file],
                            stdout = subprocess.PIPE)
            print 'compressing:', delta_file
            #os.system('7z a %s.7z %s' % (delta_file,delta_file))
            subprocess.call(['7z', 'a', '%s.7z' %delta_file, delta_file],
                            stdout = subprocess.PIPE)
            os.remove(delta_file)
            last = file

Offline

#6 2009-02-27 20:36:58

shining
Pacman Developer
Registered: 2006-05-10
Posts: 2,043

Re: pacman/xdelta/compression experiment and results (statistics!)

sabooky wrote:

xdelta3: Even though it's supposedly faster, and more efficient.. my tests showed otherwise. It was slightly slower then xdelta and a bit worse compression. *shrug*

Well it is not only about speed and efficiency. There is also features and maintenance to consider.
I decided to move to xdelta3 because it looked to be the way forward, and one of its flag solves a lot of issue :
   -R           disable external recompression (decode)
This allows us to apply our own compression on the packages (gzip -n), and make sure a package created with xdelta will be exactly the same as the original (same md5sum).

xdelta3 has several flags which could be worth playing with :
http://code.google.com/p/xdelta/wiki/CommandLineSyntax
http://code.google.com/p/xdelta/wiki/BetterCompression
http://code.google.com/p/xdelta/wiki/TuningMemoryBudget

And it's quite important to consider the compression / decompression time.
Remember that arch packages uses gz rather than bzip2 because gz is much faster.


pacman roulette : pacman -S $(pacman -Slq | LANG=C sort -R | head -n $((RANDOM % 10)))

Offline

#7 2009-03-01 16:13:48

andre.ramaciotti
Member
From: Brazil
Registered: 2007-04-06
Posts: 649

Re: pacman/xdelta/compression experiment and results (statistics!)

I have a question about it. For example, from the original post:

$ du -h openoffice-base-2.*
95M     openoffice-base-2.2.0-3--2.0.4-1-i686.xdelta.gz
7.2M    openoffice-base-2.2.0-4--2.2.0-3-i686.xdelta.gz
54M     openoffice-base-2.2.1-2--2.2.0-4-i686.xdelta.gz
30M     openoffice-base-2.2.1-3--2.2.1-2-i686.xdelta.gz
59M     openoffice-base-2.3.0-1--2.2.1-3-i686.xdelta.gz
7.3M    openoffice-base-2.3.0-2--2.3.0-1-i686.xdelta.gz
38M     openoffice-base-2.3.0-4--2.3.0-2-i686.xdelta.gz
31M     openoffice-base-2.3.1-1--2.3.0-4-i686.xdelta.gz
8.6M    openoffice-base-2.3.1-2--2.3.1-1-i686.xdelta.gz
69M     openoffice-base-2.4.0-1--2.3.1-2-i686.xdelta.gz
9.2M    openoffice-base-2.4.0-2--2.4.0-1-i686.xdelta.gz
112M    openoffice-base-2.4.0-2-i686.pkg.tar.gz

Ok, so what happens if I have version 2.3.0-2 here and I want to upgrade to 2.4.0-2? I'd have to download (7.3+38+31+8.6+69+9.2)M? In this case it would be better to download the new package directly, right?

Last edited by andre.ramaciotti (2009-03-01 16:14:28)


(lambda ())

Offline

#8 2009-03-01 16:15:48

Allan
is always right
From: Brisbane, AU
Registered: 2007-06-09
Posts: 10,463
Website

Re: pacman/xdelta/compression experiment and results (statistics!)

andre.ramaciotti wrote:

Ok, so what happens if I have version 2.3.0-2 here and I want to upgrade to 2.4.0-2? I'd have to download (7.3+38+31+8.6+69+9.2)M? In this case it would be better to download the new package directly, right?

Pacman has that logic in it.  It think if the deltas are >80% of the package download, then it downloads the packages directly.

Offline

#9 2009-03-01 17:08:28

andre.ramaciotti
Member
From: Brazil
Registered: 2007-04-06
Posts: 649

Re: pacman/xdelta/compression experiment and results (statistics!)

Thanks, I just wanted to check. smile


(lambda ())

Offline

#10 2009-03-01 20:44:43

shining
Pacman Developer
Registered: 2006-05-10
Posts: 2,043

Re: pacman/xdelta/compression experiment and results (statistics!)


pacman roulette : pacman -S $(pacman -Slq | LANG=C sort -R | head -n $((RANDOM % 10)))

Offline

Board footer

Powered by FluxBB