Basically, I haven't done a pacman -Scc in forever, so I figured I had enough packages to run a decent experiment. Hopefully this data will be helpful to those working on xdelta support, those wondering about it, and anyone looking at other solutions.
My /var/cache/pacman/pkg is 8328.42MB (this is unmodified)
here are the results of my experiment:
default /var/cache/pacman/pkg = 8328.42MB (100% of original)
venti /var/cache/pacman/pkg = 6415.27MB (77% of original, 23% saved)
xdelta /var/cache/pacman/pkg = 5796.53MB (70% of original, 30% saved)
7z /var/cache/pacman/pkg = 5251.38MB (63% of original, 37% saved)
xdelta+gzip /var/cache/pacman/pkg = 3671.21MB (44% of original, 56% saved)
xdelta+7z /var/cache/pacman/pkg = 3113.89MB (37% of original, 63% saved)
default: the untouched repo.
venti: me using plan9port's vac with a venti server. This was a pain: it involved extracting each package to a tmp dir and uploading the files to the venti server. It took ~3 hours (maybe a bit more) and only achieved 23% savings.
xdelta: This uses the xdelta -0 command; I used -0 because, well, I was lazy (it made the output easier to compress for the other tests). I assume xdelta -9 would be close to gzip -9. Even uncompressed delta files outperformed venti (and took much less time). The xdeltas are chained together: the most recent pkg is kept as the reference, and everything else is an incremental delta.
7z: This was me running gunzip on all the packages and 7z-ing the tarballs back up. It performed pretty well, with a 37% reduction, and has the advantage that you can unzip/install right away rather than following a chain of delta patches.
xdelta+gzip: This is gzipping the above deltas (using the default level, which I think is -6): a 56% reduction. That's actually pretty good; 8328.42MB -> 3671.21MB is a significant saving.
xdelta+7z: This is me seeing how far I could take it: a 63% reduction. Not bad at all. (xdelta + 7z took about 1.5 hours.)
The good thing about this test is that it ran over completely random files from 2 years of using Arch, so it wasn't intended to favor any one tool over another. The ratio of unique pkg names to total pkgs is 1011/3402, i.e. about 3.4 files per name: for every kept .tar.gz there were roughly 2-3 deltas.
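That ratio works out as follows (figures from the paragraph above; each unique name keeps one full tarball, everything else becomes a delta):

```python
# Figures from the cache: 1011 unique package names across 3402 cached files.
unique_names = 1011
total_pkgs = 3402

# The newest file of each name stays a full tarball; the rest become deltas.
deltas = total_pkgs - unique_names
print(deltas)                           # 2391 deltas
print(round(deltas / unique_names, 1))  # ~2.4 deltas per kept tarball
```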
Why xdelta and not other delta generators?
I toyed around with other delta software: bdelta, xdelta3 and bsdiff. I did some limited testing (firefox/kernel/openoffice) with them before deciding to use xdelta as my delta generator.
bdelta: Pretty good, but it took almost 3x as long as xdelta, and xdelta -0 + 7z compressed to smaller files.
xdelta3: Even though it's supposedly faster and more efficient, my tests showed otherwise: it was slightly slower than xdelta with a bit worse compression. *shrug*
bsdiff: bsdiff is hands down the best-compressing delta generator of them all; it would even beat 7zipped xdeltas. The only problem is that it requires insane amounts of memory and time. I think the kernel was the biggest file I could handle: around 80MB of unzipped tars maxed out my RAM (2GB, though I only gave the virtual machine 1.6GB). The requirements were a bit insane; I couldn't do openoffice with it.
files snippet (full list is available here):
$ du -h firefox-2*
2.6M firefox-2.0.0.10-1--2.0.0.9-1-i686.xdelta.gz
92K firefox-2.0.0.10-3--2.0.0.10-1-i686.xdelta.gz
620K firefox-2.0.0.11-1--2.0.0.10-3-i686.xdelta.gz
2.6M firefox-2.0.0.12-1--2.0.0.11-1-i686.xdelta.gz
1008K firefox-2.0.0.12-2--2.0.0.12-1-i686.xdelta.gz
5.7M firefox-2.0.0.13-1--2.0.0.12-2-i686.xdelta.gz
1.3M firefox-2.0.0.14-1--2.0.0.13-1-i686.xdelta.gz
9.5M firefox-2.0.0.14-1-i686.pkg.tar.gz
1.2M firefox-2.0.0.3-1--2.0.0.2-1-i686.xdelta.gz
1.2M firefox-2.0.0.3-2--2.0.0.3-1-i686.xdelta.gz
84K firefox-2.0.0.3-3--2.0.0.3-2-i686.xdelta.gz
5.6M firefox-2.0.0.4-1--2.0.0.3-3-i686.xdelta.gz
5.7M firefox-2.0.0.5-1--2.0.0.4-1-i686.xdelta.gz
784K firefox-2.0.0.6-1--2.0.0.5-1-i686.xdelta.gz
3.4M firefox-2.0.0.7-1--2.0.0.6-1-i686.xdelta.gz
2.2M firefox-2.0.0.8-2--2.0.0.7-1-i686.xdelta.gz
1.1M firefox-2.0.0.9-1--2.0.0.8-2-i686.xdelta.gz
mmm.. 1MB firefox version upgrade in there... and gotta love the 84k firefox upgrade!
firefox was one of the better ones.
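Since the most recent package is kept whole and everything older is an incremental delta, the filenames above read `<pkg>-<from>--<to>`: each patch steps from a newer version down to the next older one. A sketch of recovering the patch order from the names alone (pure string parsing, no xdelta invoked; the helper name is made up for illustration):

```python
# Delta names follow "<pkg>-<from>--<to>-<arch>.xdelta.gz". Build a map
# from_version -> (to_version, delta filename) and walk it downwards from
# the version you have (the kept tarball) to the older version you want.
def patch_chain(deltas, have, want):
    step = {}
    for name in deltas:
        frm, to = name.replace('.xdelta.gz', '').split('--')
        frm_ver = '-'.join(frm.split('-')[-2:])  # trailing version-release
        to_ver = '-'.join(to.split('-')[:-1])    # strip trailing arch
        step[frm_ver] = (to_ver, name)
    chain = []
    while have != want:
        have, delta = step[have]
        chain.append(delta)
    return chain

deltas = [
    'firefox-2.0.0.11-1--2.0.0.10-3-i686.xdelta.gz',
    'firefox-2.0.0.10-3--2.0.0.10-1-i686.xdelta.gz',
    'firefox-2.0.0.10-1--2.0.0.9-1-i686.xdelta.gz',
]
print(patch_chain(deltas, '2.0.0.11-1', '2.0.0.9-1'))  # all three, newest first
```

So reconstructing 2.0.0.9-1 from the kept 2.0.0.11-1 tarball means applying those three patches in order.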
Here's openoffice-base (to see how a big package fared):
$ du -h openoffice-base-2.*
95M openoffice-base-2.2.0-3--2.0.4-1-i686.xdelta.gz
7.2M openoffice-base-2.2.0-4--2.2.0-3-i686.xdelta.gz
54M openoffice-base-2.2.1-2--2.2.0-4-i686.xdelta.gz
30M openoffice-base-2.2.1-3--2.2.1-2-i686.xdelta.gz
59M openoffice-base-2.3.0-1--2.2.1-3-i686.xdelta.gz
7.3M openoffice-base-2.3.0-2--2.3.0-1-i686.xdelta.gz
38M openoffice-base-2.3.0-4--2.3.0-2-i686.xdelta.gz
31M openoffice-base-2.3.1-1--2.3.0-4-i686.xdelta.gz
8.6M openoffice-base-2.3.1-2--2.3.1-1-i686.xdelta.gz
69M openoffice-base-2.4.0-1--2.3.1-2-i686.xdelta.gz
9.2M openoffice-base-2.4.0-2--2.4.0-1-i686.xdelta.gz
112M openoffice-base-2.4.0-2-i686.pkg.tar.gz
PS: I did this with a quick python script. I didn't post it because it was a quick (and messy) hack; I'll post it if someone wants it.
Last edited by sabooky (2008-06-10 01:10:31)
Well, theoretically pacman should work with deltas, but unfortunately our repos don't provide deltas atm. I'm also a bit unsure here: the delta implementation may be completely broken (see the pacman-dev ML for details).
A bit off-topic here:
Personally, I don't like the complexity the delta implementation introduces into the pacman source code. So if someone knows a working implementation of some general "client-server" delta tool (an efficient "binary rsync") that would be usable with XferCommand, please let me know. That would be a much cleaner solution.
What about this?:
http://projects.archlinux.org/?p=pacman … sh;hb=HEAD
pacman roulette : pacman -S $(pacman -Slq | LANG=C sort -R | head -n $((RANDOM % 10)))
First, I can see one drawback of my preferred out-of-alpm delta approach: the download size cannot be computed easily.
"wget-xdelta.sh"
Yes, it is a good starting point (I didn't know about it). However, it assumes that the repo has a yourversion->newversion delta; otherwise it will download the whole new package.
I was thinking of something more general, independent of alpm/pacman, similar to our implemented delta code. (Maybe a talented programmer could "detach" it from the alpm code. I tried, but I couldn't: I have zero experience in client-server network programming.)
1. Client wants to download delta://delta.archlinux.org/.../foo.tar.gz
2. Server sends the list of "delta-graph" vertices (filenames with md5sums).
3. Client searches for matching files in its local file container (~package cache) and sends back the matching filenames.
4. Server sends the delta patches (if the sum of the patch sizes is lower than the target file size; otherwise the whole target file).
This is just an ambitious idea, but I am pretty sure the whole open-source community could profit from such a tool. (I was surprised to realize there is no powerful client-server delta tool atm; I suspect someone will create such a project soon.)
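The four steps above can be simulated in a few lines, without any networking, to show the decision the server makes in step 4 (all filenames, contents, and sizes here are invented for illustration):

```python
import hashlib

def md5(data):
    # hex md5 digest, standing in for the checksums in the "delta-graph"
    return hashlib.md5(data).hexdigest()

# Step 2: the server sends its delta-graph vertices (filenames with md5sums).
server_vertices = {'foo-1.0.pkg.tar.gz': md5(b'foo-1.0 contents')}

# Step 3: the client checks its package cache and reports matching filenames.
client_cache = {'foo-1.0.pkg.tar.gz': b'foo-1.0 contents'}
matches = [name for name, digest in server_vertices.items()
           if name in client_cache and md5(client_cache[name]) == digest]

# Step 4: the server ships patches only if their combined size beats
# downloading the whole target file.
target_size = 100
patch_sizes = [30, 25]
send_patches = bool(matches) and sum(patch_sizes) < target_size
print(matches)       # ['foo-1.0.pkg.tar.gz']
print(send_patches)  # True: 55 < 100, so send the deltas
```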
Hello Sabooky
Can you send me or post the example code?
I'm using xdelta to extract binary file differences, and my delta files are quite big.
Thanks.
Here's the script I used to generate these files. It barely has any comments; it wasn't intended to be anything more than a quick hack. I didn't look closely at the code again, but I ran it quickly and it seems to work: copy it to an empty dir and run it with no arguments.
#!/usr/bin/env python
import os
import shutil
import subprocess

# create a dictionary of pkg name -> list of package files
def get_lookup(dir):
    pkg_lookup = {}
    # read the directory and parse the files
    pkgs = [x for x in os.listdir(dir) if x.endswith('.pkg.tar.gz')]
    for pkg in pkgs:
        name = pkg.replace('-i686.pkg.tar.gz', '')
        name = name.rsplit('-', 2)[0]
        pkg_lookup.setdefault(name, [])
        pkg_lookup[name].append(pkg)
    # sort each name's files, newest version first
    for name, files in pkg_lookup.iteritems():
        files.sort(key=get_ver,
                   cmp=lambda x, y: cmp(x[:-1], y[:-1]) or x[-1] - y[-1],
                   reverse=True)
    return pkg_lookup

# extract the version from a filename, either as a "version-release"
# string or as a list of ints suitable for sorting
def get_ver(fname, as_str=False):
    ver = fname.replace('.pkg.tar.gz', '')
    ver = ver.replace('-i686', '')
    ver = ver.rsplit('-', 2)[1:]
    if as_str:
        return '-'.join(ver)
    ver = ver.pop(0).split('.') + ver
    ver = [int(x) for x in ver if x.isdigit()]
    return ver

if __name__ == '__main__':
    arch = 'i686'
    dir = '/var/cache/pacman/pkg'
    pkg_lookup = get_lookup(dir)
    for pkg, files in pkg_lookup.iteritems():
        last = None
        for file in files:
            if not last:
                # newest version: keep the full package as the chain's reference
                last = file
                print 'copying %s' % file
                shutil.copy('%s/%s' % (dir, file), '.')
                continue
            # older version: generate an incremental delta against the
            # previous (newer) file, then 7z it
            ver_last = get_ver(last, 1)
            ver_file = get_ver(file, 1)
            delta_file = '%s-%s--%s-%s.xdelta' % (pkg, ver_last, ver_file, arch)
            print 'creating:', delta_file
            full_last = '%s/%s' % (dir, last)
            full_file = '%s/%s' % (dir, file)
            subprocess.call(['xdelta', 'delta', '-0', full_last, full_file, delta_file],
                            stdout=subprocess.PIPE)
            print 'compressing:', delta_file
            subprocess.call(['7z', 'a', '%s.7z' % delta_file, delta_file],
                            stdout=subprocess.PIPE)
            os.remove(delta_file)
            last = file
xdelta3: Even though it's supposedly faster and more efficient, my tests showed otherwise: it was slightly slower than xdelta with a bit worse compression. *shrug*
Well, it is not only about speed and efficiency; there are also features and maintenance to consider.
I decided to move to xdelta3 because it looked to be the way forward, and one of its flags solves a lot of issues:
-R disable external recompression (decode)
This allows us to apply our own compression to the packages (gzip -n) and makes sure a package recreated with xdelta3 is exactly the same as the original (same md5sum).
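The gzip -n point is about reproducibility: by default gzip embeds the source file's mtime in the archive header, so recompressing identical data at different times yields different bytes (and different md5sums). Python's gzip module exposes the same header field, which makes this easy to demonstrate:

```python
import gzip
import hashlib

data = b'same tarball contents'

# With a fixed mtime (what `gzip -n` effectively gives you: no timestamp),
# recompression is byte-for-byte reproducible.
a = gzip.compress(data, mtime=0)
b = gzip.compress(data, mtime=0)
print(a == b)  # True: identical archives, identical md5sum

# With different embedded timestamps the payload is identical but the
# archives (and hence their md5sums) differ.
c = gzip.compress(data, mtime=1)
print(a == c)                                                    # False
print(gzip.decompress(a) == gzip.decompress(c))                  # True
print(hashlib.md5(a).hexdigest() == hashlib.md5(c).hexdigest())  # False
```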
xdelta3 also has several flags that could be worth playing with:
http://code.google.com/p/xdelta/wiki/CommandLineSyntax
http://code.google.com/p/xdelta/wiki/BetterCompression
http://code.google.com/p/xdelta/wiki/TuningMemoryBudget
And the compression/decompression time is quite important to consider: remember that Arch packages use gz rather than bzip2 because gz is much faster.
I have a question about it. For example, from the original post:
$ du -h openoffice-base-2.*
95M openoffice-base-2.2.0-3--2.0.4-1-i686.xdelta.gz
7.2M openoffice-base-2.2.0-4--2.2.0-3-i686.xdelta.gz
54M openoffice-base-2.2.1-2--2.2.0-4-i686.xdelta.gz
30M openoffice-base-2.2.1-3--2.2.1-2-i686.xdelta.gz
59M openoffice-base-2.3.0-1--2.2.1-3-i686.xdelta.gz
7.3M openoffice-base-2.3.0-2--2.3.0-1-i686.xdelta.gz
38M openoffice-base-2.3.0-4--2.3.0-2-i686.xdelta.gz
31M openoffice-base-2.3.1-1--2.3.0-4-i686.xdelta.gz
8.6M openoffice-base-2.3.1-2--2.3.1-1-i686.xdelta.gz
69M openoffice-base-2.4.0-1--2.3.1-2-i686.xdelta.gz
9.2M openoffice-base-2.4.0-2--2.4.0-1-i686.xdelta.gz
112M openoffice-base-2.4.0-2-i686.pkg.tar.gz
Ok, so what happens if I have version 2.3.0-2 here and I want to upgrade to 2.4.0-2? I'd have to download (7.3+38+31+8.6+69+9.2)M? In this case it would be better to download the new package directly, right?
Last edited by andre.ramaciotti (2009-03-01 16:14:28)
(lambda ())
Offline
Ok, so what happens if I have version 2.3.0-2 here and I want to upgrade to 2.4.0-2? I'd have to download (7.3+38+31+8.6+69+9.2)M? In this case it would be better to download the new package directly, right?
Pacman has that logic built in. I think if the deltas add up to more than 80% of the full package download, it downloads the package directly.
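Applied to the OpenOffice listing above, that cutoff makes the decision clear-cut (taking the 80% threshold as stated; sizes in MB from the listing):

```python
# Sizes in MB from the listing: the delta chain 2.3.0-2 -> 2.4.0-2
# versus just fetching the full 2.4.0-2 package.
delta_sizes = [7.3, 38, 31, 8.6, 69, 9.2]
full_pkg = 112

chain_total = sum(delta_sizes)
print(round(chain_total, 1))         # 163.1 MB of patches
print(chain_total < 0.8 * full_pkg)  # False: fetch the package directly
```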
Offline
Thanks, I just wanted to check.