You are not logged in.

#126 2007-05-30 15:16:15

phrakture
Arch Overlord
From: behind you
Registered: 2003-10-29
Posts: 7,879
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

space-m0nkey wrote:
phrakture wrote:
space-m0nkey wrote:

To make it easier for the devs to integrate this into pacman (there's been a lot of changes in makepkg since 3.0.4) I've created a git branch for the xdelta changes.

http://repo.or.cz/w/pacman.git?a=shortlog;h=xdelta

Hey Andrew,
I'm not entirely sure on this, but I thought that dale should be indicated somewhere in these patches (I mean, he did do most of the work here), via the --author flag to git-commit.

Seeing as it looks to be only two patches, could you possibly commit them with some indication of where it came from, just so I can give the proper credit when I pull this (I pulled now, but didn't merge to the master branch just yet due to this issue).

Ooops my bad, I've fixed the patches, added the xfer script, and made Dale the author.

Great, thanks.  Just out of curiosity, how did you do it? I could think of 3 or so ways, but none of them seemed ideal.  The way I would have done it, considering it's only 2 patches, is to delete that branch and just readd them.

Offline

#127 2007-05-30 15:16:24

shining
Pacman Developer
Registered: 2006-05-10
Posts: 2,043

Re: Binary Diffs for Pacman, a detailed proposal + evidence

billy wrote:

but maybe this could be useful for having binary packages with different functionality enabled just by applying an xdelta to it.

Switching to a package with different functionality / dependencies is not the common case. And besides, it seems to me arch doesn't have a lot of packages like that (which is not necessarily a good thing, but that's another problem).
The common case is to just run pacman -Syu which will just upgrade the existing packages, so that's the case where it's the most important to have xdelta.
Take for example twinkle and twinkle-kdefree. If you install it, or have it installed, how often will you switch between both, and how often will you upgrade the one you chose ?
Also, what if there are more than two versions, for example foo-dep1, foo-dep2, foo-dep3. You want a xdelta for each pair, or just between the main one and all others ?

But I don't know, it could still be a good idea nevertheless. If xdelta works well in this case (though that's not sure), and if it's easy to implement, or if many packages would benefit from this, then why not?


pacman roulette : pacman -S $(pacman -Slq | LANG=C sort -R | head -n $((RANDOM % 10)))

Offline

#128 2007-05-30 15:19:48

space-m0nkey
Member
From: UK
Registered: 2007-03-26
Posts: 16

Re: Binary Diffs for Pacman, a detailed proposal + evidence

phrakture wrote:
space-m0nkey wrote:
phrakture wrote:

Hey Andrew,
I'm not entirely sure on this, but I thought that dale should be indicated somewhere in these patches (I mean, he did do most of the work here), via the --author flag to git-commit.

Seeing as it looks to be only two patches, could you possibly commit them with some indication of where it came from, just so I can give the proper credit when I pull this (I pulled now, but didn't merge to the master branch just yet due to this issue).

Ooops my bad, I've fixed the patches, added the xfer script, and made Dale the author.

Great, thanks.  Just out of curiosity, how did you do it? I could think of 3 or so ways, but none of them seemed ideal.  The way I would have done it, considering it's only 2 patches, is to delete that branch and just readd them.

Deleting the branch is the easiest way i know of...

git co master
git branch -D xdelta
git co -b xdelta
...patch...
git commit --author
git push --force ...


"Instead, people would take pains to tell her that beauty was only skin-deep, as if a man ever fell for an attractive pair of kidneys."
(Terry Pratchett, Maskerade)

Offline

#129 2007-05-30 15:32:10

billy
Member
From: Slovenia
Registered: 2006-09-13
Posts: 164

Re: Binary Diffs for Pacman, a detailed proposal + evidence

shining wrote:

Also, what if there are more than two versions, for example foo-dep1, foo-dep2, foo-dep3. You want a xdelta for each pair, or just between the main one and all others ?

yeah, i thought of this to. i don't think there is any way of "mixing" together different xdeltas of one package.

Offline

#130 2007-05-30 18:29:24

Thikasabrik
Member
Registered: 2004-02-23
Posts: 92

Re: Binary Diffs for Pacman, a detailed proposal + evidence

Okay. I wanna break down what I found, which is beginning to result in a course of action, at least in my head...

I think we should adopt my patches for makepkg.in (assuming we stick to xdelta 1), and make a corresponding change to the downloader script (which could take one of two forms), because..

1. We need to use the same compressor in package creation and patching to have a chance at holding the md5sum of the tar.gz constant (as is well known).
2. A single version of gzip (where the algorithm is identical) should always produce identical output when:
   a. the source is the same
   b. the -n option is used to avoid saving a timestamp (xdelta does this)
   c. all other gz parameters (including any only accessible if using zlib) are the same
(someone yell if there are factors missing here)
3. We can achieve the above using xdelta's compression in both cases or using gzip -n in both cases, but not with a mixture of the two because xdelta sets up compression somehow differently to a normal gzip invocation wrt the 'other parameters'.
4. Using gzip -n means that a problem corner case which could come up in practice* is avoided, so assume this is the way to go.
5. This means we either use more disk space during patching, or take a little more CPU time depending on whether patching involves:
  a. having two decompressed tars in /tmp and then compressing the result tar with gzip -n
  b. having one decompressed tar in /tmp and then re-compressing the result tar with gzip -n (after xdelta compresses it)
  c. there is no c, since xdelta 1 is rather inflexible
(in any case the makepkg part is the same)

Dale, what do you think?

*The problem case is when a packager creates the package without deltas first (gzip invoked to compress) and then comes back and creates the deltas later (xdelta invoked to compress) and only uploads the altered deltas. Not too implausible I think - one would not expect this to be a problem. Or at least, if it isn't a problem we can now have a makepkg flag to 'just create deltas' once the pkg is already built. Conversly, if such an option is not desirable, and the corner case will not arise in practice, then we could just go on using xdelta as the common compressor. In that case I will have wasted some time, but it's not as if I had anything better to do (it's been a great way to avoid revision..:D)

Last edited by Thikasabrik (2007-05-30 18:43:40)

Offline

#131 2007-05-31 05:00:28

phrakture
Arch Overlord
From: behind you
Registered: 2003-10-29
Posts: 7,879
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

Pushed the patches from Andrew's tree to the master git tree: http://projects.archlinux.org/git/gitwe … ;a=summary

Offline

#132 2007-06-02 00:55:54

dale77
Member
From: Down under
Registered: 2007-02-10
Posts: 102
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

jaybob20 wrote:

Sorry to get into the mix a little late, but I had played binary diffs before and was looking again when I came across this thread. I had the best luck with bsdiff.

I forget exactly why I stayed with xdelta, probably something to do with the fact that it is what most people seem to be using. Or maybe some technicial thing like support for gzipped content or somesuch.

My concern is mostly for a solution, the exactly optimal solution for patch size (you have demonstrated that time is on the xdelta side) will come with time if required. That optimal solution will probably be only a few percent better than xdelta which gives us the majority of the payback.

My 2c (as I try and figure out this git thing to get a look at the latest code)

Offline

#133 2007-06-02 03:07:10

space-m0nkey
Member
From: UK
Registered: 2007-03-26
Posts: 16

Re: Binary Diffs for Pacman, a detailed proposal + evidence

dale77 wrote:

My 2c (as I try and figure out this git thing to get a look at the latest code)

I recommend http://www.kernel.org/pub/software/scm/ … ryday.html as a good starting point for learning git.


"Instead, people would take pains to tell her that beauty was only skin-deep, as if a man ever fell for an attractive pair of kidneys."
(Terry Pratchett, Maskerade)

Offline

#134 2007-06-02 05:09:07

dale77
Member
From: Down under
Registered: 2007-02-10
Posts: 102
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

space-m0nkey wrote:
dale77 wrote:

My 2c (as I try and figure out this git thing to get a look at the latest code)

I recommend http://www.kernel.org/pub/software/scm/ … ryday.html as a good starting point for learning git.

Thanks. The following got me going..

git clone http://projects.archlinux.org/git/pacman.git
cd pacman
./autogen.sh
./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
make

Last edited by dale77 (2007-06-03 06:59:00)

Offline

#135 2007-06-02 13:19:14

space-m0nkey
Member
From: UK
Registered: 2007-03-26
Posts: 16

Re: Binary Diffs for Pacman, a detailed proposal + evidence

If your just interested in the code and don't want to mess with git I've got snapshots of the source on my server, it's updated everyday.

http://linux.neptune-one.net/pacman/pacman-git.tar.bz2


"Instead, people would take pains to tell her that beauty was only skin-deep, as if a man ever fell for an attractive pair of kidneys."
(Terry Pratchett, Maskerade)

Offline

#136 2007-06-02 15:19:05

Thikasabrik
Member
Registered: 2004-02-23
Posts: 92

Re: Binary Diffs for Pacman, a detailed proposal + evidence

You know what? I' m starting to think the md5sum business is a non-issue. Mainly because the hole only exists if the packager uploads a delta produced on one makepkg run alongside a pkg.tar.gz from another. Different compression engines (gzip vs zlib) or timestamps are, of course, not the only way different md5 sums could be produced - if there is a difference in the pkg dir contents (maybe just a timestamp difference) then this will cause the same problem.

So: We should just have a big ol' warning somewhere telling packagers not to mix and match deltas and tarballs from different makepkg runs. Sound ok?

Offline

#137 2007-06-02 15:49:42

Thikasabrik
Member
Registered: 2004-02-23
Posts: 92

Re: Binary Diffs for Pacman, a detailed proposal + evidence

Okay, here's my suggestion. This patch adds a warning and fixes a bug that prevents makepkg finding old versions of packages.
It also cleans up message output.

--- scripts/makepkg.in    2007-06-02 16:28:47.000000000 +0100
+++ scripts/makepkg.in.1    2007-06-02 16:46:47.000000000 +0100
@@ -608,12 +608,12 @@
 
     local pkg_file=$1
     local cache_dir="/var/cache/pacman/pkg" # TODO: autoconf me
-    local old_versions=( $(ls {"$cache_dir","$PKGDEST"}/${pkgname}-*-${CARCH}.${PKGEXT} 2>/dev/null) )
+    local old_versions=( $(ls {"$cache_dir","$PKGDEST"}/${pkgname}-*-${CARCH}${PKGEXT} 2>/dev/null) )
 
     # Check to see if we have any old versions to create deltas with
     local old_file old_version latest_version base_file
     for old_file in "${old_versions[@]}"; do
-        old_version=$(basename "${old_file%-$CARCH.$PKGEXT}")
+        old_version=$(basename "${old_file%-$CARCH$PKGEXT}")
         old_version=${old_version#$pkgname-}
     
         # old_version may include the target package, only use the old versions
@@ -624,7 +624,7 @@
     done
 
     if [ "$base_file" != "" ]; then
-        msg "Making delta from version $latest_version"
+        msg "$(gettext "Making delta from version $latest_version")"
         local delta_file="$PKGDEST/$pkgname-${old_version}_to_$pkgver-$pkgrel-$CARCH.delta"
 
         # xdelta will decompress base_file & pkg_file into TMP_DIR (or /tmp if TMP_DIR is unset)
@@ -636,9 +636,11 @@
         #
         # makepkg and pacman must use the same compression algorithm or the delta generated
         # package may not match, producing md5 checksum errors.
+        msg2 "$(gettext "Recreating package tarball from delta to match up md5 signatures")"
+        msg2 "$(gettext "NOTE: the delta should ONLY be distributed with this tarball")"
         xdelta patch "$delta_file" "$base_file" "$pkg_file"
     else
-        msg "No previous version found, skipping xdelta"
+        warning "$(gettext "No previous version found, skipping xdelta")"
     fi
 }
 
_

Please apply (applies cleanly to git head). edit: applied

Note: This supercedes all my previous suggestions/patches.

(copied to pacman-dev mailing list)

Last edited by Thikasabrik (2007-06-02 18:17:53)

Offline

#138 2007-06-03 10:35:20

dale77
Member
From: Down under
Registered: 2007-02-10
Posts: 102
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

I've had a look at the new code and I fixed a few "issues" - hopefully added no new ones big_smile

makepkg

1. The old version detection would confuse sub-packages with a base package. i.e. python-foo oldversions would be returned for the python package.
2. I made the oldversion detection CARCH agnostic, it should work on pkg-1.0-1-i686.pkg.tar.gz and pkg-1.0-1.pkg.tar.gz
3. There was a bug with the name of the generated delta file

--- test/pacman/scripts/makepkg    2007-06-03 22:04:53.000000000 +1200
+++ bin/makepkg-new2    2007-06-03 22:14:15.000000000 +1200
@@ -760,13 +760,17 @@
 
     local pkg_file=$1
     local cache_dir="/var/cache/pacman/pkg" # TODO: autoconf me
-    local old_versions=( $(ls {"$cache_dir","$PKGDEST"}/${pkgname}-*-${CARCH}${PKGEXT} 2>/dev/null) )
+    local old_versions=( $(ls {"$cache_dir","$PKGDEST"}/${pkgname}-[0-9]*-[0-9]*${PKGEXT} 2>/dev/null) )
 
     # Check to see if we have any old versions to create deltas with
     local old_file old_version latest_version base_file
     for old_file in "${old_versions[@]}"; do
-        old_version=$(basename "${old_file%-$CARCH$PKGEXT}")
-        old_version=${old_version#$pkgname-}
+    if [[ "$old_file" =~ "$CARCH" ]]; then # if old_file contains CARCH
+      old_version=$(basename "${old_file%-$CARCH$PKGEXT}")
+    else
+      old_version=$(basename "${old_file%$PKGEXT}")
+    fi
+    old_version=${old_version#$pkgname-}
     
         # old_version may include the target package, only use the old versions
         if [ "$old_version" != "$pkgver-$pkgrel" ] && [[ "$old_version" > "$latest_version" ]]; then
@@ -777,7 +781,7 @@
 
     if [ "$base_file" != "" ]; then
         msg "$(gettext "Making delta from version %s...")" "$latest_version"
-        local delta_file="$PKGDEST/$pkgname-${old_version}_to_$pkgver-$pkgrel-$CARCH.delta"
+        local delta_file="$PKGDEST/$pkgname-${latest_version}_to_$pkgver-$pkgrel-$CARCH.delta"
 
         # xdelta will decompress base_file & pkg_file into TMP_DIR (or /tmp if
         # TMP_DIR is unset) then perform the delta on the resulting tars

wget-xdelta.sh

1. The old version detection would confuse sub-packages with a base package. i.e. python-foo oldversions would be returned for the python package.
2. I made the oldversion detection CARCH agnostic, it should work on pkg-1.0-1-i686.pkg.tar.gz and pkg-1.0-1.pkg.tar.gz
3. I added some code to display bytes saved by xdelta

--- home/dale/test/pacman/contrib/wget-xdelta.sh    2007-06-03 21:55:37.000000000 +1200
+++ usr/bin/wget-xdelta.sh    2007-06-03 21:57:40.000000000 +1200
@@ -6,14 +6,22 @@
 # Only check for pkg.tar.gz files in the cache, we download db.tar.gz as well
 if [[ "$o" =~ "pkg.tar.gz" ]] # if $o contains pkg.tar.gz
 then
-  pkgname=${o%-*-[0-9]-${CARCH}.pkg.tar.gz.part}   # Parse out the package name
+  pkgname=${o%-[0-9]*-[0-9]*.pkg.tar.gz.part}   # Parse out the package name
   newend=${o##$pkgname-}                  # Parse out everything following pkgname
-  new_version=${newend%-${CARCH}.pkg.tar.gz.part}  # Strip off .pkg.tar.gz.part leaving version
+  if [[ "$newend" =~ $CARCH ]]; then
+    new_version=${newend%-*}  # Strip off everything after last hyphen leaving version
+  else
+    new_version=${newend%.pkg.tar.gz.part} # Strip off the package extension leaving version
+  fi
   url=${u%/*}
-  for cached_file in $(ls -r /var/cache/pacman/pkg/${pkgname}-*-${CARCH}.pkg.tar.gz 2>/dev/null); do
+  for cached_file in $(ls -r /var/cache/pacman/pkg/${pkgname}-[0-9]*-[0-9]*.pkg.tar.gz 2>/dev/null); do
     # just take the first one, by name. I suppose we could take the latest by date...
     oldend=${cached_file##*/$pkgname-}
-    old_version=${oldend%-${CARCH}.pkg.tar.gz}
+    if [[ "$oldend" =~ $CARCH ]]; then
+      old_version=${oldend%-*}  # Strip off everything after last hyphen leaving version
+    else
+      old_version=${oldend%.pkg.tar.gz} # Strip off the package name
+    fi
     if [ "$old_version" = "$new_version" ]; then
       # We already have the new version in the cache! Just continue the download.
       cached_file=""
@@ -28,9 +36,12 @@
   if wget --passive-ftp -c $url/$delta_name; then
     # Now apply the delta to the cached file to produce the new file
     echo Applying delta...
+    delta_size=$(stat --format=%s $delta_name)
     if xdelta patch $delta_name $cached_file $o; then
       # Remove the delta now that we are finished with it
       rm $delta_name
+      package_size=$(stat --format=%s $o)
+      echo $(( $package_size-$delta_size )) bytes saved by xdelta.
     else
       # Hmmm. xdelta failed for some reason
       rm $delta_name

Offline

#139 2007-06-03 19:01:33

Thikasabrik
Member
Registered: 2004-02-23
Posts: 92

Re: Binary Diffs for Pacman, a detailed proposal + evidence

I made a script (available on request) to test delta engines, and I tested xdelta3 (using secondary djw compression) vs xdelta 1 on a selection of packages in my cache. Here's what I find:

xdelta round 1:
Time to make deltas: 65 s
Time to apply them: 52 s
Total deltas: 36045088 bytes
Total packages: 125157309 bytes
Difference = 89112221 bytes

xdelta3 round 1:
Time to make deltas: 69 s
Time to apply them: 41 s
Total deltas: 36224447 bytes
Total packages: 125157309 bytes
Difference = 88932862 bytes

xdelta round 2:
Time to make deltas: 60 s
Time to apply them: 50 s
Total deltas: 36045088 bytes
Total packages: 125157309 bytes
Difference = 89112221 bytes

xdelta3 round 2:
Time to make deltas: 68 s
Time to apply them: 42 s
Total deltas: 36224447 bytes
Total packages: 125157309 bytes
Difference = 88932862 bytes

There doesn't seem to be much to choose between them. xdelta3 gets much slower if I bump up its compression setting, although the deltas then tend to come out smaller than xdelta 1. If I tell xdelta3 not to use secondary compression it is a bit faster, but the deltas come out bigger (I can post more results if people are interested).
The only advantages of xdelta3 I see right now is its reduced disk-space usage during delta creation, and the possibility of completely closing the rather unimportant md5 hole without extra patching time/disc-space. That and it seems to be generally a bit quicker in patching.

The following packages were tested:

docbook-xsl-1.71.1-2.pkg.tar.gz            iso-codes-1.0-1.pkg.tar.gz
docbook-xsl-1.72.0-1.pkg.tar.gz            iso-codes-1.1-1.pkg.tar.gz
epiphany-2.18.1-2.pkg.tar.gz               kernel26-2.6.21.3-1.pkg.tar.gz
epiphany-2.18.2-1.pkg.tar.gz               kernel26-2.6.21.3-2.pkg.tar.gz
evolution-2.10.1-2.pkg.tar.gz              openal-0.0.8-2.pkg.tar.gz
evolution-2.10.2-1.pkg.tar.gz              openal-0.0.8-3.pkg.tar.gz
evolution-data-server-1.10.1-2.pkg.tar.gz  vte-0.16.4-1.pkg.tar.gz
evolution-data-server-1.10.2-1.pkg.tar.gz  vte-0.16.4-2.pkg.tar.gz
gcc-4.1.2-4.pkg.tar.gz                     xorg-server-1.2.0-5.pkg.tar.gz
gcc-4.2.0-2.pkg.tar.gz                     xorg-server-1.3.0.0-1.pkg.tar.gz
gimp-2.2.14-1.pkg.tar.gz                   xulrunner-1.8.1.3-1.pkg.tar.gz
gimp-2.2.15-1.pkg.tar.gz                   xulrunner-1.8.1.4-1.pkg.tar.gz
gnome-games-2.18.1.1-1.pkg.tar.gz          zenity-2.18.1-1.pkg.tar.gz
gnome-games-2.18.2-1.pkg.tar.gz            zenity-2.18.2-1.pkg.tar.gz

Last edited by Thikasabrik (2007-06-03 19:37:33)

Offline

#140 2007-06-04 10:51:37

Thikasabrik
Member
Registered: 2004-02-23
Posts: 92

Re: Binary Diffs for Pacman, a detailed proposal + evidence

Oh, and bsdiff is most definitely out the window.. It is not 'pipable' and does not do (de)compression itself. Thus it has to have the gz files decompressed for it or it will try to diff them without decompression, which makes for large deltas.
When it has tars to work with it produces very small deltas (smaller than xdelta 1 or 3), but it takes ages to do it (30s for a 9mb tar). It also needs "an absolute minimum of 8 times the size of oldfile" as a working memory set for diffing. If we take a kernel, for example, that's 8x60 = 480 mb. A little steep, if you ask me.
Patching is fast with bsdiff, and the diffs are small, but I think this is too much to ask of our package maintainers.

Last edited by Thikasabrik (2007-06-04 10:52:55)

Offline

#141 2007-06-04 11:34:02

dale77
Member
From: Down under
Registered: 2007-02-10
Posts: 102
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

Thanks for that data on xdelta/bsdiff.

Hopefully soon (pacman 3.1?) makepkg xdelta will be out in the wild and we'll be able to benefit with some decent bandwidth savings.

Offline

#142 2007-10-19 22:22:59

dale77
Member
From: Down under
Registered: 2007-02-10
Posts: 102
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

It's exciting to see that some xdelta work is happening in the dev mailing list. Looks like things have moved on a fair way from where I left it. Could one of the devs please update me on when/where pacman is going with xdelta? I'd like to help out if required.

Thanks

Dale

Offline

#143 2007-10-19 23:11:52

cactus
Taco Eater
From: t͈̫̹ͨa͖͕͎̱͈ͨ͆ć̥̖̝o̫̫̼s͈̭̱̞͍̃!̰
Registered: 2004-05-25
Posts: 4,622
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

I recommend asking on the mailing list.
That is where the real pacman dev action occurs.


"Be conservative in what you send; be liberal in what you accept." -- Postel's Law
"tacos" -- Cactus' Law
"t̥͍͎̪̪͗a̴̻̩͈͚ͨc̠o̩̙͈ͫͅs͙͎̙͊ ͔͇̫̜t͎̳̀a̜̞̗ͩc̗͍͚o̲̯̿s̖̣̤̙͌ ̖̜̈ț̰̫͓ạ̪͖̳c̲͎͕̰̯̃̈o͉ͅs̪ͪ ̜̻̖̜͕" -- -̖͚̫̙̓-̺̠͇ͤ̃ ̜̪̜ͯZ͔̗̭̞ͪA̝͈̙͖̩L͉̠̺͓G̙̞̦͖O̳̗͍

Offline

#144 2007-10-22 16:00:26

phrakture
Arch Overlord
From: behind you
Registered: 2003-10-29
Posts: 7,879
Website

Re: Binary Diffs for Pacman, a detailed proposal + evidence

dale77 wrote:

It's exciting to see that some xdelta work is happening in the dev mailing list. Looks like things have moved on a fair way from where I left it. Could one of the devs please update me on when/where pacman is going with xdelta? I'd like to help out if required.

Thanks

Dale

We merged some patches from Nathan Jones here (top 5 at the time of this writing):
http://projects.archlinux.org/git/?p=pa … ;a=summary

Feel free to comment, but, as cactus said, the ML is where this'd get more exposure.

Offline

Board footer

Powered by FluxBB