#1 2004-07-13 19:31:43

cjdj
Member
From: Perth, Western Australia
Registered: 2004-05-07
Posts: 121

package diffs

I've seen quite a few requests by people on dial-up connections for a method to download diff packages instead of full ones.   

Since I have a high-speed connection, I haven't been too concerned about it, but I am about to roll out a bunch of new machines and started thinking that downloading diffs of packages might be a good idea after all.

So I thought about it a little bit and have come up with a solution that will work on top of what is already there.  No changes would be needed with pacman.   In fact, I've messed around with it for the past couple of hours and am almost done.

Just for the proof of concept, I've written it in Perl, but if people actually start using it, I might rewrite it in C.

Basically, this is what it does.  There are two parts to it. 

The server part
I have a Perl script on my webserver that runs every half hour and checks the main ftp.archlinux.org/current and /extra repositories.  It keeps track of each package in there.  When a package is updated, it downloads the new version.  Then it looks in one of the other mirrors for the old package that is being replaced, and downloads that too.
Then it extracts each package and compares them file by file.  Any file that is new or updated is put in a new folder.  It makes a list of all the new, same, updated, and deleted files.  It then packages all this up into a diffpkg.tar.gz file and stores it on the webserver ready to download (if it is at least 10% smaller than the actual package).  This part is already done, although I haven't put it on a 30-minute schedule yet.  Also note that it copies any new or updated file whole.  It does not actually make a diff of each file that is different.
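
A rough Python sketch of the file-by-file comparison described above. The actual script is Perl and was not posted, so `list_files` and `classify` are made-up names, and comparing file contents with `filecmp` is only an assumption about how "updated" is decided. Symbolic links and empty directories are not handled specially.

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def classify(old_root, new_root):
    """Sort every file of two extracted packages into new/same/updated/deleted."""
    old_files = list_files(old_root)
    new_files = list_files(new_root)
    result = {
        # Only in the new package.
        "new": sorted(new_files - old_files),
        # Only in the old package.
        "deleted": sorted(old_files - new_files),
        "same": [],
        "updated": [],
    }
    for rel in sorted(old_files & new_files):
        # shallow=False forces a byte-by-byte comparison of the contents.
        identical = filecmp.cmp(
            os.path.join(old_root, rel), os.path.join(new_root, rel), shallow=False
        )
        result["same" if identical else "updated"].append(rel)
    return result
```

Only the files listed under "new" and "updated" would need to be copied into the diffpkg folder; the four lists themselves would be shipped alongside them.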

The client part
The client part is another Perl script that would be referenced in the XferCommand setting in /etc/pacman.conf so that it is run first when attempting to download a file.  It will get the package name from the file URL, and will look in the user's cache to see if there is an older version of the package there.  If so, it will try to download a diffpkg from my website.  If there is no diffpkg file there, then it will use wget to download the file using the original URL that was passed to it.
If it does get a diffpkg file, then it will open it up and check whether the prerequisite package that it was generated from is the one in the user's cache.  If it isn't, then it will download the original URL (this could mean extra downloading if the packages are not installed in sequence, but I may have a solution for this).
It then extracts the cached package into a temporary folder, extracts the diffpkg into another folder, and processes the deleted, updated and new files, copying the content into the new package folder.  When it is done, it tars and gzips it all up and you should now have a package in your cache created from the diff set that is very similar to the package that would have been downloaded in its entirety.  Pacman would then install the package as if it had been downloaded.  It shouldn't care if it was downloaded directly or if it was generated from the diffpkg.
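
The merge step could look roughly like this in Python. This is only a sketch under assumptions: `rebuild_tree` is a made-up name, the list of deleted files is assumed to have been read from the diffpkg already, and `dirs_exist_ok` needs Python 3.8 or later. Symbolic links are not handled specially.

```python
import os
import shutil


def rebuild_tree(old_root, diff_root, deleted, out_root):
    """Recreate the new package tree from the cached tree plus a diffpkg tree.

    deleted is a list of paths, relative to the package root, that exist in
    the old package but not in the new one.
    """
    # Start from a copy of the cached (old) package contents.
    shutil.copytree(old_root, out_root)
    # Drop the files that the new package no longer ships.
    for rel in deleted:
        target = os.path.join(out_root, rel)
        if os.path.lexists(target):
            os.remove(target)
    # Overlay new and updated files. Each one is a complete file, not a patch,
    # so a plain copy over the old version is enough.
    shutil.copytree(diff_root, out_root, dirs_exist_ok=True)
    return out_root
```

The resulting folder is what would then be packed up again into the package file that pacman installs.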


Of course, this will only be of benefit for package upgrades that change or add just a few files.

I just tested this on the neverball package that was just updated.  The old pkg was 9 MB, the new pkg was 14 MB, and the diffpkg was 12 MB, so it would only save you about 2 MB of download.  Not really a whole lot, but to some people it may be helpful.


Thoughts?  Questions?  Anyone interested in testing it out when I'm done?


#2 2004-07-13 20:12:34

ravster
Member
From: Queen's U, Kingston, Canada
Registered: 2004-05-02
Posts: 285
Website

Re: package diffs

It seems like a good idea, and from what I understood of the explanation, it basically reduces redundancy in package upgrades. We would still (maybe) have to use the normal way for adding new packages to the system.

Overall, I think it's a good idea. :D


#3 2004-07-13 21:05:47

sarah31
Member
From: Middle of Canada
Registered: 2002-08-20
Posts: 2,975
Website

Re: package diffs

this has come up several times and there was no satisfactory solution. it requires a bit more work on an already stressed developer core.

my problem with diffs is that you would have to have the contents of the diff available for view. otherwise you could have difficulty convincing people they are secure.

how would you do this?
sanity checks would be much more difficult on diffs i would think.


AKA uknowme

I am not your friend


#4 2004-07-13 22:18:13

jlvsimoes
Member
From: portugal
Registered: 2002-12-23
Posts: 392
Website

Re: package diffs

but the concept is the coolest
maybe developers outside the archlinux team can work on this and then present the finished work to arch


-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GU/ d- s: a- C L U P+ L+++ E--- W+
N 0+ K- W-- !O !M V-- PS+ PE- V++ PGP T 5 Z+ R* TV+ B+
DI-- D- G-- e-- h! r++ z+ z*
------END GEEK CODE BLOCK------


#5 2004-07-13 22:41:16

cjdj
Member
From: Perth, Western Australia
Registered: 2004-05-07
Posts: 121

Re: package diffs

sarah31 wrote:

this has come up several times and there was no satisfactory solution. it requires a bit more work on an already stressed developer core.

Right, because the thought was to change the existing system to accommodate a diff system.  My idea, however, is to add on to the existing system.  It is automatic, and does not require a developer or maintainer to produce the diff set.

sarah31 wrote:

my problem with diffs is that you would have to have the contents of the diff available for view. otherwise you could have difficulty convincing people they are secure.

I'm not sure what you mean.  The diffpkg.tar.gz file is on the webserver, and its contents can be viewed easily.

sarah31 wrote:

how would you do this?
sanity checks would be much more difficult on diffs i would think.

Oh, I think I see what you mean.  The files in the diff archive are not diffs themselves.  They are the entire files, unmodified.  The difference between the package and the diffpackage is that the diffpackage is stripped of the files that are unchanged from the previous version of that package.

I figured that creating a package of diff files to "patch" existing files on the system would not work, because you never really know what files the user has modified unless you can tell in advance which files are config or editable files and which ones aren't... which would put a lot of hassle on maintainers and such.  So I figured, why not create a system that compares the old package file against the new package file and just tars and gzips up the files that have changed or been added.

So if you install widget-1.0.1-3.pkg.tar.gz and still have it in your /var/cache/pacman/pkg directory, and then a new version widget-1.0.1-4.pkg.tar.gz comes along, this application downloads widget-1.0.1-4.diffpkg.tar.gz and uses it to build widget-1.0.1-4.pkg.tar.gz locally from the old one in the cache.
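
To make the naming in the widget example concrete, here is a small Python sketch. The `.diffpkg.tar.gz` suffix comes from the example above; the function names are made up, and how the real script parses names has not been shown.

```python
import posixpath
from urllib.parse import urlparse

PKG_SUFFIX = ".pkg.tar.gz"
DIFF_SUFFIX = ".diffpkg.tar.gz"


def split_package_filename(filename):
    """Split 'name-version-release.pkg.tar.gz' into (name, version, release)."""
    if not filename.endswith(PKG_SUFFIX):
        raise ValueError("not a package file: %r" % filename)
    stem = filename[: -len(PKG_SUFFIX)]
    # Split from the right, because the package name itself may contain hyphens.
    name, version, release = stem.rsplit("-", 2)
    return name, version, release


def diffpkg_filename(package_url):
    """Derive the diffpkg file name from the URL of the full package."""
    filename = posixpath.basename(urlparse(package_url).path)
    split_package_filename(filename)  # raises ValueError if it is not a package
    return filename[: -len(PKG_SUFFIX)] + DIFF_SUFFIX


def other_cached_versions(new_filename, cache_listing):
    """Return cached files of the same package that are not the new file."""
    name = split_package_filename(new_filename)[0]
    matches = []
    for candidate in cache_listing:
        try:
            candidate_name = split_package_filename(candidate)[0]
        except ValueError:
            continue  # skip anything in the cache that is not a package
        if candidate_name == name and candidate != new_filename:
            matches.append(candidate)
    return sorted(matches)
```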

There are some big limitations in this situation.
Firstly, if the new version of the package is recompiled with different options, then all binaries in the package will likely be different.  Therefore a diff set would not be of any use, unless there are other files in the package that were not modified.

Secondly, if the user has cleared their cache, then they won't have the local file to build upon.

And thirdly, enough files need to be unchanged (and therefore left out of the diffpkg) to produce a saving in download size.  So I am thinking that some of the larger packages might be suitable, but certainly not all packages.

So I am guessing that this diff system would only be of benefit to some people, most likely those on a slow dial-up internet connection, and it will only really be of benefit on the very slight updates to packages.  And since some people would be willing to use bzip2 to gain an extra little bit of compression to save download time, they might be willing to use this system, which would probably only give an equivalent boost.


#6 2004-07-13 23:16:51

ravster
Member
From: Queen's U, Kingston, Canada
Registered: 2004-05-02
Posts: 285
Website

Re: package diffs

cjdj wrote:

Oh, I think I see what you mean.  The files in the diff archive are not diffs themselves.  They are the entire files, unmodified.  The difference between the package and the diffpackage is that the diffpackage is stripped of the files that are unchanged from the previous version of that package.

So if you install widget-1.0.1-3.pkg.tar.gz and still have it in your /var/cache/pacman/pkg directory, and then a new version widget-1.0.1-4.pkg.tar.gz comes along, this application downloads widget-1.0.1-4.diffpkg.tar.gz and uses it to build widget-1.0.1-4.pkg.tar.gz locally from the old one in the cache.

I think I understand it now. Do you mean something like the cvs update command?

cvs update -dP


#7 2004-07-14 00:20:22

Dusty
Schwag Merchant
From: Medicine Hat, Alberta, Canada
Registered: 2004-01-18
Posts: 5,986
Website

Re: package diffs

I might try it. Seems like a good idea, and a simple implementation.

Real binary diffs and bz2 compression would be great, but as Sarah says, binary diffs would be hard to secure.  On the other hand, what proof do I have that the packages in current or extra are secure?  There can be md5sums on diffs too, but it still comes down to trusting that the person who made the sums is honest.

Dusty


#8 2004-07-14 03:39:28

sarah31
Member
From: Middle of Canada
Registered: 2002-08-20
Posts: 2,975
Website

Re: package diffs

well for one thing there is a sanity check on the downloaded package. you should be able to trust the build, first because of the integrity check but mostly because of the md5sums of the actual source. if one doubts those md5sums then they wouldn't have much trust in the openly stated source urls either.

no doubt you could md5sum the diffs but how can a user be sure that the diffs are created from authentic source and not the creation of a hacker? if the developers are not making the diffs then how can i trust them?

as you say this also depends on one not dumping their cache and no user modifications of the files installed by the original package ... or would the diff follow a user's settings in pacman?

the code should be the same language as pacman as well .. not perl or python.




#9 2004-07-14 09:12:57

i3839
Member
Registered: 2004-02-04
Posts: 1,185

Re: package diffs

Authentication is simple in theory:

Use the diff to update a package. That should give exactly the same result as you would get when downloading the new package, so running md5sum on the result should give the same checksum as that of the new package. This way you use the trusted md5sum of the official package, so there is no need at all to trust the diff. Worst case is that you downloaded the diff for nothing.

The only tricky part is making the resulting package exactly the same as the original package, because you only have the md5sum of the tgz. Compression isn't really the problem, because it's deterministic with the same compression options. The main problem, I think, is getting the files into the right order in the tar. If tar can sort the files then it's solved.
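
The check itself is only a few lines. A Python sketch, assuming the trusted checksum of the official package is already available as a hex string (where it comes from is not covered here, and the function names are made up):

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rebuilt_package_is_trusted(rebuilt_path, trusted_md5):
    """True only if the rebuilt package is byte-identical to the official one."""
    return md5_of_file(rebuilt_path) == trusted_md5.strip().lower()
```

If this returns False, the rebuilt file would be thrown away and the full package downloaded instead, so a bad diff costs bandwidth but nothing else.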


#10 2004-07-14 13:54:05

cjdj
Member
From: Perth, Western Australia
Registered: 2004-05-07
Posts: 121

Re: package diffs

Since the diffpkgs are built by an automated script, my bigger concern is that it does it all correctly.  Since the original packages do not actually contain any md5sums in them, I can only assume that the packages I get are correct.

I've been thinking about how to check that the merged package is equivalent to the original, but don't really have a good solution to that yet.

As for ensuring the pedigree of it, once this proof of concept has been proven, I would think that it would be best to put it all on the Arch servers.  I have no issue with hosting it, but it should be somewhere that is trusted.


#11 2004-07-14 14:01:10

cjdj
Member
From: Perth, Western Australia
Registered: 2004-05-07
Posts: 121

Re: package diffs

Just to clear up some confusion.

The concept I outlined is to use a diffpkg to create a new (equivalent) package.  The diffpkg itself is not used to upgrade your system.

For example:
1. You have the current package installed on your system. 

2. A copy of it is in your cache. 

3. A new package comes out. 

4. When you do pacman -Syu it tells you that there is a new package and asks you if you want to upgrade.

5. pacman then calls pacdiff to get the package (same way that you would use wget to get the package). 

6. pacdiff downloads the diffpkg file for the new upgraded package and builds the new package by merging the cached package and the diffpkg together.  If there is no diffpkg for it, then it will download the whole package.

7. pacdiff tells pacman that the package downloaded correctly.

8. pacman then installs the package.


So this means that your mirrors listed in pacman.conf are still correct.  pacdiff does not need to do anything funky with that.  It is given a URL to a package.  It tries to find a diffpkg instead; if none exists, then it goes ahead and downloads from that original URL.  If it did find a diffpkg, then it uses that to build a new package that should be close to identical to the one that would have been downloaded in the first place.
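
The decision in steps 5 and 6 boils down to two checks. A Python sketch, where `plan_download` and the `diff_index` mapping (new package file name to the package file the diffpkg was generated from) are hypothetical and only stand in for whatever pacdiff actually looks up:

```python
def plan_download(new_package, cached_files, diff_index):
    """Decide whether to fetch a diffpkg or the full package.

    Returns ("diffpkg", base_package) when a diffpkg exists and the package it
    was generated from is in the cache, otherwise ("full", None).
    """
    base_package = diff_index.get(new_package)
    if base_package is None:
        # No diffpkg was published for this package.
        return ("full", None)
    if base_package not in cached_files:
        # The cache is empty or holds a different version than the diffpkg
        # was built against, so the diffpkg cannot be applied.
        return ("full", None)
    return ("diffpkg", base_package)
```

Whatever the outcome, pacman ends up with an ordinary package file in the cache and carries on as usual.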


#12 2004-07-14 14:13:30

cjdj
Member
From: Perth, Western Australia
Registered: 2004-05-07
Posts: 121

Re: package diffs

OK, some real test results.

These are some new packages that were updated recently.

size (bytes) file
------------ ----------------------------------
85516        gphoto2-2.1.4-1.pkg.tar.gz
85179        gphoto2-2.1.4-2.pkg.tar.gz
30362        gphoto2-2.1.4-2.diffpkg.tar.gz

1298746      libgphoto2-2.1.4-1.pkg.tar.gz
1304847      libgphoto2-2.1.4-2.pkg.tar.gz
956611       libgphoto2-2.1.4-2.diffpkg.tar.gz

21922        libusb-0.1.8-3.pkg.tar.gz
22814        libusb-0.1.8-4.pkg.tar.gz
10002        libusb-0.1.8-4.diffpkg.tar.gz

9320179      neverball-1.2.5-1.pkg.tar.gz
14771539     neverball-1.3.3-1.pkg.tar.gz
12303153     neverball-1.3.3-1.diffpkg.tar.gz

There were a couple more packages that had less than a 10% reduction so they were automatically discarded (mlterm and mysql-python).


#13 2004-07-14 14:42:52

lanrat
Member
From: Poland
Registered: 2003-10-28
Posts: 1,274

Re: package diffs

Looks like it works well for the release updates (first 3 examples) and not that well for version updates (last example) - which is exactly what could be expected IMO.


#14 2004-07-15 03:41:44

cjdj
Member
From: Perth, Western Australia
Registered: 2004-05-07
Posts: 121

Re: package diffs

i3839 wrote:

The only tricky part is making the resulting package exactly the same as the original package, because you only have the md5sum of the tgz. Compression isn't really the problem, because it's deterministic with the same compression options. The main problem, I think, is getting the files into the right order in the tar. If tar can sort the files then it's solved.

This part has indeed proven to be a bit tricky.  Some packages are being re-created with the exact same md5sum as the original, but others aren't.  So I'm still working on it.  I had not realised that a link/unlink combination to move a file will sometimes result in the user/group/mode being set exactly the same way, and other times not.  I originally used the rename function, but the md5sums were way off.  That might have been caused by something else I have since fixed, so I'm going to try rename again.  Oh, and I haven't tried this on a package with symbolic links in it.  I don't know what it would do with those yet.

I'm sure it's something simple.  Getting real close.


#15 2004-07-15 18:08:46

apeiro
Daddy
From: Victoria, BC, Canada
Registered: 2002-08-12
Posts: 771
Website

Re: package diffs

That's a neat idea, cjdj.  I like the fact that you've been able to plug it into pacman without any source-code hacks or server-side modifications.

Cool!  :)


#16 2004-07-15 18:10:26

apeiro
Daddy
From: Victoria, BC, Canada
Registered: 2002-08-12
Posts: 771
Website

Re: package diffs

What if the user's local package version is two (or more) revisions old?  Does the client just download the full package then?


#17 2004-07-15 20:38:22

i3839
Member
Registered: 2004-02-04
Posts: 1,185

Re: package diffs

Yes, if the local package is too old then the full package should be downloaded.

The checksum problems:

1) The stat data must be the same. Because you start from the old package, the files will certainly have different times.
2) The files must be in exactly the same order in the tar file.
3) Maybe the same gzip version must be used as the one the package was made with.

1 can probably be solved by putting the stat data of all files in the package into a metadata file in the diff and using touch with that. 2 can be solved similarly, with a special file as part of the diff which has the order of all files in the package; this can be combined with solving 1. 3 is hopefully not a real problem, and if it is, then the gzip version can be written into the diff metadata file.
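
A Python sketch of points 1 and 2, assuming the metadata file has already been turned into an ordered list of entries. The entry layout, the function name, the root ownership and the GNU tar format are all assumptions made for the example. It shows that fixing the order and the stat data gives repeatable output from the same tool, but it does not promise a byte-for-byte match with a package made by another tar or gzip, which is point 3.

```python
import gzip
import io
import tarfile


def build_tar_gz(entries, out_path):
    """Write entries to a .tar.gz in the given order with the given stat data.

    entries is an ordered list of dicts with the keys "name", "data" (bytes),
    "mode" and "mtime".
    """
    with open(out_path, "wb") as raw:
        # filename="" and mtime=0 keep the output file name and the current
        # time out of the gzip header.
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
                for entry in entries:
                    info = tarfile.TarInfo(name=entry["name"])
                    info.size = len(entry["data"])
                    info.mode = entry["mode"]
                    info.mtime = entry["mtime"]
                    # Assumption: package files are owned by root.
                    info.uid = info.gid = 0
                    info.uname = info.gname = "root"
                    tar.addfile(info, io.BytesIO(entry["data"]))
```

Building twice from the same entry list should give identical bytes, and therefore the same md5sum, while changing the order or any mtime changes the result.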

An idea is to just do a (real) diff -a between the original tar and the tar you get after adding your diff to the old package. If you have solved the file order problem then the only difference will be the file headers which contain the different stat info and a lot of zeros. This could solve problem 1 in an easy way.

Fixing these problems causes a lot of overhead, but none of the above would be needed if the official packages were made diff friendly. Basically they would be made the same way as happens now when you do a diff update; the result can be tarred and all problems are solved at once. The disadvantage is that making a new package is more work, or more complicated when automated, and that the diff system isn't totally independent anymore.

All in all it's easier to use separate checksums and just trust cjdj.
