No intermediary deltas are needed: zsync applies the rsync algorithm via HTTP range requests, and it produces an exact match after download (verified against the database signatures).
Treating the databases as binary input files (-Z) gives a 23% saving on extra.db.tar.gz and a 90% saving on community.db.tar.gz. The savings would be more consistent if this were applied to the uncompressed files (zsyncmake doesn't like the bsdtar'd archives...).
zsync-gen.sh
#!/bin/bash
for repo in core extra community; do
    pushd "/path/to/$repo/x86_64"
    rm -f ./*.zsync
    zsyncmake -Z -u "$repo.db" "$repo.db"
    popd
done
Resulting file sizes
ls -lh */*/*.zsync
-rw-r--r-- 1 jonathon jonathon 19K Mar 22 21:15 community/x86_64/community.db.zsync
-rw-r--r-- 1 jonathon jonathon 633 Mar 22 21:15 core/x86_64/core.db.zsync
-rw-r--r-- 1 jonathon jonathon 5.7K Mar 22 21:15 extra/x86_64/extra.db.zsync
zup.sh
#!/usr/bin/bash
set -euo pipefail
set -x
tempdir=$(mktemp -d)
pushd "$tempdir"
trap 'popd; rm -fr "$tempdir"' EXIT
cp /var/lib/pacman/sync/*.db ./
for repo in core extra community; do
    if ! zsync2 -jq "https://repo.m2x.dev/current/$repo/x86_64/$repo.db.zsync"; then
        zsync2 "https://repo.m2x.dev/current/$repo/x86_64/$repo.db.zsync"
        wget "https://repo.m2x.dev/current/$repo/x86_64/$repo.db.sig" || true
        sudo cp -f "$repo.db"{,.sig} /var/lib/pacman/sync/
    fi
done
$ ./zup.sh
++ mktemp -d
+ tempdir=/tmp/tmp.ln8doSdqmM
+ cp -a /var/lib/pacman/sync/community.db /var/lib/pacman/sync/community-testing.db /var/lib/pacman/sync/core.db /var/lib/pacman/sync/extra.db /var/lib/pacman/sync/multilib.db /var/lib/pacman/sync/multilib-testing.db /var/lib/pacman/sync/testing.db /tmp/tmp.ln8doSdqmM/
+ cd /tmp/tmp.ln8doSdqmM
+ for repo in core extra community
+ zsync2 -jq https://$MIRROR/core/x86_64/core.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
+ echo 1
1
+ for repo in core extra community
+ zsync2 -jq https://$MIRROR/extra/x86_64/extra.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
+ zsync2 https://$MIRROR/extra/x86_64/extra.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
Checking for changes...
/tmp/tmp.ln8doSdqmM/extra.db found, using as seed file
Target file: /tmp/tmp.ln8doSdqmM/extra.db
Reading seed file: /tmp/tmp.ln8doSdqmM/extra.db
Usable data from seed files: 23.655914%
Renaming temp file
Fetching remaining blocks
Downloading from https://$MIRROR/extra/x86_64/extra.db
####---------------- 23.8% 1.4 kBps
####---------------- 23.8% 4.1 kBps
#######------------- 35.1% 144.7 kBps
########------------ 41.9% 696.3 kBps
###########--------- 57.7% 432.8 kBps
#################--- 86.7% 800.1 kBps
###################- 95.9% 917.5 kBps
#################### 100.0% 950.3 kBps DONE
Verifying downloaded file
checksum matches OK
used 450560 local, fetched 1452298
+ wget https://$MIRROR/extra/x86_64/extra.db.sig
--2021-03-22 20:14:50-- https://$MIRROR/extra/x86_64/extra.db.sig
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving $MIRROR ($MIRROR)... $IP6, $IP4
Connecting to $MIRROR ($MIRROR)|$IP6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584 [application/octet-stream]
Saving to: ‘extra.db.sig’
extra.db.sig 100%[================================================================================================>] 584 --.-KB/s in 0s
2021-03-22 20:14:50 (33.8 MB/s) - ‘extra.db.sig’ saved [584/584]
+ sudo cp -f extra.db extra.db.sig /var/lib/pacman/sync/
[sudo] password for jonathon:
+ for repo in core extra community
+ zsync2 -jq https://$MIRROR/community/x86_64/community.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
+ zsync2 https://$MIRROR/community/x86_64/community.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
Checking for changes...
/tmp/tmp.ln8doSdqmM/community.db found, using as seed file
Target file: /tmp/tmp.ln8doSdqmM/community.db
Reading seed file: /tmp/tmp.ln8doSdqmM/community.db
Usable data from seed files: 90.540541%
Renaming temp file
Fetching remaining blocks
Downloading from https://$MIRROR/community/x86_64/community.db
##################-- 90.5% 0.0 kBps
###################- 96.5% 108.3 kBps
#################### 100.0% 344.1 kBps DONE
Verifying downloaded file
checksum matches OK
used 5763072 local, fetched 601231
+ wget https://$MIRROR/community/x86_64/community.db.sig
--2021-03-22 20:15:01-- https://$MIRROR/community/x86_64/community.db.sig
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving $MIRROR ($MIRROR)... $IP6, $IP4
Connecting to $MIRROR ($MIRROR)|$IP6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584 [application/octet-stream]
Saving to: ‘community.db.sig’
community.db.sig 100%[================================================================================================>] 584 --.-KB/s in 0s
2021-03-22 20:15:01 (41.3 MB/s) - ‘community.db.sig’ saved [584/584]
+ sudo cp -f community.db community.db.sig /var/lib/pacman/sync/
+ rm -fr /tmp/tmp.ln8doSdqmM
$ sudo pacman -Syu
:: Synchronising package databases...
testing is up to date
core is up to date
extra is up to date
community-testing is up to date
community is up to date
multilib-testing is up to date
multilib is up to date
:: Starting full system upgrade...
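As a sanity check on the percentages quoted at the top, the "used N local, fetched M" lines in the zsync2 log above give the reuse ratios directly (numbers copied from the log, integer arithmetic only):

```shell
# Reuse ratio = local bytes / (local + fetched bytes), from the log above.
extra_local=450560;  extra_fetched=1452298
comm_local=5763072;  comm_fetched=601231
echo "extra:     $(( 100 * extra_local / (extra_local + extra_fetched) ))% reused"
echo "community: $(( 100 * comm_local  / (comm_local  + comm_fetched)  ))% reused"
# -> 23% and 90%, matching the savings claimed at the start of the post.
```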
(Edit: post number nice)
xdelta3 was used for the package deltas, and there was an RCE vulnerability in that code. That is an issue to consider as well.
Decompression can also break database signatures if the compression algorithm is not reproducible.
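A quick way to see the reproducibility problem: gzip stores a timestamp in its header, so compressing identical input at different times yields different bytes, and a detached signature made over one stream won't verify against the other. (A minimal sketch; throwaway temp files, and the input text is made up.)

```shell
# Same input, compressed twice a second apart: headers differ.
printf 'example repo data' | gzip -c  > /tmp/a.gz; sleep 1
printf 'example repo data' | gzip -c  > /tmp/b.gz
cmp -s /tmp/a.gz /tmp/b.gz || echo "timestamped streams differ"
# gzip -n omits the name/timestamp, making the output reproducible:
printf 'example repo data' | gzip -cn > /tmp/c.gz
printf 'example repo data' | gzip -cn > /tmp/d.gz
cmp -s /tmp/c.gz /tmp/d.gz && echo "reproducible streams match"
```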
Anyway, the current database format is really not a good fit for incremental updates. It would have to be changed to something more suited to the task.
Edit: I imagine the savings would probably be much greater if package deltas were reimplemented and used, even while redownloading the whole database a few times per day.
$ ls -lh core.* extra.* community.*
-rw-r--r-- 1 jmcclure jmcclure 5.5M Mar 22 10:09 community.db
-rw-r--r-- 1 jmcclure jmcclure 5.4M Mar 22 10:11 community.patch
-rw-r--r-- 1 jmcclure jmcclure 130.9K Mar 22 10:09 core.db
-rw-r--r-- 1 jmcclure jmcclure 127.8K Mar 22 10:12 core.patch
-rw-r--r-- 1 jmcclure jmcclure 1.6M Mar 22 10:09 extra.db
-rw-r--r-- 1 jmcclure jmcclure 1.6M Mar 22 10:12 extra.patch
Hardly any detectable saving in what needs to be downloaded. Removing the 'h' flag to get the non-rounded numbers shows that community.patch is only 1.8% smaller than the full community.db. And again, that's against a database from just 20 hours ago!
Generating these diffs on hardware that likely has higher specs than many mirrors takes *far* longer than downloading the original db would, even over pretty limited connections. So the bandwidth saving is in the range of a couple of percent, the time saving is non-existent as the download will take *much* longer, and the resource demands on the mirror servers are significant. Unless there is a totally different approach to this, it looks like everyone loses with delta databases.
You could argue that the server should not generate these on the fly, but rather store all pairwise diffs. That would eliminate most of the processing load on the mirror server and bring the download time down to just the actual transfer. But we're still saving only a couple of percent of bandwidth, and now the server has to store a substantial number of patches: *many* per day, for... how many days should they be kept?
I don't know how many times per day a database is changed, but it's not trivial. And to support delta databases with stored patch files going back even a week or so, I imagine we'd need several GB of storage just for the db patch files (if not more).
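A back-of-envelope for that storage estimate. All three inputs are assumptions, not measurements: updates per day and retention window are guesses, and the patch size is taken from the community.patch listing above. Pairwise patches between every state would multiply this further.

```shell
# Rough storage estimate for a week of stored db patches (all numbers
# are illustrative assumptions, not measured):
updates_per_day=50
avg_patch_kb=5600      # ~5.5M, roughly the community.patch size above
retention_days=7
echo "$(( updates_per_day * avg_patch_kb * retention_days / 1024 )) MiB"
# ~1.9 GiB at these assumptions, for a single linear patch chain.
```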
EDIT: see below on why this may not be the best test (in hindsight, the failure to get a reasonable patch from compressed data should have been obvious). xdelta3 indeed produces *much* smaller diffs, and quite a bit faster. So I suppose much of this critique is invalid, but I'd still question the value given the costs.
What I mean is that we ask the mirror to give us only the updates since a given update id: not as lots of small files, but as one file, just like it gives now; except that file is not the entire db, only the difference since that update id.
Now you need a dynamic mirror instead of a static one, and you have to balance the mirror load caused by archive creation against the bandwidth saved. Also add the time required to generate the diff, which may have a negative impact for small databases. The storage space required is also greater, but probably insignificant compared to the packages.
The result would be something like git, but maybe with a way to truncate the old history.
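The "something like git" idea can be sketched locally. Everything here is hypothetical: the repo path, the file names, and the mapping of "update id" to a commit hash; it just illustrates how a client holding an old id could receive only the difference.

```shell
# Minimal local sketch of "db history as git" (all names hypothetical):
rm -rf /tmp/dbrepo && git init -q /tmp/dbrepo && cd /tmp/dbrepo
echo 'examplepkg 1.0-1' > core.desc
git add . && git -c user.name=x -c user.email=x@x commit -qm 'sync 1'
update_id=$(git rev-parse HEAD)    # the client records this after syncing
echo 'examplepkg 1.1-1' > core.desc
git add . && git -c user.name=x -c user.email=x@x commit -qm 'sync 2'
git diff "$update_id" HEAD         # the mirror would serve just this delta
# "Truncating the old history" would amount to periodically rewriting
# the root commit, or only ever serving shallow clones.
```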
Do you know how many times a day the Arch package databases are updated? You would end up having to download hundreds of small files, which would be more effort than downloading the entire database. Removing signatures from the databases would be much better. Perhaps even reorganising the databases so that you don't need to download (e.g.) the entire haskell package list when you have no intention of using it.
Nah, I didn't mean downloading tons of small files.
What I mean is that we ask the mirror to give us only the updates since a given update id: not as lots of small files, but as one file, just like it gives now; except that file is not the entire db, only the difference since that update id.
Conflicts and issues with old mirrors and clients? At least it could be optional, or behind a command switch, like -SyuZ instead of -Syu (Z being a switch pacman doesn't currently use). That way the mirror can know we support delta updates, and we can know whether the mirror supports the feature (via the corresponding parameter in our request, triggered by -Z).
But why would you want to use it for the database files? They are tiny anyway.
Sure they are tiny, but 10 KB is even tinier than 6 MB. It also saves bandwidth on the server side: the server can serve ~100 KB to 60 people instead of 6 MB to one person.
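The arithmetic behind that claim, using the figures from the post (~6 MB full db, ~100 KB delta, 60 clients); the numbers are the thread's examples, not measurements:

```shell
# Bandwidth for one full sync vs 60 delta syncs:
full_kb=6144; delta_kb=100; clients=60
echo "one full download:  $full_kb KB"
echo "60 delta downloads: $(( delta_kb * clients )) KB"
# 60 clients on deltas cost roughly the same bandwidth as 1 full sync.
```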
There was also some discussion on the arch-dev-public mailing list about using detached signatures for pacman's database files, which would reduce their size considerably (an 80% reduction)... I have not followed it up. Link to the discussion below.
That can happen after pacman-6.0 is released. Or probably 6.1 where that will be the default for databases.
It sounds like you are looking for delta upgrades, which exist (in some state; I've not followed it) for *packages*.
That used to exist, it was removed in pacman 5.2
http://allanmcrae.com/2019/10/pacman-5-2-release/
This is also a burden on the server: it must either A) store a patch from each of a large number of previous states to the current one, or B) generate a new patch on the fly every time someone syncs their databases.
When you write "mirror list", I believe you are really talking about the package DB files in /var/lib/pacman/sync, no? Because mirror lists are something else…
If yes, please have a look at how the package DB files are built up. They do not contain a "list of updates", but just a list of package metadata for each package currently part of that repository.
And when you are mentioning "hashes" and "changes", I believe you have an entirely different concept in mind (for instance… Git commits?), but as mentioned, package DBs do not describe updates or "change logs"; they simply describe repository content.
I guess it could be possible to save bandwidth by introducing a mechanism that somehow checks only for changes, but that would introduce a lot of complexity into pacman and its DB synchronisation procedure.
I think using rsync mirrors would be a much better approach: it's a mechanism that already exists, and it delegates the issue of saving network bandwidth to a dedicated tool (rsync), thereby keeping the individual components simple—in good old Unix fashion.
The whole idea is fetching only the changes since the last fetch, not the entire package DB.
If yes, please have a look at how the package DB files are built up. They do not contain a "list of updates", but just a list of package metadata for each package currently part of that repository.
And when you are mentioning "hashes" and "changes", I believe you have an entirely different concept in mind (for instance… Git commits?), but as mentioned, package DBs do not describe updates or "change logs"; they simply describe repository content.
I guess it could be possible to save bandwidth by introducing a mechanism that somehow checks only for changes, but that would introduce a lot of complexity into pacman and its DB synchronisation procedure.
I think using rsync mirrors would be a much better approach: it's a mechanism that already exists, and it delegates the issue of saving network bandwidth to a dedicated tool (rsync), thereby keeping the individual components simple—in good old Unix fashion.