
#1 2021-03-21 11:47:12

erfanjoker
Member
From: Tabriz / Iran
Registered: 2017-03-26
Posts: 168
Website

Pacman and AUR Helpers Improvement

I wasn't sure where to create this topic; I hope this is the right place.

In normal use, when we run pacman -Syu or an AUR helper such as yay -Syu, it checks whether the package DBs have changed since our last -Syu. If they have, it re-downloads each changed package DB from the mirror in full, for example 130KB for core, 1.6MB for extra, or 6MB for community.

Now my question is: why the entire package DB? Wouldn't it be better if each change in the package DB had a hash or id? When we -Syu, we would save the id of the last update; next time, we would send that id with our request, and the server would give us only the diff between our last update and what is available now, instead of the entire package DB. This would save traffic on the client side and bandwidth on the mirror side, so you don't have to re-download a 6MB package DB for a 2KB diff!

Last edited by erfanjoker (2021-03-22 15:05:16)

Offline

#2 2021-03-21 12:13:38

ayekat
Member
Registered: 2011-01-17
Posts: 1,435
Website

Re: Pacman and AUR Helpers Improvement

When you write "mirror list", I believe you are really talking about the package DB files in /var/lib/pacman/sync, no? Because mirror lists are something else…

If yes, please have a look at how the package DB files are built up. They do not contain a "list of updates", but just a list of package metadata for each package currently part of that repository.
And when you mention "hashes" and "changes", I believe you have an entirely different concept in mind (for instance… Git commits?), but as mentioned, package DBs do not describe updates or "change logs"; they simply describe repository content.

I guess it could be possible to save bandwidth by introducing a mechanism that somehow checks only for changes, but that would introduce a lot of complexity into pacman and its DB synchronisation procedure.
I think using rsync mirrors would be a much better approach: it's a mechanism that already exists, and it delegates the issue of saving network bandwidth to a dedicated tool (rsync), thereby keeping the individual components simple, in good old Unix fashion. :)


{,META,RE}PKGBUILDS │ pacman-hacks (includes makemetapkg and remakepkg) │ dotfiles

Offline

#3 2021-03-21 12:40:24

erfanjoker
Member
From: Tabriz / Iran
Registered: 2017-03-26
Posts: 168
Website

Re: Pacman and AUR Helpers Improvement

ayekat wrote:

When you write "mirror list", I believe you are really talking about the package DB files in /var/lib/pacman/sync, no? Because mirror lists are something else…

If yes, please have a look at how the package DB files are built up. They do not contain a "list of updates", but just a list of package metadata for each package currently part of that repository.
And when you mention "hashes" and "changes", I believe you have an entirely different concept in mind (for instance… Git commits?), but as mentioned, package DBs do not describe updates or "change logs"; they simply describe repository content.

I guess it could be possible to save bandwidth by introducing a mechanism that somehow checks only for changes, but that would introduce a lot of complexity into pacman and its DB synchronisation procedure.
I think using rsync mirrors would be a much better approach: it's a mechanism that already exists, and it delegates the issue of saving network bandwidth to a dedicated tool (rsync), thereby keeping the individual components simple, in good old Unix fashion. :)

The whole idea is to fetch only the changes since the last fetch, not the entire package DB.

Offline

#4 2021-03-21 13:46:14

ayekat
Member
Registered: 2011-01-17
Posts: 1,435
Website

Re: Pacman and AUR Helpers Improvement

Yes… so what's lacking with rsync?


{,META,RE}PKGBUILDS │ pacman-hacks (includes makemetapkg and remakepkg) │ dotfiles

Offline

#5 2021-03-21 13:50:55

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 25,180

Re: Pacman and AUR Helpers Improvement

It sounds like you are looking for delta upgrades, which exist (in some state; I've not followed it) for *packages*, but why would you want to use them for the database files?  They are tiny anyways.

This is also a burden on the server to either A) store a patch for the current state against a large number of previous states, or B) generate a new patch on the fly every time someone syncs their databases.

Last edited by Trilby (2021-03-21 13:52:13)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#6 2021-03-21 14:06:24

progandy
Member
Registered: 2012-05-17
Posts: 4,227

Re: Pacman and AUR Helpers Improvement

Trilby wrote:

It sounds like you are looking for delta upgrades, which exist (in some state; I've not followed it) for *packages*

That used to exist, it was removed in pacman 5.2
http://allanmcrae.com/2019/10/pacman-5-2-release/


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Offline

#7 2021-03-21 18:08:54

a821
Member
Registered: 2012-10-31
Posts: 240

Re: Pacman and AUR Helpers Improvement

There was also some discussion on the arch-dev-public mailing list: using detached signatures for pacman's database files would reduce their size considerably (80% reduction)... I have not followed it up. Link to discussion below.

https://lists.archlinux.org/pipermail/a … 30024.html

Offline

#8 2021-03-21 21:59:33

Allan
Member
From: Brisbane, AU
Registered: 2007-06-09
Posts: 11,001
Website

Re: Pacman and AUR Helpers Improvement

a821 wrote:

There was also some discussion on the arch-dev-public mailing list: using detached signatures for pacman's database files would reduce their size considerably (80% reduction)... I have not followed it up. Link to discussion below.

That can happen after pacman-6.0 is released.   Or probably 6.1 where that will be the default for databases.

Offline

#9 2021-03-22 12:47:48

erfanjoker
Member
From: Tabriz / Iran
Registered: 2017-03-26
Posts: 168
Website

Re: Pacman and AUR Helpers Improvement

Trilby wrote:

but why would you want to use it for the database files?  They are tiny anyways.

Sure, they are tiny, but 10KB is even tinier than 6MB. It also saves bandwidth on the server side: the server can serve ~100KB each to 60 people instead of 6MB to 1 person.

Offline

#10 2021-03-22 12:57:47

erfanjoker
Member
From: Tabriz / Iran
Registered: 2017-03-26
Posts: 168
Website

Re: Pacman and AUR Helpers Improvement

The whole concept is:
* Each change in the package DB carries an id; an auto-increment id would probably be best
* Each time we update our package DB, we save the id of the last update
* Next time, we ask only for updates newer than our saved id, so the mirror returns only the delta, not the entire database
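
A minimal sketch of these three steps, in Python. Everything in it is hypothetical: pacman's databases carry no change log or update ids, and the names Change, changes_since and apply_changes, as well as the package names and versions, are made up for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Change(NamedTuple):
    update_id: int           # auto-increment id assigned by the (hypothetical) repo
    name: str                # package name
    version: Optional[str]   # new version, or None if the package was removed


def changes_since(log: List[Change], last_id: int) -> List[Change]:
    """What a delta-aware mirror would return to a client that is at last_id."""
    return [c for c in log if c.update_id > last_id]


def apply_changes(db: Dict[str, str], delta: List[Change],
                  last_id: int) -> Tuple[Dict[str, str], int]:
    """Apply a delta to a local name -> version map; return the new map and id."""
    new_db = dict(db)
    for c in sorted(delta, key=lambda c: c.update_id):
        if c.version is None:
            new_db.pop(c.name, None)      # package dropped from the repo
        else:
            new_db[c.name] = c.version    # package added or upgraded
        last_id = max(last_id, c.update_id)
    return new_db, last_id


# Made-up change log on the mirror side.
log = [
    Change(1, "foo", "1.0-1"),
    Change(2, "bar", "2.0-1"),
    Change(3, "foo", "1.1-1"),
    Change(4, "bar", None),
]

# A client that last synced at id 2 only receives changes 3 and 4.
local, saved_id = apply_changes({"foo": "1.0-1", "bar": "2.0-1"},
                                changes_since(log, 2), 2)
print(local, saved_id)
```

Note that the mirror has to keep the change log and filter it per request, so it can no longer be a plain static file server.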

Conflicts and issues with old mirrors and clients? It could at least be optional, enabled by a switch such as -SyuZ instead of -Syu (Z being a switch that pacman does not use). That way the mirror knows we support delta updates, and we can tell whether the mirror supports the feature by how it responds to the parameter that -Z adds to our request.

Last edited by erfanjoker (2021-03-22 12:58:50)

Offline

#11 2021-03-22 13:22:30

Allan
Member
From: Brisbane, AU
Registered: 2007-06-09
Posts: 11,001
Website

Re: Pacman and AUR Helpers Improvement

Do you know how many times a day the Arch package databases are updated? You would end up having to download hundreds of small files, which would be more effort than downloading the entire database.  Removing signatures from the databases would be much better.  Perhaps even reorganising databases such that you don't need to download the (e.g.) entire haskell package list when you have no intention of using it.

Offline

#12 2021-03-22 15:03:38

erfanjoker
Member
From: Tabriz / Iran
Registered: 2017-03-26
Posts: 168
Website

Re: Pacman and AUR Helpers Improvement

Allan wrote:

Do you know how many times a day the Arch package databases are updated? You would end up having to download hundreds of small files, which would be more effort than downloading the entire database.  Removing signatures from the databases would be much better.  Perhaps even reorganising databases such that you don't need to download the (e.g.) entire haskell package list when you have no intention of using it.

Nah, I didn't mean downloading tons of small files.
What I mean is that we ask the mirror to give us only the updates since a given update id, not as many small files but as one file, as it does now; that file is just not the entire DB, only the difference since that update id.

Offline

#13 2021-03-22 15:07:06

progandy
Member
Registered: 2012-05-17
Posts: 4,227

Re: Pacman and AUR Helpers Improvement

erfanjoker wrote:

What I mean is that we ask the mirror to give us only the updates since a given update id, not as many small files but as one file, as it does now; that file is just not the entire DB, only the difference since that update id.

Now you need a dynamic mirror instead of a static one, and you have to balance the mirror load caused by the archive creation against the bandwidth used. Also add the time required to generate the diff, which may have a negative impact for small databases. The storage space required is also greater, but probably insignificant compared to the packages.

The result would be something like git, but maybe with a way to truncate the old history.

Last edited by progandy (2021-03-22 15:11:27)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Offline

#14 2021-03-22 15:17:31

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 25,180

Re: Pacman and AUR Helpers Improvement

FYI, I just copied my existing sync dbs (for core, extra, and community) to /tmp/ - these were from an update just over 20 hours ago.  I then ran an upgrade and `bsdiffed` the old and new to a <repo>.patch file for each.  As an aside, this is a non-trivial process, especially for the community.db: it took a good 10 seconds or so on my hardware to generate that patch; there may be tools better at this than bsdiff, so this may not be that important.  But what is important is the result:

$ ls -lh core.* extra.* community.*
-rw-r--r--    1 jmcclure jmcclure    5.5M Mar 22 10:09 community.db
-rw-r--r--    1 jmcclure jmcclure    5.4M Mar 22 10:11 community.patch
-rw-r--r--    1 jmcclure jmcclure  130.9K Mar 22 10:09 core.db
-rw-r--r--    1 jmcclure jmcclure  127.8K Mar 22 10:12 core.patch
-rw-r--r--    1 jmcclure jmcclure    1.6M Mar 22 10:09 extra.db
-rw-r--r--    1 jmcclure jmcclure    1.6M Mar 22 10:12 extra.patch

Hardly a detectable savings in what needs to be downloaded.  Removing the 'h' flag to get the non-rounded numbers the community.patch is only 1.8% smaller than the full community.db.  And again, that's from just 20 hours ago!

Generation of these diffs on hardware that likely has higher specs than many mirrors takes *far* longer than downloading the original db likely would for even pretty limited connections.  So the bandwidth savings is in the range of a couple of percent, the time savings is non-existent as the download will take *much* longer, and the resource demands on the mirror servers are significant.  Unless there is a totally different approach to this, it looks like everyone loses with delta-databases.

You could argue that the server would not generate these on the fly, but rather store all pairwise diffs.  This would eliminate most of the processing load on the mirror server and also bring the download time down to just the actual downloading process.  But still we're saving only a couple percent of bandwidth and now the server has to store a substantial number of patches.  *many* per day for ... how many days should they be kept?

I don't know how many times per day a database is changed, but it's not trivial.  And to support these delta-databases with stored patchfiles going back even a week or so, I imagine we'd then need several GB storage just for the db patch files (if not more).
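
As a rough back-of-the-envelope check of that estimate, here is a small Python calculation. The update frequency and retention period are made-up assumptions (the real numbers are not given in this thread); only the 5.4M patch size comes from the bsdiff listing above. It assumes one stored patch per retained previous state, for community alone.

```python
MIB = 1024 * 1024


def patch_storage_bytes(updates_per_day: int, days_kept: int,
                        avg_patch_bytes: float) -> float:
    """Storage for one patch per retained previous state (old state -> current)."""
    return updates_per_day * days_kept * avg_patch_bytes


# ASSUMPTIONS: 50 updates per day and 7 days of history are invented numbers.
# 5.4 MiB is the size of community.patch from the bsdiff listing.
total = patch_storage_bytes(updates_per_day=50, days_kept=7,
                            avg_patch_bytes=5.4 * MIB)
print(f"{total / (1024 * MIB):.2f} GiB")
```

Under those assumptions a single repository already lands in the GB range, and every database update would require regenerating all retained patches.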

EDIT: see below on why this may not be the best test (in hindsight a failure of getting a reasonable patch for compressed data should have been obvious).  Xdelta3 indeed produces *much* smaller diffs and quite a bit faster.  So I suppose much of this critique is invalid, but I'd still question the value given the costs.

Last edited by Trilby (2021-03-22 18:00:40)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#15 2021-03-22 15:35:12

progandy
Member
Registered: 2012-05-17
Posts: 4,227

Re: Pacman and AUR Helpers Improvement

bsdiff has a much better size if you decompress the database first, but it is still very slow (>7 seconds). xdelta3 is much faster (<1s) and does the decompression automatically. The community diff between 2021/03/20 and 2021/03/22 is about 900kB with xdelta3, and 600kB with bsdiff and uncompressed data.
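
To illustrate why diffing the compressed files does so poorly, here is a small Python sketch. It is only an illustration under stated assumptions: the "database" is a fake text listing (not the real pacman db format), zlib stands in for whatever compression the repo uses, and the comparison is a naive byte-by-byte count rather than what bsdiff or xdelta3 actually do.

```python
import hashlib
import zlib


def make_db(changed: bool) -> bytes:
    """Build a fake, text-based package list (NOT the real pacman db format)."""
    lines = []
    for i in range(2000):
        # Exactly one entry near the start gets a new pkgrel in the changed db.
        version = "1.0-2" if (changed and i == 10) else "1.0-1"
        digest = hashlib.sha256(f"pkg{i}".encode()).hexdigest()
        lines.append(f"pkg{i} {version} {digest}\n")
    return "".join(lines).encode()


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


old, new = make_db(False), make_db(True)
old_z, new_z = zlib.compress(old), zlib.compress(new)

print("plain:     ", differing_bytes(old, new), "of", len(new), "bytes differ")
print("compressed:", differing_bytes(old_z, new_z), "of", len(new_z), "bytes differ")
```

The uncompressed listings differ in a single byte. In the compressed streams the difference typically spreads over a much larger share of the file, since the encoding of everything after the change can shift; the exact numbers depend on the zlib build, so none are shown here.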

xdelta3 was used for the package deltas, and there was an RCE vulnerability in the code. That is an issue to consider as well.

The decompression can cause issues with database signatures, if the compression algorithm is not reproducible.

Anyways, the current database format is really not a good fit for incremental updates. It would have to be changed to something more suited for the task.
Edit: I imagine the savings would probably be much greater if package deltas were reimplemented and used, even while redownloading the whole database a few times per day.

Last edited by progandy (2021-03-22 15:44:12)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Offline

#16 2021-03-22 20:56:42

jonathon
Member
Registered: 2016-09-19
Posts: 90

Re: Pacman and AUR Helpers Improvement

I've been playing with zsync for a couple of weeks, and this seems like another potential application so I put together a very quick-and-dirty PoC.

No intermediary deltas are needed, it applies the rsync algorithm via HTTP range requests. It produces an exact match after download (tested with database signatures).
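
For readers unfamiliar with the idea, here is a simplified Python sketch of that kind of block matching. It is not zsync's actual implementation or file format: the function names are made up, the block size is tiny for readability, and it hashes every window of the local file with SHA-256 where real tools use a cheap rolling checksum first. The fetch callback stands in for an HTTP range request.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

BLOCK = 8  # tiny block size for the demo; real tools use much larger blocks


def block_hashes(data: bytes, block: int = BLOCK) -> List[str]:
    """What the server would publish: one strong hash per block of the target."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]


def plan(seed: bytes, target_hashes: List[str], target_len: int,
         block: int = BLOCK) -> List[Tuple[str, object]]:
    """Per target block: ("seed", offset) to copy locally, or
    ("fetch", (start, end)) with an inclusive byte range to download."""
    # Index every window of the local seed file (naive, no rolling checksum).
    have: Dict[str, int] = {}
    for off in range(max(len(seed) - block + 1, 1)):
        have.setdefault(hashlib.sha256(seed[off:off + block]).hexdigest(), off)
    steps: List[Tuple[str, object]] = []
    for n, h in enumerate(target_hashes):
        start = n * block
        end = min(start + block, target_len) - 1
        if h in have:
            steps.append(("seed", have[h]))
        else:
            steps.append(("fetch", (start, end)))
    return steps


def rebuild(seed: bytes, steps: List[Tuple[str, object]],
            fetch: Callable[[int, int], bytes], block: int = BLOCK) -> bytes:
    """Assemble the target from local blocks and fetched ranges."""
    out = bytearray()
    for kind, arg in steps:
        if kind == "seed":
            out += seed[arg:arg + block]
        else:
            start, end = arg
            out += fetch(start, end)
    return bytes(out)


old = b"aaaaaaaabbbbbbbbccccccccdddddddd"   # local (seed) file
new = b"aaaaaaaaXXbbbbbbbbccccccccdddddd"   # file on the server
steps = plan(old, block_hashes(new), len(new))
result = rebuild(old, steps, lambda s, e: new[s:e + 1])
print(steps, result == new)
```

In this toy example only one of the four blocks has to be fetched, even though the insertion shifted the rest of the data, because matches are looked up at every offset of the local file.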

Treating them as binary input files (-Z), there is a 23% saving on extra.db.tar.gz and a 90% saving on community.db.tar.gz. There would be more consistent savings if this was applied to the uncompressed files (zsyncmake doesn't like the bsdtar'd archives...).

zsync-gen.sh

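#!/bin/bash
# Generate a .zsync control file next to each repo database.
for repo in core extra community; do
        pushd "/path/to/$repo/x86_64" || exit 1
        rm -f -- *.zsync
        # -Z: treat the input as plain binary data; -u: URL of the file to sync
        zsyncmake -Z -u "$repo.db" "$repo.db"
        popd
done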

Resulting file sizes

ls -lh */*/*.zsync
-rw-r--r-- 1 jonathon jonathon  19K Mar 22 21:15 community/x86_64/community.db.zsync
-rw-r--r-- 1 jonathon jonathon  633 Mar 22 21:15 core/x86_64/core.db.zsync
-rw-r--r-- 1 jonathon jonathon 5.7K Mar 22 21:15 extra/x86_64/extra.db.zsync

zup.sh

#!/usr/bin/bash
set -euo pipefail
set -x

tempdir=$(mktemp -d)
pushd "$tempdir"
# Single quotes: expand $tempdir when the trap fires, and keep it quoted.
trap 'popd; rm -fr "$tempdir"' EXIT

# Copy the current databases so zsync2 can use them as seed files.
cp /var/lib/pacman/sync/*.db ./
for repo in core extra community; do
        if ! zsync2 -jq "https://repo.m2x.dev/current/$repo/x86_64/$repo.db.zsync"; then
                zsync2 "https://repo.m2x.dev/current/$repo/x86_64/$repo.db.zsync"
                wget "https://repo.m2x.dev/current/$repo/x86_64/$repo.db.sig" || true
                sudo cp -f "$repo.db"{,.sig} /var/lib/pacman/sync/
        fi
done

$ ./zup.sh

++ mktemp -d
+ tempdir=/tmp/tmp.ln8doSdqmM
+ cp -a /var/lib/pacman/sync/community.db /var/lib/pacman/sync/community-testing.db /var/lib/pacman/sync/core.db /var/lib/pacman/sync/extra.db /var/lib/pacman/sync/multilib.db /var/lib/pacman/sync/multilib-testing.db /var/lib/pacman/sync/testing.db /tmp/tmp.ln8doSdqmM/
+ cd /tmp/tmp.ln8doSdqmM
+ for repo in core extra community
+ zsync2 -jq https://$MIRROR/core/x86_64/core.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
+ echo 1
1
+ for repo in core extra community
+ zsync2 -jq https://$MIRROR/extra/x86_64/extra.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
+ zsync2 https://$MIRROR/extra/x86_64/extra.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
Checking for changes...
/tmp/tmp.ln8doSdqmM/extra.db found, using as seed file
Target file: /tmp/tmp.ln8doSdqmM/extra.db
Reading seed file: /tmp/tmp.ln8doSdqmM/extra.db
Usable data from seed files: 23.655914%
Renaming temp file
Fetching remaining blocks
Downloading from https://$MIRROR/extra/x86_64/extra.db

####---------------- 23.8% 1.4 kBps         A  
####---------------- 23.8% 4.1 kBps         
#######------------- 35.1% 144.7 kBps           
########------------ 41.9% 696.3 kBps         
###########--------- 57.7% 432.8 kBps         
#################--- 86.7% 800.1 kBps           
###################- 95.9% 917.5 kBps         
#################### 100.0% 950.3 kBps DONE    

Verifying downloaded file
checksum matches OK
used 450560 local, fetched 1452298
+ wget https://$MIRROR/extra/x86_64/extra.db.sig
--2021-03-22 20:14:50--  https://$MIRROR/extra/x86_64/extra.db.sig
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving $MIRROR ($MIRROR)... $IP6, $IP4
Connecting to $MIRROR ($MIRROR)|$IP6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584 [application/octet-stream]
Saving to: ‘extra.db.sig’

extra.db.sig                                  100%[================================================================================================>]     584  --.-KB/s    in 0s      

2021-03-22 20:14:50 (33.8 MB/s) - ‘extra.db.sig’ saved [584/584]

+ sudo cp -f extra.db extra.db.sig /var/lib/pacman/sync/
[sudo] password for jonathon: 
+ for repo in core extra community
+ zsync2 -jq https://$MIRROR/community/x86_64/community.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
+ zsync2 https://$MIRROR/community/x86_64/community.db.zsync
zsync2 version 2.0.0-alpha-1 (commit 86cfd3a), build <local dev build> built on 2021-03-20 21:44:59 UTC
Checking for changes...
/tmp/tmp.ln8doSdqmM/community.db found, using as seed file
Target file: /tmp/tmp.ln8doSdqmM/community.db
Reading seed file: /tmp/tmp.ln8doSdqmM/community.db
Usable data from seed files: 90.540541%
Renaming temp file
Fetching remaining blocks
Downloading from https://$MIRROR/community/x86_64/community.db

##################-- 90.5% 0.0 kBps         
###################- 96.5% 108.3 kBps           
#################### 100.0% 344.1 kBps DONE    

Verifying downloaded file
checksum matches OK
used 5763072 local, fetched 601231
+ wget https://$MIRROR/community/x86_64/community.db.sig
--2021-03-22 20:15:01--  https://$MIRROR/community/x86_64/community.db.sig
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving $MIRROR ($MIRROR)... $IP6, $IP4
Connecting to $MIRROR ($MIRROR)|$IP6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584 [application/octet-stream]
Saving to: ‘community.db.sig’

community.db.sig                              100%[================================================================================================>]     584  --.-KB/s    in 0s      

2021-03-22 20:15:01 (41.3 MB/s) - ‘community.db.sig’ saved [584/584]

+ sudo cp -f community.db community.db.sig /var/lib/pacman/sync/
+ rm -fr /tmp/tmp.ln8doSdqmM

$ sudo pacman -Syu

:: Synchronising package databases...
 testing is up to date
 core is up to date
 extra is up to date                                                                0.0   B  0.00   B/s 00:00 [------------------------------------------------------------------]   0%
 community-testing is up to date
 community is up to date                                                            0.0   B  0.00   B/s 00:00 [------------------------------------------------------------------]   0%
 multilib-testing is up to date
 multilib is up to date
:: Starting full system upgrade...

(Edit: post number nice)

Last edited by jonathon (2021-03-28 12:45:47)

Offline
