Hi, I'm looking for the best way to list all existing Arch Linux packages along with their versions and artifacts.
This is for the Software Heritage project ( https://archive.softwareheritage.org/ ), whose ambition is to collect, preserve, and share all software that is publicly available in source code form.
Basically, I would like to get a list of packages with their versions, something like:
{"name": "pkgname", "versions": ["0.2.1", "0.2.0"]}
Additionally, if there are fields like "last_update", "branch", "target", "archive_name", "archive_url", "checksum", etc., even better!
I can see at least two ways to go. The goal is the lightest and most resilient pattern for staying up to date.
A - Cloning the main repository once, extracting data from the PKGBUILD files, and then regularly running svn/git fetch to get the differential
B - Scraping the JSON API with something like GET https://archlinux.org/packages/search/j … ate&page=1 . For the differential, browse until the previous last_update date
What do you think? Is there another way to stay up to date?
Also, are there any guidelines on scraping and API consumption frequency?
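To make option B concrete, here is a minimal Python sketch of the "browse until the previous last_update date" loop. It assumes the results arrive newest first and that each page is a JSON object with "results" and "num_pages" keys and a "last_update" timestamp per entry; those names, and the fetch_page callable itself, are assumptions for illustration rather than a confirmed description of the API.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def collect_updates(fetch_page, last_visit):
    """Collect entries updated after last_visit, stopping at the first older one.

    fetch_page(n) is a hypothetical callable returning the decoded JSON of
    result page n. The keys "results", "num_pages" and "last_update" are
    assumptions about the response shape. Results must be sorted newest first.
    """
    cutoff = parse_ts(last_visit)
    updates = []
    page = 1
    while True:
        data = fetch_page(page)
        for entry in data["results"]:
            if parse_ts(entry["last_update"]) <= cutoff:
                return updates  # everything from here on was seen last time
            updates.append(entry)
        if page >= data["num_pages"]:
            return updates
        page += 1
```

Because the loop stops at the first entry that is not newer than the last visit, a regular run only fetches as many pages as there are fresh updates.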
Offline
The best way is to extract information from the binary package manager database which is available on any Arch mirror. This is trivial if you installed Arch Linux (just run "pacman -Ss" or "pacman -Si pkgname"). There are also Python bindings which are somewhat more flexible.
Offline
At first glance, running pacman seems impossible to me, but maybe I'm wrong. I will check with the team whether it's an option.
We usually work with APIs, DVCS repositories, plain-text listings, etc., and try to keep bandwidth consumption, HTTP calls, and dependencies as low as we can.
By the way, the project is Python-based, so using the library may be an option too.
If we go the pacman way, is it possible to list all new versions released since a given date, for example?
And sorry for asking so many questions, but what looks bad from your point of view about solution A (i.e. the svn/git repository)? Isn't it a trusted source?
Offline
The largest binary database file, community.db.tar.gz, is just about 7 MB. I don't know how large the svn/git repository is, but my guess is that it's orders of magnitude larger. Also, the binary database is structured, so you wouldn't need to implement your own PKGBUILD parser, for example.
If we go the pacman way, is it possible to list all new released version since a date for example ?
pacman itself does not have such a filter, but the last update timestamp is stored in the database, so you can use it in a custom filter.
Last edited by lahwaacz (2022-04-12 13:41:16)
Offline
The pacman database is a tarball that contains flat files (with package information) that can be easily parsed in Python. I recall that someone even wrote a Python library for that and posted it on this forum.
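As a rough illustration of how little parsing is involved, here is a minimal sketch using only the standard library. It assumes each package has a "desc" member inside the tarball, made of "%FIELD%" header lines followed by one value per line; the function names are made up for this example, and this is not the library mentioned above.

```python
import io
import tarfile


def parse_desc(text):
    """Parse one 'desc' file: '%FIELD%' header lines followed by value lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if len(line) > 2 and line.startswith("%") and line.endswith("%"):
            current = line.strip("%")
            fields[current] = []
        elif line and current is not None:
            fields[current].append(line)
    return fields


def read_sync_db(fileobj):
    """Yield one dict of fields per package found in a repo database tarball."""
    # mode "r:*" lets tarfile detect the compression (gzip, bz2, xz) by itself
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                raw = tar.extractfile(member).read().decode("utf-8")
                yield parse_desc(raw)
```

With a database file downloaded from a mirror, usage would look like `with open("core.db.tar.gz", "rb") as f: packages = list(read_sync_db(f))`. Every field comes back as a list of strings, since some fields can hold several values.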
Offline
If we go the pacman way, is it possible to list all new released version since a date for example ?
The pacman DB only contains details for the currently available version of each package, so you'd have to keep your own history. There is no way to get the history from before you first downloaded the DB.
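Keeping your own history can be as simple as merging each downloaded snapshot into an append-only record. The sketch below assumes the snapshot has already been reduced to a name-to-version mapping; the function name and data layout are illustrative only.

```python
def merge_snapshot(history, snapshot):
    """Record versions from a snapshot that the history has not seen yet.

    history:  dict mapping package name -> list of versions, oldest first
    snapshot: dict mapping package name -> current version string
    Returns the (name, version) pairs that were newly added, sorted by name.
    """
    added = []
    for name, version in sorted(snapshot.items()):
        versions = history.setdefault(name, [])
        if version not in versions:
            versions.append(version)
            added.append((name, version))
    return added
```

Versions that appear and disappear between two downloads are still missed, so the download interval bounds how complete the history can be.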
Offline
Ah, thanks, looks like a good option, will explore this too.
I have the core repository on my disk and, for comparison, it's 36 MB against 156 kB for core.db.tar.gz.
Offline
I doubt you really want literally just the 'core' repository. Arch Linux has one repository actually named "core", but it would never be used on its own (it might be theoretically possible, but it'd not be supported, and that's just not how it's meant to be used). In the more general meaning of the word, Arch Linux has three core repos that are the default or base for the distro: core, extra, and community.
For your "new releases since a given date" question, whether you can answer this with just the repo database depends on exactly what is meant by the question. For example, if you want to know every incremental update to package X that has happened in the last 6 months, that will not be in the database file. However, if you just want to know every package that has been updated within the past 6 months, that *is* in the database. Specifically, the database will include the *current* version of every package along with the date it was built / added to the repo, so this can be used to filter out any packages that have not been updated since a given date (thus leaving those that have been).
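A minimal sketch of that filter in Python, assuming each package record has already been read from the database into a dict and that its build date is stored as seconds since the Unix epoch (the "builddate" key is a name chosen for this example):

```python
from datetime import datetime, timezone


def updated_since(packages, since):
    """Return the packages whose build date is at or after 'since'.

    packages: iterable of dicts with at least a "builddate" entry holding
              seconds since the Unix epoch (int or numeric string)
    since:    timezone-aware datetime
    """
    cutoff = since.timestamp()
    return [pkg for pkg in packages if int(pkg["builddate"]) >= cutoff]


# Example: everything built since the start of 2022 (UTC)
START_OF_2022 = datetime(2022, 1, 1, tzinfo=timezone.utc)
```

This only tells you which packages changed, not how many intermediate versions each one went through.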
Last edited by Trilby (2022-04-12 14:12:20)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
a821 wrote: can be easily parsed in Python. I recall that someone even wrote a Python library for that and posted it on this forum.
First, download the *.db files from a mirror:
https://bbs.archlinux.org/viewtopic.php … 4#p1969414
Last edited by papajoke (2022-04-12 16:49:23)
lts - zsh - Kde - Intel Core i3 - 6Go RAM - GeForce 405 video-nouveau
Offline
a821 wrote: can be easily parsed in Python. I recall that someone even wrote a Python library for that and posted it on this forum.
First, download the *.db files from a mirror:
https://bbs.archlinux.org/viewtopic.php … 4#p1969414
Thanks, I was sure I had seen it posted here somewhere
Offline
The Arch Linux Archive contains all released versions of a package since it was released, e.g. https://archive.archlinux.org/packages/a/abiword/
Apart from the checksums, you can get the information you need directly from there.
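For a single package, the versions can be read straight off the file names in such a directory listing. The sketch below assumes the index is a plain HTML page of links and that package files follow the name-pkgver-pkgrel-arch.pkg.tar.* naming convention; the helper names are made up for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Matches package files but not their detached ".sig" signatures
PKG_RE = re.compile(r"^(?P<stem>.+)\.pkg\.tar(?:\.[a-z0-9]+)?$")


def split_package_filename(filename):
    """Split 'name-pkgver-pkgrel-arch.pkg.tar.*' into its parts, or return None."""
    match = PKG_RE.match(filename)
    if match is None:
        return None
    parts = match.group("stem").rsplit("-", 3)
    if len(parts) != 4:
        return None
    name, pkgver, pkgrel, arch = parts
    return {"name": name, "version": f"{pkgver}-{pkgrel}", "arch": arch}


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(unquote(href))


def versions_from_index(html):
    """Return the parsed package files linked from one directory index page."""
    collector = LinkCollector()
    collector.feed(html)
    parsed = (split_package_filename(href) for href in collector.hrefs)
    return [info for info in parsed if info is not None]
```

Splitting from the right works because the package name may contain hyphens while pkgver, pkgrel, and the architecture may not.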
Mods are just community members who have the occasionally necessary option to move threads around and edit posts. -- Trilby
Offline
I doubt you really want literally just the 'core' repository. Arch Linux has one repository actually named "core", but it would never be used on its own (it might be theoretically possible, but it'd not be supported, and that's just not how it's meant to be used). In the more general meaning of the word, Arch Linux has three core repos that are the default or base for the distro: core, extra, and community.
Right, so Arch repos are core + extra + community
For your "new releases since a given date" question, whether you can answer this with just the repo database depends on exactly what is meant by the question. For example, if you want to know every incremental update to package X that has happened in the last 6 months, that will not be in the database file. However, if you just want to know every package that has been updated within the past 6 months, that *is* in the database. Specifically, the database will include the *current* version of every package along with the date it was built / added to the repo, so this can be used to filter out any packages that have not been updated since a given date (thus leaving those that have been).
Behind the scenes, discovering the list of all packages and their existing versions is a two-step process. First, get the whole list through a database, API, DVCS, or whatever source we can trust. This step is done once and sets a last-visit date.
Second, regularly launch an update that compares what's new since the last visit.
In a DVCS it's quite easy to get a list of the files changed since a date.
With an API it depends, but generally speaking there are date fields.
If there is no way to get more than the last 6 months of versions, well, it's OK, we must start somewhere :-)
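For the DVCS case, the "changed files since a date" step might look like the following sketch, which shells out to git log with --since and --name-only. The assumption that each package lives under a top-level directory containing a PKGBUILD somewhere below it is a guess about the repository layout, and the function names are illustrative.

```python
import subprocess


def parse_changed_paths(log_output):
    """Turn 'git log --name-only --pretty=format:' output into unique sorted paths."""
    return sorted({line.strip() for line in log_output.splitlines() if line.strip()})


def changed_packages(paths):
    """Return top-level directories whose PKGBUILD changed (layout is an assumption)."""
    return sorted({p.split("/", 1)[0] for p in paths if p.endswith("/PKGBUILD")})


def git_changed_paths(repo_dir, since):
    """List the files touched by commits newer than 'since' in a local clone."""
    cmd = ["git", "-C", repo_dir, "log", f"--since={since}",
           "--name-only", "--pretty=format:"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return parse_changed_paths(result.stdout)
```

A regular update would then run git fetch, call git_changed_paths with the last-visit date, and re-read only the packages returned by changed_packages.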
Offline
Interesting, Thanks!
a821 wrote: can be easily parsed in Python. I recall that someone even wrote a Python library for that and posted it on this forum.
First, download the *.db files from a mirror:
https://bbs.archlinux.org/viewtopic.php … 4#p1969414
Offline
Given your goals I suspect the svn/git repos of the build files would be best. This will have all the history of every package in the given repo. So you'd clone three svn (or git mirror) repos of core, extra, and community, then all the data you need is there.
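If you go this way, the version has to be pulled out of each PKGBUILD. A PKGBUILD is a bash script, so a regex can only ever be a rough approximation; the deliberately naive sketch below reads plain literal assignments and skips arrays and anything that needs shell expansion. The function names are made up for this example.

```python
import re

# Only top-level, single-line assignments of these four variables are considered
ASSIGN_RE = re.compile(r"^(pkgname|pkgver|pkgrel|epoch)=(.*)$", re.MULTILINE)


def read_simple_fields(pkgbuild_text):
    """Extract literal pkgname/pkgver/pkgrel/epoch values from PKGBUILD text."""
    fields = {}
    for key, raw in ASSIGN_RE.findall(pkgbuild_text):
        value = raw.strip().strip("'\"")
        # Arrays (split packages) and shell variables cannot be resolved here
        if value.startswith("(") or "$" in value:
            continue
        fields.setdefault(key, value)  # keep the first assignment only
    return fields


def full_version(fields):
    """Build '[epoch:]pkgver-pkgrel' from the extracted fields."""
    version = f"{fields['pkgver']}-{fields['pkgrel']}"
    if fields.get("epoch"):
        version = f"{fields['epoch']}:{version}"
    return version
```

Anything this misses (split packages, computed versions, trailing comments on the assignment line) would need a real shell evaluation, which is the cost the structured binary database avoids.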
The Arch Linux Archive was mentioned, but you'd have to iterate through *every* individual package to get a list of dates/versions from different URLs for each package. For a single package, this is great, but for all packages it'd be a bad idea.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
Good point. I will experiment with svn/git first and post my progress here.
Given your goals I suspect the svn/git repos of the build files would be best. This will have all the history of every package in the given repo. So you'd clone three svn (or git mirror) repos of core, extra, and community, then all the data you need is there.
The Arch Linux Archive was mentioned, but you'd have to iterate through *every* individual package to get a list of dates/versions from different URLs for each package. For a single package, this is great, but for all packages it'd be a bad idea.
Offline