Hello, everybody
I need to build a list of open source software projects, and I thought about using the AUR database. However, I need the license and popularity information for the packages. The best I could find so far was this file: https://aur.archlinux.org/packages.gz, but it only contains the package names.
I thought about scraping the site to get this information, but the number of requests may affect the servers or get my IP flagged as suspicious, so I'm looking for a better way to get this data.
Is it possible to download the package details from all packages in aur?
Is it possible to download the package details from all packages in aur?
In theory, you might want to use the AUR RPC. In reality, however:
Searches will fail if they contain 5000 or more results.
And as there is no "offset" for the list, you wouldn't even be able to get them all in numerically chunked results. You could try your luck getting the packages starting with each letter of the alphabet (and each digit), assuming that each of these lists would have fewer than 5000 results (which I'm pretty sure would be a valid assumption).
However, you'd likely be better off using pkgstats, as that API has 'limit' and 'offset' options, making it much easier to chunk result lists and retrieve everything by looping over chunks.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
The AUR contains around 79,035 packages. At 4999 packages per query, 16 queries should suffice. You could spread them over a day, 3500 packages or so per hour, looping over your list from packages.gz. You could spread this over a week if you want to be super gentle with the API. After one pass, you get another packages.gz and query the diff. You could also filter out the -git and -bin variants of any package that also exists without a suffix.
You can download the metadata of the packages in the AUR: See https://lists.archlinux.org/pipermail/a … 36659.html
In your particular case I think you need: https://aur.archlinux.org/packages-meta-v1.json.gz
The AUR contains around 79,035 packages. At 4999 packages per query, 16 queries should suffice.
But what would those 16 queries be? Given that there is no "offset" parameter, you cannot retrieve 4999 and then pick up in the next query with the 5000th.
I'm a bit confused by the question. First things first, I was wrong about the limits.
https://wiki.archlinux.org/title/Aurweb … ace#info_2
Limitations
HTTP GET requests are limited to URI of 8190 bytes maximum length. However, the official AUR instance running on a nginx server with HTTP/2 uses the default URI maximum length limit of 4443 bytes. Info requests with more than about 200 packages as an argument will need to be split.
Search queries must be at least two characters long.
Searches will fail if they contain 5000 or more results.
The API rate is limited to a maximum of 4000 requests per day per IP.
If you take the packages.gz file, you'll have a list of all packages. You can then iterate over this list in steps of 200 packages (lines) and ask for "/rpc/?v=5&type=info&arg[]=pkg1&arg[]=pkg2&…" as described in the interface reference. At ~80k packages, that's 400 requests. The request limit is 4000.
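For illustration, the batching described above could be sketched roughly as follows. This is only a sketch under assumptions: the base URL is pieced together from the host and RPC path mentioned in this thread, `info_urls` and `BATCH_SIZE` are made-up names, and the actual fetching (with a pause between requests to stay well under the daily limit) is left out.

```python
from urllib.parse import quote

# Assumed base URL, assembled from the host and RPC path quoted in this thread.
RPC_INFO = "https://aur.archlinux.org/rpc/?v=5&type=info"
BATCH_SIZE = 200  # "about 200 packages" per info request, per the wiki excerpt


def info_urls(names, batch_size=BATCH_SIZE):
    """Yield one info-request URL per batch of package names (names is a list)."""
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # Percent-encode each name so characters like '+' survive the query string.
        args = "&".join("arg[]=" + quote(name, safe="") for name in batch)
        yield RPC_INFO + "&" + args
```

Feeding it the ~80k names from packages.gz would give roughly 400 URLs, in line with the estimate above. Unusually long package names could still push a single URL past the length limit, so checking `len(url)` before sending would be a sensible extra guard.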
Trilby, if you're worried about new packages coming in and old packages going out, one could store packages.gz and diff it against a new version in the future (or directly after the transmission).
That's assuming we're even talking about the same thing right now.
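The diff between two snapshots of packages.gz could look something like this. A minimal sketch, assuming each snapshot is a gzip-compressed text file with one package name per line; skipping lines that start with "#" is only a precaution in case the file carries a header comment, and the function names are made up for the example.

```python
import gzip


def read_names(path):
    """Return the set of package names in a gzip-compressed name list."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return {
            line.strip()
            for line in fh
            # Ignore blank lines and (as a precaution) comment lines.
            if line.strip() and not line.startswith("#")
        }


def diff_names(old, new):
    """Return (added, removed) package names between two snapshots, sorted."""
    return sorted(new - old), sorted(old - new)
```

Only the names in `added` would need a fresh info request; names in `removed` can simply be dropped from the local list.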
You can download the metadata of the packages in the AUR: See https://lists.archlinux.org/pipermail/a … 36659.html
In your particular case I think you need: https://aur.archlinux.org/packages-meta-v1.json.gz
Or you do this. Wow.
{"ID":977041,"Name":"010editor","PackageBaseID":116651,"PackageBase":"010editor","Version":"12.0.1-1","Description":"Professional text and hex editing with Binary Templates technology","URL":"https://www.sweetscape.com/010editor/","NumVotes":14,"Popularity":0.020697,"OutOfDate":null,"Maintainer":"Zrax","FirstSubmitted":1477968580,"LastModified":1634742336,"URLPath":"/cgit/aur.git/snapshot/010editor.tar.gz"},
If you don't need dependencies and whatnot, this should be plenty.
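Reading the popularity data out of that archive could be sketched like this. It assumes the file has already been downloaded and that, once decompressed, it is a single JSON array of objects shaped like the sample line above; the function names and the local path are placeholders, not anything the AUR provides.

```python
import gzip
import json


def load_packages(path):
    """Load the package list from a locally saved, gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def top_by_popularity(packages, n=10):
    """Return (Name, NumVotes, Popularity) for the n most popular packages."""
    ranked = sorted(
        packages,
        # Treat a missing or null Popularity as 0 so sorting never fails.
        key=lambda pkg: pkg.get("Popularity") or 0.0,
        reverse=True,
    )
    return [
        (pkg["Name"], pkg.get("NumVotes"), pkg.get("Popularity"))
        for pkg in ranked[:n]
    ]
```

For example, `top_by_popularity(load_packages("packages-meta-v1.json.gz"), 20)` would list the 20 most popular entries from a local copy saved under that name.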
Last edited by Awebb (2022-04-04 13:50:35)
If you take the packages.gz file, you'll have a list of all packages. You can then iterate over this list in steps...
That would be a good way of getting around the lack of an 'offset' parameter. Though assuming that metadata source works, this is all a bit tangential.
You can download the metadata of the packages in the AUR: See https://lists.archlinux.org/pipermail/a … 36659.html
In your particular case I think you need: https://aur.archlinux.org/packages-meta-v1.json.gz
Thank you very much. That's almost what I need. I don't see the license information in the metadata, but it will be helpful.
Thank you for the detailed response. I will try this approach and post the results.
I don't see the license information in the metadata, but it will be helpful.
The licenses are in the other archive, which you can find via the link in post #4 (I missed that you needed the licenses).
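Once the records from that other archive are loaded, tallying the licenses could be sketched as below. This rests on an assumption not confirmed in this thread: that each record carries a "License" key holding a list of strings, as the RPC info results do. The function name and the "(none listed)" placeholder are made up for the example.

```python
from collections import Counter


def license_counts(packages):
    """Count how often each license string appears across package records."""
    counts = Counter()
    for pkg in packages:
        # Assumed field: "License" is a list of strings; it may be missing or empty.
        licenses = pkg.get("License") or ["(none listed)"]
        counts.update(licenses)
    return counts
```

Calling `license_counts(records).most_common(10)` would then show the ten most frequent license strings in the data.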