Additional meta-data for Pacman?

jernst · 2015-01-01 00:44:26

The .db files contain a bunch of package metadata that is useful for a user to have without needing to download the package. Question:

is there a mechanism to extend the types of metadata that can be made available to the user via the .db files? (I think the answer is "no" although it should be fairly straightforward by defining additional % fields if one was willing to extend pacman and possibly makepkg)
am I the first person to ask for that kind of thing, or has this been discussed before?
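
For context, each package entry in a sync .db is a plain-text desc file in which a %FIELD% header is followed by its value lines and then a blank line. Below is a minimal Python sketch of reading such an entry, not pacman's own code. The %PROTOCOL% field is hypothetical (the kind of extension being asked about); pacman and repo-add do not define it.

```python
def parse_desc(text):
    """Parse a pacman-style desc entry into {field: [values]}."""
    fields = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current field
        elif (current is None and len(line) > 2
              and line.startswith("%") and line.endswith("%")):
            current = line.strip("%")  # "%NAME%" -> "NAME"
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return fields


SAMPLE = """\
%NAME%
mutt

%DESC%
Small but very powerful text-based mail client

%PROTOCOL%
imap
smtp
"""

entry = parse_desc(SAMPLE)
print(entry["NAME"])      # ['mutt']
print(entry["PROTOCOL"])  # ['imap', 'smtp']
```

A reader written this way keeps fields it does not know about, so the open question is less about parsing and more about whether repo-add and libalpm would carry such fields through.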

Rationale: I'd like to make it simple for the user to ask queries such as:

get me the list of packages that can make use of my board's GPIO pins
get me the list of packages that support Zigbee (or, insert your favorite protocol here)
get me a list of packages according to some query, but only show me those that don't need more than 256M of RAM

I don't know what the list of metadata fields would end up being, and I'm sure all sorts of people would use it in all sorts of ways. This question is about the ability to extend the metadata managed by pacman and friends.

(Just putting an extra file with extra metadata into the package would solve the problem of capturing the metadata, but that would require the user to download the package before knowing whether it meets what they were looking for. So I'm looking for a way of putting it into the .db files)

karol · 2015-01-01 01:13:46

Who's going to provide this info? Packages generally don't have changelogs because maintainers didn't have time to write them. Many (most?) unofficial repos don't provide metadata for pkgfile (a tool that tells you which package provides a particular file), even though creating files.db is pretty straightforward and automatic. I don't think you should expect anything beyond the package name and description.

ANOKNUSA · 2015-01-01 01:13:59

jernst wrote:

Rationale: I'd like to make it simple for the user to ask queries such as:
get me the list of packages that can make use of my board's GPIO pins
get me the list of packages that support Zigbee (or, insert your favorite protocol here)
get me a list of packages according to some query, but only show me those that don't need more than 256M of RAM

These are deceptively simple, generic queries: programs not specifically described as being for GPIO hardware can be used with it all the same; protocols supported by applications/programs are likely to be mentioned in the package description; and the amount of RAM used by a program will vary depending on use case and platform. Even if this were feasible (it's not---where would you draw the line on what metadata to include, what criteria would be used and how would you get all submitters/maintainers to include it?), you'd still just get a big list of package names and 25-character descriptions. The list of programs would in many cases still be both vast and incomplete.

ids1024 · 2015-01-01 02:00:42

You can create your own program to handle this data, set up a server to host it, and try to get a community submitting data. For some metadata, an automated script might work. Or you could just have locally stored metadata you have written yourself.

Have fun coding and collecting data, and don't forget to share your result when you're done!

Or, if the data is available in some other database (such as another package manager's repo), you might be able to code something to get it from there. You would have to find a way to map that to archlinux package names, however.

Edit: My first suggestion is semi-serious. It is an interesting idea, but I doubt it would work.

Last edited by ids1024 (2015-01-01 02:57:03)

Trilby · 2015-01-01 02:14:05

I think I'm missing something. Being able to "extend" this data by adding your own field sounds nice ... until you think about it. Say you could easily add fields, just by putting them in the PKGBUILD. What would this accomplish? The fields serve no purpose unless all (or at least a majority) of packages define that same field with the same standards.

Package name, size, description, and a few other items are pretty clearly defined for all packages - and these are the fields we have. I could see potential for suggesting a new field or two to be added to this standard, but if different packages all had different fields, then the data would serve no purpose at all.

jernst · 2015-01-01 04:04:30

I'm hearing two things, the gist of which is:

no, pacman does not currently have a metadata extensibility mechanism, and it hasn't really been discussed before either.
even if such a mechanism existed (in some fashion), we doubt that would be particularly useful because we don't see that enough people would add enough metadata to enough packages.

My question was about the first one, and I have an answer, although not the one I had been secretly hoping for... Thank you.

On the second, I don't know. (I don't think anybody would know unless we tried it and saw what happens.) I like ids1024's suggestion that the metadata could be maintained separately from the package itself (and thus created by people other than the package maintainers). That scheme has worked quite well in the past for other kinds of metadata annotations (think delicious tags, or blogrolls, HTML link tags or even FB Likes). Even the "flag package out of date" on the arch package site is a form of that kind of metadata annotation.

The disadvantage of maintaining metadata separately is that it's likely to become inconsistent with the packages themselves, at least temporarily. And it leaks information about the user's searches to the provider of the metadata repository, something neatly avoided through the .db mechanism. When I asked about pacman extensions, I thought the build process (repo-add) could simply add a few more fields and the existing infrastructure could be reused with all of its advantages.

jernst · 2015-01-01 04:06:48

Trilby: not necessarily. For example, xmpp servers could be annotated with which version of xmpp they support. RSS readers which RSS, Atom etc. they support. And obviously, those kinds of metadata annotations only make sense for xmpp servers and RSS readers and not, say, libc :-)

ids1024 · 2015-01-01 04:32:37

jernst wrote:

On the second, I don't know. (I don't think anybody would know unless we tried it and saw what happens.) I like ids1024's suggestion that the metadata could be maintained separately from the package itself (and thus created by people other than the package maintainers). That scheme has worked quite well in the past for other kinds of metadata annotations (think delicious tags, or blogrolls, HTML link tags or even FB Likes). Even the "flag package out of date" on the arch package site is a form of that kind of metadata annotation.
The disadvantage of maintaining metadata separately is that it's likely to become inconsistent with the packages themselves, at least temporarily. And it leaks information about the user's searches to the provider of the metadata repository, something neatly avoided through the .db mechanism. When I asked about pacman extensions, I thought the build process (repo-add) could simply add a few more fields and the existing infrastructure could be reused with all of its advantages.

I do find this an interesting idea; perhaps it's worth trying. I expect few people would be involved and it would fail, but perhaps not.

Who would be interested in such a thing? (i.e. Who would be willing to spend at least a little time adding metadata to packages?)

We would need somewhere to host the project. I might throw together a prototype based on github. That would require people to be granted access to the repo or submit a pull request to submit, so we would probably want something different if it ever took off. Using git would deal with the issue of private searches, because you would just clone the repo.

ids1024 · 2015-01-01 05:40:47

I have created a prototype if anyone is interested.

The metadata is in this repo: https://github.com/ids1024/archcommunitymetadata
A program to access the metadata is in this one: https://github.com/ids1024/archcommunitymetadata-tools

$ ./communitymeta --sync
Cloning into '/home/ian/.communitymeta'...
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 0), reused 4 (delta 0)
Receiving objects: 100% (4/4), done.
Checking connectivity... done.
$ ./communitymeta protocol=imap
mutt
$ ./communitymeta protocol=rss protocol=atom
newsbeuter

The metadata format is simple. Here is the file for mutt:

protocol imap smtp
format maildir mbox mmdf mh

The first column of each line is the metadata parameter, the following ones the values. Spaces can be included using quotes.
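
A format like this can be read with very little code. The following is a minimal Python sketch, not the actual communitymeta implementation; the function name parse_metadata and the quoted "web ui" value are made up for illustration.

```python
import shlex


def parse_metadata(text):
    """Parse 'field value value ...' lines into {field: [values]}."""
    metadata = {}
    for line in text.splitlines():
        tokens = shlex.split(line)  # honours quotes, so "web ui" stays one value
        if not tokens:
            continue  # skip blank lines
        field, values = tokens[0], tokens[1:]
        metadata.setdefault(field, []).extend(values)
    return metadata


MUTT = """\
protocol imap smtp
format maildir mbox mmdf mh
"""

meta = parse_metadata(MUTT)
print(meta["protocol"])  # ['imap', 'smtp']
print(parse_metadata('interface "web ui" cli'))  # {'interface': ['web ui', 'cli']}
```

Leaning on shlex gives shell-style quoting for free; the trade-off is that an unbalanced quote raises ValueError, which a real tool would probably want to report together with the file name.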

This is a crude prototype, everything subject to change, but actual code is probably not a bad way to get a discussion going. Comments/suggestions/pull requests welcome!

ANOKNUSA · 2015-01-01 06:10:12

ids1024 wrote:

I have created a prototype if anyone is interested.
The metadata format is simple. Here is the file for mutt:
protocol imap smtp
format maildir mbox mmdf mh
The first column of each line is the metadata parameter, the following ones the values. Spaces can be included using quotes.

Suppose you want to find something to fetch and sort email via IMAP. Simply searching for 'imap' will pop up an entry called 'mutt.' Mutt is an MUA; it's used to read and write mail. You read and write mail in Emacs and so don't need an MUA, but just an MTA. It doesn't matter---any and every email-related search will result in a list that includes Mutt, because you've chosen to categorize Mutt as everything email-related simply because it has some rudimentary connection to all those things. In fact, if I were trying to set up an email server and went looking for all the pieces, Mutt would pop up as an option almost every time, even though Mutt cannot itself play any role in a mail server. You've obfuscated the actual nature and purpose of the program by applying too many labels to the package and multiplied the amount of data that needs to be managed at the same time.

I never meant to imply this could not be done somehow. I meant to imply it was a bad idea.

ids1024 · 2015-01-01 06:28:51

ANOKNUSA wrote:

ids1024 wrote:
I have created a prototype if anyone is interested.
The metadata format is simple. Here is the file for mutt:
protocol imap smtp
format maildir mbox mmdf mh
The first column of each line is the metadata parameter, the following ones the values. Spaces can be included using quotes.
Suppose you want to find something to fetch and sort email via IMAP. Simply searching for 'imap' will pop up an entry called 'mutt.' Mutt is an MUA; it's used to read and write mail. You read and write mail in Emacs and so don't need an MUA, but just an MTA. It doesn't matter---any and every email-related search will result in a list that includes Mutt, because you've chosen to categorize Mutt as everything email-related simply because it has some rudimentary connection to all those things. In fact, if I were trying to set up an email server and went looking for all the pieces, Mutt would pop up as an option almost every time, even though Mutt cannot itself play any role in a mail server. You've obfuscated the actual nature and purpose of the program by applying too many labels to the package and multiplied the amount of data that needs to be managed at the same time.
I never meant to imply this could not be done somehow. I meant to imply it was a bad idea.

You do have a point, but that would just mean some other metadata field might be useful, or you would have to live with mutt popping up in the results. There could easily be a field specifying "server". Servers that in some way support imap are probably more or less limited to imap servers. It's not perfect, but it could be useful to have some metadata beyond what pacman stores. This will of course be imperfect with anything short of a superintelligent AI with natural language processing that analyzes the source code and documentation and knows everything about every program.

I suppose an important part of such a project would be determining what metadata fields to store.

jasonwryan · 2015-01-01 06:32:03

ids1024 wrote:

I suppose an important part of such a project would be determining what metadata fields to store.

The last words heard before many a multi-million dollar IT project headed off into oblivion.

What you need now is, I am guessing here so humour me, a committee?

ids1024 · 2015-01-01 06:51:03

jasonwryan wrote:

ids1024 wrote:
I suppose an important part of such a project would be determining what metadata fields to store.
The last words heard before many a multi-million dollar IT project headed off into oblivion.
What you need now is, I am guessing here so humour me, a committee?

Yes, a committee is exactly what is needed! Who wants to join? Finding volunteers is hard, so we're accepting anyone who can vaguely define metadata.

I don't really expect this to succeed, but it is an interesting idea, and I think it's worth trying. If a committee can't be formed, I can always try using my own omniscience. I just need to remember how to do that...

jernst · 2015-01-01 21:09:08

Kudos to ids1024 who -- unlike me -- did something useful with their New Year's Eve instead of just drinking Champagne :-) Working code!!

I agree with jasonwryan's comment about the multimillion dollar metadata failures. In my experience though, those projects tend to fail because they attempted to categorize the universe instead of solving very specific problems bottom up.

So I'd like to suggest doing just that: focus on very specific use cases that have value, and only make those work. And then go from there. I suggested some of those at the beginning of this thread. Any others?

Here's another I am interested in: give me all packages that contain wordpress plugins. (not skins. or vice versa.) That could be represented as two fields: "extends=wordpress" and "extensiontype=plugin" or such.

Another: give me all packages that can read file type X (where X is an extension, or maybe a mime type).

Last edited by jernst (2015-01-01 21:10:29)

ids1024 · 2015-01-01 21:36:34

It seems that with such a project, there should be a few "core fields," which we intend to define for every package (that is the goal, though not likely the reality). There should not be many, perhaps three to five. Additional fields could be added, but there would have to be some standardization. It would be rather bad if there was no organization, and people stored the same data under different names in different packages.

Some ideas for such core fields:

protocol - what protocol(s) are supported
format - what file formats are supported (based on name, mime type, extension?)
Possibly:
gui - boolean indicating if a package is/has a gui (could be determined automatically through dependencies)
server - boolean indicating if the package is/has a server of some sort

I have adopted the standard of making all field names singular.

I do not think RAM usage would be a good idea, because that is too variable. Information on resource consumption could be useful, but would require testing on various machines, using the software in different ways. Getting a meaningful value seems unrealistic.

ids1024 · 2015-01-01 22:06:15

I have added support for some pacman fields to my script.

Examples:

$ ./communitymeta repo=core
$ ./communitymeta group=base-devel
$ ./communitymeta depend=qt4 repo=extra

jernst · 2015-01-01 22:13:35

If people stored the same thing under different names, sooner or later that would converge because it's in everybody's interest to get more value out of the same amount of data/work. (Unless there's a competitive situation, which, hopefully, we don't have here.) So I'm not worried about that one.

Agree with the gui and server etc, but would suggest "headless" (i.e. does not require keyboard/mouse/monitor) vs. "gui". The gui values could reflect the windowing systems supported. "Server" is hard because, say, an ftpd is a server, but it could be started on demand, inetd-style, and where exactly is the boundary here: echo could be a server, too.

ids1024 · 2015-01-01 22:29:18

jernst wrote:

If people stored the same thing under different names, sooner or later that would converge because it's in everybody's interest to get more value out of the same amount of data/work. (Unless there's a competitive situation, which, hopefully, we don't have here.) So I'm not worried about that one.
Agree with the gui and server etc, but would suggest "headless" (i.e. does not require keyboard/mouse/monitor) vs. "gui". The gui values could reflect the windowing systems supported. "Server" is hard because, say, an ftpd is a server, but it could be started on demand, inetd-style, and where exactly is the boundary here: echo could be a server, too.

Different names could of course be fixed with a find and replace, so I suppose it is not so much of an issue. There should be some file documenting all the fields that could be added to by contributors.

Perhaps the field should not be gui, but "interface". This could have values such as headless, cli, curses, gui (as well as gtk, qt, etc.), and webui. If the package provides several, the metadata will include them all. I agree there are problems with server, so perhaps that should be left off the list of core fields.

Have you looked at my proof of concept implementation? It is incomplete, and there is essentially no metadata yet (other than what is in the repos), but I would like to see what people think of my metadata format and the interface provided by the communitymeta program.

Last edited by ids1024 (2015-01-01 22:30:44)

progandy · 2015-01-01 22:52:12

Have you looked at my proof of concept implementation? It is incomplete, and there is essentially no metadata yet (other than what is in the repos), but I would like to see what people think of my metadata format and the interface provided by the communitymeta program.

I'd guess a document-oriented database like couchdb or mongodb would serve better than plain text files in a git repository if the project ever grows to include all arch packages and aur.

Last edited by progandy (2015-01-01 23:04:09)

ids1024 · 2015-01-01 23:03:35

progandy wrote:

Have you looked at my proof of concept implementation? It is incomplete, and there is essentially no metadata yet (other than what is in the repos), but I would like to see what people think of my metadata format and the interface provided by the communitymeta program.
I'd guess a document-oriented database like couchdb or mongodb would serve better than plain text files in a git repository if the project ever grows to include all arch packages.

Yes, certainly it would take too long to read one file for every package on each query. My current idea is to have the source files as plain text and then store the data in an sqlite database on each sync. Git could determine which files have been changed, so that they do not all have to be read and parsed on each sync. I don't know how well such a system would scale to the level of all packages arch provides.
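
One way the sqlite idea could look is sketched below in Python with the standard sqlite3 module. This is an assumption-laden illustration, not the prototype's code: the table layout, the function names build_index and find_packages, and the in-memory database are all invented here, and the sketch rebuilds the whole table rather than using git to find changed files.

```python
import sqlite3


def build_index(conn, packages):
    """(Re)build a package/field/value table from {package: {field: [values]}}."""
    conn.execute("DROP TABLE IF EXISTS metadata")
    conn.execute("CREATE TABLE metadata (package TEXT, field TEXT, value TEXT)")
    rows = [
        (pkg, field, value)
        for pkg, fields in packages.items()
        for field, values in fields.items()
        for value in values
    ]
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)
    # Index the columns that queries filter on.
    conn.execute("CREATE INDEX idx_field_value ON metadata (field, value)")
    conn.commit()


def find_packages(conn, field, value):
    """Return the sorted names of packages that have the given field=value."""
    cur = conn.execute(
        "SELECT DISTINCT package FROM metadata "
        "WHERE field = ? AND value = ? ORDER BY package",
        (field, value),
    )
    return [row[0] for row in cur]


conn = sqlite3.connect(":memory:")
build_index(conn, {
    "mutt": {"protocol": ["imap", "smtp"], "format": ["maildir", "mbox"]},
    "newsbeuter": {"protocol": ["rss", "atom"]},
})
print(find_packages(conn, "protocol", "imap"))  # ['mutt']
```

With one row per (package, field, value), a lookup touches only the index instead of opening one file per package.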

progandy · 2015-01-01 23:07:52

ids1024 wrote:

Yes, certainly it would take too long to read one file for every package on each query. My current idea is to have the source files as plain text and then store the data in an sqlite database on each sync. Git could determine which files have been changed, so that they do not all have to be read and parsed on each sync. I don't know how well such a system would scale to the level of all packages arch provides.

On the other hand, it works for the normal pacman databases, so you'll just have to see what happens when the database grows. I like the simplistic format you chose. I'd suggest adding a better query language with logical and/or and maybe grouping.

I have something else for you to consider. If you build a webservice and a client without an offline database you can analyze the queries for required metadata and improve the coverage for those specific keywords.

ids1024 · 2015-01-01 23:26:32

progandy wrote:

On the other hand, it works for the normal pacman databases, so you'll just have to see what happens when the database grows. I like the simplistic format you chose. I'd suggest adding a better query language with logical and/or and maybe grouping.
I have something else for you to consider. If you build a webservice and a client without an offline database you can analyze the queries for required metadata and improve the coverage for those specific keywords.

I think this format will do for now. It can easily be migrated to some other system, and it is best not to add unnecessary complexity.

As for the query language, adding regex support would be easy, and could help. I will work on it soon.

progandy · 2015-01-01 23:35:09

ids1024 wrote:

As for the query language, adding regex support would be easy, and could help. I will work on it soon.

I don't believe regex is necessary. I'd like to search for e.g. "( format=png or format=jpg ) and gui=yes"
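
A query of that shape needs only a small recursive-descent evaluator. The Python sketch below is one possible reading of the suggestion, not anything communitymeta implements; it assumes "and" binds tighter than "or", and the "viewer" metadata is made-up sample data.

```python
import re


def matches(query, metadata):
    """Evaluate e.g. '(format=png or format=jpg) and gui=yes'
    against one package's metadata, given as {field: [values]}."""
    # Parentheses are their own tokens; everything else splits on whitespace.
    tokens = re.findall(r"[()]|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "or":
            take()
            right = parse_and()  # always parse, so the tokens are consumed
            result = result or right
        return result

    def parse_and():
        result = parse_atom()
        while peek() == "and":
            take()
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        token = take()
        if token == "(":
            result = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return result
        field, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected field=value, got {token!r}")
        return value in metadata.get(field, [])

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


viewer = {"format": ["png", "jpg"], "gui": ["yes"]}  # hypothetical package
print(matches("( format=png or format=jpg ) and gui=yes", viewer))  # True
print(matches("format=svg and gui=yes", viewer))                    # False
```

Running matches over every package's metadata and keeping the names for which it returns True would give the result list.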

jernst · 2015-01-02 01:52:48

I think there are different questions to consider:

who creates (and augments) metadata for a particular package, and how do they do that
where does that information reside, and in which format
who "approves" / checks for correctness, and how do they do that
where and in what format does that canonical metadata reside
when a user performs a query, where does the query run? (It could be on the user's machine, or on some website)
if the metadata is replicated, how does it get sync'd (the equivalent of pacman -Sy)
how do package upgrades get managed when the set of metadata that's correct for different package versions is different?

git is an interesting choice, and certainly great for prototyping. I'm not sure though that it is the right choice for a production system. But it all depends on what answers we want for the above, and perhaps some other questions.

For example, an entirely different implementation could be to augment the code that runs on www.archlinux.org/packages so people can "tag" packages with metadata beyond "flag this package out of date". And a package maintainer could pull in the current consensus on what the right tags are every time they rebuild their package.

ids1024 · 2015-01-02 02:38:07

jernst wrote:

I think there are different questions to consider:
who creates (and augments) metadata for a particular package, and how do they do that
where does that information reside, and in which format
who "approves" / checks for correctness, and how do they do that
where and in what format does that canonical metadata reside
when a user performs a query, where does the query run? (It could be on the user's machine, or on some website)
if the metadata is replicated, how does it get sync'd (the equivalent of pacman -Sy)
how do package upgrades get managed when the set of metadata that's correct for different package versions is different?
git is an interesting choice, and certainly great for prototyping. I'm not sure though that it is the right choice for a production system. But it all depends on what answers we want for the above, and perhaps some other questions.
For example, an entirely different implementation could be to augment the code that runs on www.archlinux.org/packages so people can "tag" packages with metadata beyond "flag this package out of date". And a package maintainer could pull in the current consensus on what the right tags are every time they rebuild their package.

My current implementation is functional, and should work for now. Another system could be used later of course.

Here is how it works:
1.) Anyone creates metadata by writing it in the standard format
2.) The information is in a git repo (currently on github). The format is simple. Each line is a metadata parameter. The first column is the name of the parameter, and the following ones are values. They are separated by spaces, and spaces can be included in values by using quotes. Comments can be added with #. The format is very simple and should handle all requirements.
3.) Currently, people can submit a pull request on the repo and I can approve it. It can be moved to a github organization, and other people can be granted access to the repo. "Trusted users" essentially. Alternatively, anyone (or anyone with an account of some sort) could submit metadata without approval. That will not work as long as it is based on github, however.
4.) I'm not sure what you mean. If you're referring to the names of metadata fields, there would be a text document somewhere describing that.
5.) The client clones the metadata repo and queries are done locally.
6.) Metadata is synced with "communitymeta --sync", which is already implemented.
7.) Metadata would probably not change much between versions. If it is something likely to change every version, it should not be included, because that would be too difficult to maintain. There could be a script that checks for releases (or specifically major releases) since the metadata for a package was updated.
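
The release check mentioned in point 7 might look roughly like the Python sketch below. It rests on two loudly stated assumptions that are not part of the prototype: that each metadata file records the package version it was written against, and that the first component of pkgver marks a major release, which many upstream projects do not follow. The function names are invented for illustration.

```python
def major_version(version):
    """Return the leading pkgver component of an Arch-style
    '[epoch:]pkgver-pkgrel' version string."""
    pkgver = version.rsplit("-", 1)[0]  # drop the pkgrel suffix
    pkgver = pkgver.split(":", 1)[-1]   # drop an optional epoch prefix
    return pkgver.split(".", 1)[0]


def needs_review(annotated_version, current_version):
    """True if the package has had a major release since it was annotated."""
    return major_version(annotated_version) != major_version(current_version)


print(needs_review("1.5.23-1", "1.5.24-2"))  # False
print(needs_review("1.5.23-1", "2.0-1"))     # True
```

Such a script would only flag candidates for a human to look at; it cannot tell whether the metadata actually became wrong.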

I don't see any problem with using git. ABS (and FreeBSD ports?) uses svn. Pull requests may not be the best way of submitting data, however.

I think we should use the system I have designed at first, unless someone has a particular idea that would work better. It is a simple design and works, so we might as well try it. We can of course change it later on.

Arch Linux
