
#26 2010-09-12 09:31:14

Dieter@be
Forum Fellow
From: Belgium
Registered: 2006-11-05
Posts: 2,001
Website

Re: Distrib -e "Arch(org|code|pkgs|aur|forum|wiki|bugs|.*)?" -- thoughts

extofme wrote:

packages are not stored in the DSCM, their contents are.  the package itself is simply a top-level tree object in git, linking to all the other trees and blobs comprising the package state, plus a reference to said tree.  this means everything and anything that is common between _any_ package and _any_ version will be reused; if ten unrelated packages reference the same file, only one copy will ever exist; blobs are the same.  however, some packages may indeed create gigantic, singular blob-type objects that always change, and this will be addressed (next...).

git compresses the individual objects itself, with zlib; this could be changed to use the xz format, or anything else.  it also generates pack files full of deltified objects, also compressed.  it would not always be necessary to have the full history of a package (if you look somewhere above, i briefly touch this point with various "kinds" of packages, some capable of source rebuild, some capable of becoming any past source/binary version, some a single version/binary only, etc.).  you would not have to retain all versions of "packages" if you did not want, but you could retrieve them at any time so long as their components existed somewhere on the network.  servers could be set up to provide all packs, all versions, effectively and automatically performing the intended duty of the "arch rollback machine".  the exact mechanism is not defined yet, but it will likely involve some sort of SHA routing protocol, to resolve missing chunks.

git's data model is stupid simple; structures can be created to represent a package, its history, its bugs/status, and its information (wiki/etc.), in an independent way so they do not depend on each other, but still relate to each other, and possess knowledge of how to "complete" and find each other.  it will not be structured in the typical way git is used now.  unfortunately this is very low level git stuff, and difficult to explain properly, so i won't go there; just know that ultimately the system will only pull the objects you need to fulfill the directive you gave it, and there will be rules to control your object cache.  your object cache can then be used to fulfill the requests of others; i.e. P2P.

interesting, I didn't know you could somehow "filter" the amount of data you get when you pull.
I guess it's a bit like when you pull a branch, it will pull in everything required for that branch?


extofme wrote:

i intend to construct a "social", 100% distributed distribution platform, where everyone is a [potentially] contributing node and has [nearly] full access to all information, each node's contributions are cryptographically verifiable, each node may easily participate in testing/discussion or lend computing resources, each node may republish variations of any object or collection under their own signature, each node may "track" or "follow" any number of signatures with configurable aggressiveness (no such thing as "official repos"; your personal "repo" is the unique overlay of other nodes you trust, and by proxy some nodes they trust; "official repos" degrade into an Arch signature, a Debian signature, Fedora, etc.), and finally, do all of this in a way that is agnostic to the customized distribution (or other package managers) above it, or its goals, and eventually spread to other distros, thus creating a monstrous pool of shared bandwidth, space, ideas, and workload, whilst at the same time converging user/developer/tester/vendor/packager/contributor/etc. toward: person.

sounds interesting.


< Daenyth> and he works prolifically
4 8 15 16 23 42


#27 2010-09-12 22:36:04

extofme
Member
From: here + now
Registered: 2009-10-10
Posts: 174
Website

Re: Distrib -e "Arch(org|code|pkgs|aur|forum|wiki|bugs|.*)?" -- thoughts

Dieter@be wrote:
extofme wrote:

packages are not stored in the DSCM, their contents are ... anything that is common between _any_ package and _any_ version will be reused; if ten unrelated packages reference the same file, only one copy will ever exist; blobs are the same.

... you would not have to retain all versions of "packages" if you did not want, but you could retrieve them at any time so long as their components existed somewhere on the network ... the exact mechanism is not defined yet, but it will likely involve some sort of SHA routing protocol, to resolve missing chunks.

... structures can be created to represent a package, its history, its bugs/status, and its information (wiki/etc.), in an independent way so they do not depend on each other, but still relate to each other, and possess knowledge of how to "complete" and find each other ... ultimately the system will only pull the objects you need to fulfill the directive you gave it ... your object cache can then be used to fulfill the requests of others; i.e. P2P.

interesting, I didn't know you could somehow "filter" the amount of data you get when you pull.
I guess it's a bit like when you pull a branch, it will pull in everything required for that branch?

man, i write way too much in these posts... meh.

this is where it gets... sticky-interesting.  at the bottom-most level, commit chains simply form a DAG [http://en.wikipedia.org/wiki/Directed_acyclic_graph], with each node linked to a (single) hierarchy [tree] of other objects (trees and blobs).  git, when used as an SCM, has a particular method of constructing the chains, and assigns a particular meaning to the collocated hierarchies between commits; i.e. the "changeset".  that's it; that's everything that is git (with the exception of tags, which are really just small objects pointing at another object, usually a commit, instead of at a tree).  thus, we have 3 core objects to make use of.
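you can poke at all three with git's plumbing in any existing repo (assuming it has at least one commit and a README at the top level):

$ git cat-file -t HEAD              # -> commit
$ git cat-file -t 'HEAD^{tree}'     # -> tree
$ git cat-file -t HEAD:README       # -> blob
$ git cat-file -p HEAD              # a commit is just a tree pointer,
                                    # zero or more parents, and a message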

so what i'm saying is that we have a rich data structure, efficient at representing overlapping sequences, and we can use this structure to assign our own meanings to collocated/ancestral/descendant relationships.  for example, we might have "special" blobs/trees/commits that are really pointers to some function or ruleset, used to decipher the meaning of a DAG subtree, or maybe some would be used for bookkeeping purposes only.  the big thing is that all objects are immutable; once created they cannot be changed, so we must find clever ways of embedding immutable pointers into the DAG that lead us toward finding other related subtrees.
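a quick way to see the content addressing (and the dedup it buys) in action, with throwaway file names:

$ echo 'same bytes' > a.txt
$ echo 'same bytes' > b.txt
$ git hash-object -w a.txt b.txt
# prints the identical id twice: identical content always hashes to the
# same object name, so .git/objects holds exactly one blob for both,
# no matter how many packages/trees end up pointing at it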

another example would be the root commit (no parent)... perhaps the first commit for each software would contain only the software's assigned UUID, and anything relating to that software would descend from it (wiki/bugs/the actual source); this way any subtree related to that UUID could be located simply by looking for chains that terminate on that particular UUID (immutable).
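minting such a root is already possible with today's plumbing; a rough sketch (the UUID-as-sole-content layout is just my proposal, not anything git itself defines):

$ git init pkg && cd pkg
$ uuidgen > UUID
$ git add UUID
$ tree=$(git write-tree)                 # tree containing only the UUID
$ echo 'root' | git commit-tree $tree    # no -p flag -> parentless commit
# the printed commit id is the immutable anchor that every chain for this
# software would eventually terminate on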

so to actually answer the question, yes and no :).  yes, in that from a standard git client, you would probably do something like:

# git sync sys/linux/2.6.30

and the app + version would be synced to your machine.  no, in that there will be some magic happening in the background: sys/linux/2.6.30 is not a real branch in the sense of git SCM, but is more like a node in /dev... when you do operations on it, stuff happens, and you get the information you want, but it would be much more involved than a simple "get everything from point A to point B", which is basically what `git pull` does.
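with stock git today, the closest approximation is probably a shallow fetch of a single ref; only objects reachable from that ref (and only its most recent history) cross the wire.  the URL and ref namespace here are hypothetical:

$ git init linux && cd linux
$ git fetch --depth 1 git://example.org/pkgs.git refs/pkgs/sys/linux/2.6.30
$ git checkout FETCH_HEAD
# git walks commit -> tree -> blob from the requested ref and transfers
# only objects you don't already have; nothing from other refs comes along

the hypothetical `git sync` would layer the rest of the magic (object cache rules, SHA routing for missing chunks) on top of that kind of selective transfer.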

Dieter@be wrote:
extofme wrote:

i intend to construct a "social", 100% distributed distribution platform, where everyone is a [potentially] contributing node and has [nearly] full access to all information, each node's contributions are cryptographically verifiable, each node may easily participate in testing/discussion or lend computing resources, each node may republish variations of any object or collection under their own signature, each node may "track" or "follow" any number of signatures with configurable aggressiveness (no such thing as "official repos"; your personal "repo" is the unique overlay of other nodes you trust, and by proxy some nodes they trust; "official repos" degrade into an Arch signature, a Debian signature, Fedora, etc.), and finally, do all of this in a way that is agnostic to the customized distribution (or other package managers) above it, or its goals, and eventually spread to other distros, thus creating a monstrous pool of shared bandwidth, space, ideas, and workload, whilst at the same time converging user/developer/tester/vendor/packager/contributor/etc. toward: person.

sounds interesting.

doesn't it :)

i am certainly not claiming to have it all worked out, and initial incarnations will probably be met with plagues of inefficiency and/or poor design choices.  for all i know a DAG will not be a good enough structure, and something else will have to be used.  i just really think something like this is the best future for us.  also, i thoroughly enjoy playing with distributed/parallel computing, and it's just plain fun to think about the possibilities a foundation of this kind would enable.

C Anthony


what am i but an extension of you?

