Pacman with db support?

phrakture · 2005-03-19 19:37:20

beniro wrote:

hell slocate does it...
Really? I thought it used a true DB...slocate is obviously fast as all get out...

Well the lines start to get blurred... basically it just creates an index file... I don't understand the format, but check out /var/lib/slocate/slocate.db

basically you can think of it like this: [filename^@^path] and it's probably in alpha order in that file... so when it reads the file, it knows it only needs to read to the separator... and probably binary searches... I don't know the implementation, and I'm making this up... it's probably much more complicated...

But the point is that something like this blurs the lines between db and flatfile... as it's really an index listing to the flatfiles (similar to VMS systems)
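
A minimal sketch of the sorted-index idea above, in Python. This is not slocate's actual on-disk format (the post itself says the layout is a guess); the record shape and the names `build_index` and `lookup` are made up for illustration.

```python
import bisect

def build_index(paths):
    """Return a sorted list of (filename, full_path) records."""
    records = []
    for path in paths:
        filename = path.rsplit("/", 1)[-1]
        records.append((filename, path))
    records.sort()
    return records

def lookup(index, filename):
    """Binary-search the sorted index for every path with this filename."""
    # The 1-tuple (filename,) sorts before any (filename, path) record,
    # so bisect_left lands on the first matching record, if there is one.
    start = bisect.bisect_left(index, (filename,))
    matches = []
    for name, path in index[start:]:
        if name != filename:
            break
        matches.append(path)
    return matches

index = build_index(["/usr/bin/unzip", "/usr/bin/zip", "/opt/tools/unzip"])
print(lookup(index, "unzip"))  # ['/opt/tools/unzip', '/usr/bin/unzip']
```

Because the records are kept in order, a lookup only touches a handful of entries instead of scanning the whole list, which is the "index listing" behaviour the post is describing.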

apeiro · 2005-03-19 22:22:41

The best technical point that's been brought up against the db idea was that all the scripts (custom and official) would have to be changed to properly read the database. That's annoying.

I have a strong hunch that a slow steady file fragmentation is happening with the pacman database, which causes slowdowns as time moves on (and package numbers grow). Typically, it's only slow the first time you run pacman after a boot, since cache will hold most of the data after that.

The reason is that pacman has to read an awful lot of small files, and as they get added/updated they start to get spread out on the filesystem and so pacman spends a long time just waiting for the i/o to finish.

[jvinet@mars ~]$ cd /var/lib/pacman
[jvinet@mars pacman]$ find . -type f | wc -l
5262
[jvinet@mars pacman]$ du -c . | tail -1
34540   total
[jvinet@mars pacman]$ echo 34540*1024/5262 | bc
6721

That's over 5000 files and only 34 MB. That takes a while to read if they're scattered about on your hard drive, especially on certain filesystems (I'm told reiserfs ain't so hot at optimizing seeks on small files. ext3 fares a little better.)
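
For anyone who wants the same numbers without the shell pipeline, here is a rough Python equivalent. It is only a sketch: it sums file sizes rather than disk blocks, so the total will not match `du` exactly, and the function name `summarize` is made up.

```python
import os

def summarize(root):
    """Count regular files under root and sum their sizes in bytes."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Mirror `find -type f`: skip symlinks and anything that is not a file.
            if os.path.isfile(full) and not os.path.islink(full):
                count += 1
                total += os.path.getsize(full)
    average = total // count if count else 0
    return count, total, average

if __name__ == "__main__":
    # Prints (file count, total bytes, average bytes per file).
    print(summarize("/var/lib/pacman"))
```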

Ideally, I'd love to have a dual system where a sqlite/bdb file can be generated as a cache from the filesystem database. Then pacman could use that for fast lookups with queries, updating the cache when it updates the master. I haven't put enough thought into this idea yet, but it may keep everybody happy if it's done right.
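
A rough sketch of what such a dual system could look like, using Python's `sqlite3` module. Everything about the layout is an assumption for illustration: a tree of `root/<repo>/<package>/<file>`, a single table called `entries`, and the function name `rebuild_cache`. The directory tree stays the master copy and the SQLite file is thrown away and regenerated.

```python
import os
import sqlite3

def rebuild_cache(root, db_path):
    """Regenerate the SQLite cache from the directory tree (the master copy)."""
    conn = sqlite3.connect(db_path)
    with conn:  # commit when the block finishes without error
        conn.execute("DROP TABLE IF EXISTS entries")
        conn.execute(
            "CREATE TABLE entries (repo TEXT, package TEXT, file TEXT, body TEXT)"
        )
        for repo in sorted(os.listdir(root)):
            repo_dir = os.path.join(root, repo)
            if not os.path.isdir(repo_dir):
                continue
            for package in sorted(os.listdir(repo_dir)):
                pkg_dir = os.path.join(repo_dir, package)
                if not os.path.isdir(pkg_dir):
                    continue
                for name in sorted(os.listdir(pkg_dir)):
                    path = os.path.join(pkg_dir, name)
                    if not os.path.isfile(path):
                        continue
                    with open(path, encoding="utf-8") as fh:
                        conn.execute(
                            "INSERT INTO entries VALUES (?, ?, ?, ?)",
                            (repo, package, name, fh.read()),
                        )
        # Index the column that lookups will filter on.
        conn.execute("CREATE INDEX entries_by_package ON entries (package)")
    return conn
```

A lookup is then one query against one file, e.g. `conn.execute("SELECT file, body FROM entries WHERE package = ?", (name,))`, and a corrupted or stale cache can simply be deleted and rebuilt.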

phrakture · 2005-03-19 23:34:13

would it be possible to combine the multiple files for each package into one? it wouldn't require much change, as you would simply use the same descriptor when parsing the different %SECTION%s... that would help just slightly with fragmentation...

Theoden · 2005-03-20 04:21:32

Well, I for one would not be in favor of putting a db backend to pacman. One of the most attractive things about Archlinux and its pacman pkg. manager is the very simplicity and seeming 'unbreakableness' of it.

Sticking a db backend on pacman will add unnecessary complexity to pacman, increase the things that can break, and in general, make heavy a very lite and efficient pkg manager.

RPM has a db backend, and it has on many occasions suffered from indexing problems and created a nightmare for end users - especially when the db or its format are updated. That's why I hate rpm - among other reasons.

I believe in the K.I.S.S. methodology ... Keep It Simple Stupid!

Let's not screw up a really good thing because one or two people choose to overload their systems and whine about it being slow for them. The vast majority of users are very happy with pacman and its simple, not complex 'ease of use' approach to pkg management.

--Theoden :shock:

kpiche · 2005-03-20 05:56:33

If the sql/db was only used as a cache to speed up queries/initial load then the problems with modifying scripts would go away. And if the db got corrupted it could be regenerated. But it's likely the added code complexity would not be worth it.

An idea I was mentally toying around with was to create one big honkin' cache file, one per repository (local.cache, current.cache, etc), that contained all data from the small files in /var/lib/pacman/*whatever*. If it exists when pacman starts it can plow through the whole thing creating the tree structs. If it's not available then read the usual /var/lib/pacman files.

It's basically the memory structure serialization to/from disk concept. Give the file a simple format (just a crappy example):

%name%
unzip
%depends%
glibc
%files%
/usr/bin/unzip
%...etc...%
...otherdetails...
%%
%name%
nextpackage

Advantages:
- it's a simple text file
- low code complexity (open, read, close)
- reduces file accesses from thousands to one per repo
- don't care if it's missing
- pacman still modifies /var/lib/pacman but regenerates the caches after transactions

Disadvantages:
- disk space ?
- file parser might be an issue, but could expand the "desc" file parser

Probably other issues I haven't thought about.... I was waiting for libification to try it out.
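
To make the serialize/parse round trip concrete, here is a minimal Python sketch for the example format above. It assumes `%section%` lines open a section, plain lines are values, and `%%` closes a package; the function names `parse_cache` and `dump_cache` are invented, and this is not pacman's actual parser.

```python
def parse_cache(text):
    """Parse the '%section%' / '%%' format into a list of dicts."""
    packages = []
    current = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "%%":  # end of one package record
            if current:
                packages.append(current)
            current, section = {}, None
        elif len(line) > 2 and line.startswith("%") and line.endswith("%"):
            section = line.strip("%").lower()
            current[section] = []
        elif line and section is not None:
            current[section].append(line)
    if current:  # the last record may lack a trailing '%%'
        packages.append(current)
    return packages

def dump_cache(packages):
    """Serialize a list of dicts back into the same text format."""
    lines = []
    for pkg in packages:
        for section, values in pkg.items():
            lines.append("%" + section + "%")
            lines.extend(values)
        lines.append("%%")
    return "\n".join(lines) + "\n"

sample = "%name%\nunzip\n%depends%\nglibc\n%files%\n/usr/bin/unzip\n%%\n%name%\nnextpackage\n"
print(parse_cache(sample))
```

One open and one sequential read per repository is all the I/O this needs, which is the whole point of the proposal.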

iphitus · 2005-03-20 06:33:31

I might just add, to use a 'proper' db in a program, *isn't* complex. It's braindead simple. It may well have been easier for judd to use a proper database from the start so that he didn't have to write his own.

The SQL syntax, used to access databases, whether it be sqlite, Berkeley DB or mysql, is very simple.

The other argument you have against databases is a chance of the database being corrupted. This would only happen if there were a mistake in the code, the same way as the current method of storing package info could be corrupted if there was an error in that code.

It wouldn't make it harder for scripts to access the database either. I know sqlite comes with a command line client that works along the lines of "sqlite /path/to/database/ sql-query" and then outputs the results of the query. And again, SQL syntax is braindead easy.

iphitus

cactus · 2005-03-20 07:18:21

SQL, while it has a small instruction set in and of itself, can lead to complex issues, especially if you are trying to maintain referential integrity and enforce any of the higher levels of "normalization".

I for one, wouldn't want to have to deal with inner and outer joins when tracking package updates...maybe that is just me. A well thought out table structure would alleviate this, but I think that is going beyond what is required here. Relational layout might not be the best here.

Since all files are easily related to one element (package name), straight up hashing is likely going to be the quickest way to go..I think a simple dbd index hash against a file system organization would probably be the best..if it is needed at all.
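
A sketch of the "keyed index over the existing filesystem layout" idea in Python. It uses `dbm.dumb` only because that module ships with every Python install; gdbm or Berkeley DB bindings would be the closer match to what is being discussed. The tree layout (`root/<repo>/<package>/`) and the function names are assumptions.

```python
import dbm.dumb
import os

def build_index(root, index_path):
    """Map each package directory name to its path in a dbm key/value file."""
    with dbm.dumb.open(index_path, "n") as db:
        for repo in sorted(os.listdir(root)):
            repo_dir = os.path.join(root, repo)
            if not os.path.isdir(repo_dir):
                continue
            for package in sorted(os.listdir(repo_dir)):
                # Later repos overwrite earlier ones with the same directory name.
                db[package.encode()] = os.path.join(repo_dir, package).encode()

def find_package(index_path, package):
    """Return the directory for a package, or None if it is not indexed."""
    with dbm.dumb.open(index_path, "r") as db:
        value = db.get(package.encode())
    return value.decode() if value is not None else None
```

The files themselves stay where they are, so existing scripts keep working; the index only answers "where is package X?" without walking the directories.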

Let's try to choose the best solution, not knee jerk "my idea so it has to rock", or "I know it so it must be easy for everyone" arguments. pacman is THE foundation of the distribution. As it goes, so goes ArchLinux.

miqorz · 2005-03-20 07:39:32

From what I'm seeing the ends just aren't justifying the means.

orelien · 2005-03-20 08:11:34

My point is that pacman still offers room for improvement. I think there are still a few areas that can be reworked (i.e. the current cache mechanism, some tweaks in the db structure could also help and it should be possible to reduce the memory footprint).

The work for the library is quite active these days, so AFAIC, my point is to wait for its release, and then try to improve it. Considering (again) a db backend should come after.
Indeed, adding such a backend to pacman has been in the todo list for a while, but we've never come to the conclusion that the benefits outweigh the overall complexity.

Sure, pacman is slow the first time it is started, but runs quite well afterwards. I admit the disk usage is high. But is it such a big deal?
I have the feeling that I'm one of the very few who stops or kills almost all of his applications (even X), before considering performing an upgrade of the system...
I can hardly imagine myself doing it while lots of applications are running.

BTW, in the days of pacman 2.7, I wrote an experimental version of pacman with support for gdbm. I was expecting far better performance when querying data, however the results were quite disappointing (but at that time, the dbs were smaller... so it may be worth a new try).

Celti · 2005-03-20 08:36:12

zeppelin wrote:

Celti wrote:
I don't see the need.~Celti
Celti and other guys. Did you read Raven's post in bugs or not? I would like to hear your comments after reading it and not before. miqorz I hope you don't attack me too
http://bugs.archlinux.org/index.php?do= … ments#4407
thank you

I read it, I duplicated the tests, and didn't see a problem with it. There may be something wrong with her system's configuration, because the longest it's *ever* taken me is 5 minutes.

cactus · 2005-03-20 22:00:16

wrote a little test in ruby. By no means is it thorough, probably not even useful..

Anyway..this is what I got. Mind you, this is with ruby, and I am not sure of their gdbm support (ie. how fast it is). I don't believe they have support for sqlite. They do have classic dbm support, but I figured gdbm would be a decent place to start..

                                      user     system      total        real
Create 100000 files:              1.610000   6.520000   8.130000 ( 22.773599)
Read filenames (like ls):         0.980000   0.050000   1.030000 (  1.236086)
Create 100000 GDBM elems:         1.680000   2.510000   4.190000 (166.082497)
Read all filenames from GDBM:     0.300000   0.080000   0.380000 (  0.402018)
Get 10 random GDBM elems:         0.000000   0.000000   0.000000 (  0.000347)
Stat 10 random files:             0.000000   0.000000   0.000000 (  0.000128)
Get 10,000 random GDBM elems:     0.110000   0.080000   0.190000 (  0.206045)
Stat 10,000 random files:         0.120000   0.020000   0.140000 (  0.171653)
Add 1 more random elem to GDBM:   0.000000   0.000000   0.000000 (  0.001311)
Add 1 more random file:           0.000000   0.000000   0.000000 (  0.000217)

Like I said, my code is probably terrible..I tried to make it impartial in either direction. You can see where file system access is better sometimes, and you can see where gdbm is better sometimes. Take the above with a grain of salt, because it likely doesn't typify "general usage".

Code

EDIT: Ran it again with more files..*cough*

                                      user     system      total        real
Create 500000 files:              8.790000  50.470000  59.260000 (1200.683372)
Create 500000 GDBM elems:         5.360000  27.830000  33.190000 (1937.958419)
Read filenames (like ls):         3.210000   0.300000   3.510000 (  3.543046)
Read all filenames from GDBM:     1.180000   0.380000   1.560000 (  1.564568)
Get 10 random GDBM elems:         0.000000   0.000000   0.000000 (  0.000429)
Stat 10 random files:             0.000000   0.000000   0.000000 (  0.000123)
Get 10,000 random GDBM elems:     0.020000   0.140000   0.160000 (  0.155662)
Stat 10,000 random files:         0.080000   0.010000   0.090000 (  0.113160)
Add 1 more random elem to GDBM:   0.000000   0.000000   0.000000 (  0.000507)
Add 1 more random file:           0.000000   0.000000   0.000000 (  0.000260)
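
For readers who want to try a comparison of this kind themselves, here is a small Python sketch of the same style of test. It is not the Ruby script used for the numbers above, it uses `dbm.dumb` instead of gdbm, and the function names and default count are arbitrary, so its timings are not comparable to the tables.

```python
import dbm.dumb
import os
import tempfile
import time

def timed(fn):
    """Run fn once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def run_benchmark(count=1000):
    """Time creating and reading `count` small files versus dbm entries."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        files_dir = os.path.join(tmp, "files")
        os.mkdir(files_dir)
        db_path = os.path.join(tmp, "index")
        names = ["pkg%06d" % i for i in range(count)]

        def create_files():
            for name in names:
                with open(os.path.join(files_dir, name), "w") as fh:
                    fh.write(name)

        def create_db():
            with dbm.dumb.open(db_path, "n") as db:
                for name in names:
                    db[name] = name

        def stat_files():
            for name in names:
                os.stat(os.path.join(files_dir, name))

        def read_db():
            with dbm.dumb.open(db_path, "r") as db:
                for name in names:
                    db[name]

        results["create files"] = timed(create_files)
        results["create db elems"] = timed(create_db)
        results["stat files"] = timed(stat_files)
        results["get db elems"] = timed(read_db)
    return results

if __name__ == "__main__":
    for label, seconds in run_benchmark().items():
        print("%-18s %.6f" % (label, seconds))
```

Note that everything here runs against a warm cache, so it says nothing about the first-run-after-boot case that the thread identifies as the slow one.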

i3839 · 2005-03-20 22:19:12

The only problem is that if the Pacman db isn't in cache, then all those files need to be read, which takes a lot of time because the hd must seek like crazy. Using a "real" db only helps in the sense that it uses a single file for all the information instead of many separate files and dirs. When the data is in ram it doesn't matter whether it's files or a db, as both are fast enough then (except if there's a bug in Pacman of course).

phrakture · 2005-03-20 23:47:48

it seems like the main problem with the current implementation is IO seek/read/write times due to using multiple small files... to me that doesn't warrant a change to a DB structure... but maybe consolidating the existing files some more

kpiche · 2005-03-21 02:09:30

phrakture wrote:

it seems like the main problem with the current implementation is IO seek/read/write times due to using multiple small files... to me that doesn't warrant a change to a DB structure... but maybe consolidating the existing files some more

Exactly! That's how I see it too...

phrakture · 2005-03-21 02:14:35

kpiche wrote:

phrakture wrote:
it seems like the main problem with the current implementation is IO seek/read/write times due to using multiple small files... to me that doesn't warrant a change to a DB structure... but maybe consolidating the existing files some more
Exactly! That's how I see it too...

I would def suggest combining all three files for a local package into one file.... after that, maybe each repo can have its own DB file... perhaps separated into sections of some format:

package glib
{
%VERSION%
1.4.10

%DEPENDS%
abacadaba

%FILES%
/etc/something/blah
}

just thinking out loud
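
A small Python sketch of the consolidation step, assuming each per-package file already contains `%SECTION%` blocks and that they can simply be concatenated inside a `package NAME { ... }` wrapper like the one above. The function name `combine_package` and the directory layout are hypothetical.

```python
import os

def combine_package(pkg_dir, name):
    """Merge every file in pkg_dir into one 'package NAME { ... }' block."""
    parts = []
    for filename in sorted(os.listdir(pkg_dir)):
        path = os.path.join(pkg_dir, filename)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as fh:
                # Trim surrounding blank lines so sections are evenly spaced.
                parts.append(fh.read().strip("\n"))
    body = "\n\n".join(part for part in parts if part)
    return "package %s\n{\n%s\n}\n" % (name, body)
```

Writing the returned blocks for every package in a repo to one file, back to back, gives the single per-repo file suggested above. A real format would also need a rule for a value line that consists of just `}`, which this sketch ignores.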

cmp · 2005-03-21 11:01:34

then i would recommend xml, it's easy to parse, human readable, and can be easily modified by thousands of free tools.

arooaroo · 2005-03-21 11:12:08

cmp wrote:

then i would recommend xml, it's easy to parse, human readable, and can be easily modified by thousands of free tools.

Whilst XML is human-readable, so are many other data formats. XML seems a tad overkill. Parsers are notoriously over-complex, and don't necessarily offer the performance boost that this whole thread is based upon.

iphitus · 2005-03-21 11:35:48

What I don't get the point of is implementing a whole new database format of our own, when it would be simpler and quicker timewise to work with an existing format like sqlite or db. Existing ones are also tuned for speed and further bugtested too.

arooaroo · 2005-03-21 12:02:37

apeiro wrote:

The best technical point that's been brought up against the db idea was that all the scripts (custom and official) would have to be changed to properly read the database. That's annoying.
[snip]
Ideally, I'd love to have a dual system where a sqlite/bdb file can be generated as a cache from the filesystem database. Then pacman could use that for fast lookups with queries, updating the cache when it updates the master. I haven't put enough thought into this idea yet, but it may keep everybody happy if it's done right.

I personally don't think the db approach is necessary. Just sounds like you need a decent indexing algorithm. Anyway, it's probably fair to say that apeiro knows best here.

If he likes the idea of db, but is put off by the idea of losing compatibility with old packages then may I point out one thing. AL is probably going to get bigger before it gets smaller. So if db approaches are on the cards, then the sooner the better, for the sake of packagers, that is.

phrakture · 2005-03-21 15:33:26

I'm going to agree with arooaroo on both of his last posts
1) xml tends to be overkill... the thing that makes xml so great is the platform/transport agnosticism, which doesn't apply that well here (we don't need to worry about little/big endian, and things of that nature...). considering the xml approach is still reading files from disk, it's going to actually impede performance compared to the current implementation
2) I think switching to a db backend isn't needed. It's going to be another layer, which could break... right now the only real way to break the pacman db is through disk failure (or deleting the db, which I've done once...)
3) judd knows best, and I'll leave the decision up to him, for now I'm happy with throwing ideas to the wind, just in case one of them has potential

shadowhand · 2005-03-22 01:26:04

I really didn't want to get into this discussion, but after reading 3-5 pages of this, I'm going to force myself to.

From what I understand here, the core issue is that pacman uses lots and lots of small files to keep track of available / installed packages. (Personally, I don't see a big issue with pacman, it's running exactly the same as it did when I first installed Arch, although sometimes it takes a bit longer to finish than it did. Granted, I have KDE, Gnome, XFCE, Enlightenment, and Openbox + other applications installed, so I can understand a couple extra seconds to make sure everything is still running fine.) Now, I don't believe that tying a database to a package manager makes it faster or more stable, but I do believe that it makes it more efficient. Both RPM and DPKG suffer from large numbers of packages, installed or just available.

I think there are other solutions that aren't being looked at here though. Remember, I'm not a programmer or a hacker, and I only understand package management at a very basic level. I'm just throwing out ideas for discussion because I think they haven't been addressed yet.

Lunar Linux uses a flat file system instead of multiple files and it's very fast. (It also has its own package management system.)

Option 1: Use a single index file with the status of all tracked packages (installed [yes/no], etc.)

Separation of the package manager and the package downloader/tracker. That's a terrible way of simplifying what APT does, but I think it works on some level. APT uses its own database (right?), separate from RPM/DPKG (APT is available for both), and tracks its own packages.

Option 2: Is there some way to separate the functions for installing packages from the program that downloads/tracks packages? (I have a feeling that Judd would be against this because it basically means turning Arch into Slackware [pkginstall compared to slapt-get basically] which I understand to be one of the reasons Judd wrote pacman.)

Relational database using a very small database application. This is basically the beginning of the discussion, but I just wanted to add a little. It appears that the main argument(s) here is KISS. Corruption of databases would be no fun, but then again, using a stable database program and the coding prowess of Judd, why are we worried about that? (Ick, the taste on my tongue... I hate brown-nosing.) It appears that BerkDB is about 5 MB. The GCC package is about 6 times that size... is there really a worry that including a database program is invading your computer? If so, remind me again why you need vi and emacs, or gcc for that matter.

Option 3: Use a very simple database with a very small and simple database program to keep track of Arch packages. There are issues involved with doing this, but at the same time, you are future proofing Arch.

In a lot of ways, this whole discussion reminds me of the discussion about libifying pacman. I think there are some people around here who want Arch to stay small for various reasons. I hate to be the messenger, because the messenger always gets shot, but Arch is not going to stay small. It's a wonderful distro, it's simple, and it's powerful. There are a lot of knowledgeable people around here that will continue to make Arch more and more viable in the so-called "distro wars". Arch is strong in all the places where other distros are weak and more and more people are going to start seeing that. Decisions like this one, and the decision to libify pacman, are extremely important because they open Arch up to the sort of things that make other distros weaker. I don't think there is any one right answer here. This decision has to be made carefully and with a lot of thought about the implications in weeks, months, and years.

Is Arch ready to be as big as Debian and still be bleeding edge? Is pacman strong enough to be able to withstand as many users as Fedora and still be fast and stable? Is the community ready to be this supportive and active without crushing itself with a thousand or ten thousand new users?

(We now return you to your regularly scheduled programming. I need a cigarette.)

shadowhand · 2005-03-22 03:30:59

Did I destroy the discussion on this? I'm really sorry if I did.... I was hoping to stimulate it, not ruin it. :oops:

phrakture · 2005-03-22 04:41:46

naw, its night time 8)

cactus · 2005-03-22 07:42:44

Basically what you said was a rehash of what other people have said already. Not much new info in there, so don't feel like you killed anything..you didn't.

shadowhand · 2005-03-22 18:44:48

cactus wrote:

Basically what you said was a rehash of what other people have said already. Not much new info in there, so don't feel like you killed anything..you didn't.

Basically, yes, but there are new points to think about. I didn't expect that short of a response from you Cactus.
