Downloadable Arch Wiki

CPUnltd · 2010-09-29 00:53:07

What about a script to convert all the pages to PDF, if possible in a way that preserves relative links, like OpenOffice (...I mean LibreOffice...) can do? PDF'ing and packaging the whole wiki would be a nice option. I'm not sure about the size, but it would be compact, portable and easy to access. I know I've seen a script that can take multiple PDFs and merge them into a single PDF with page breaks between them, so I think it's an option worth discussing. Personally, I think keeping the PDFs separate, but in the same location, would be best, as it would allow multi-part packaging of the whole wiki and multi-threaded bulk downloading of the files.

karol · 2010-09-29 01:29:19

CPUnltd wrote:

What about a script to convert all the pages to PDF, if possible in a way that preserves relative links, like OpenOffice (...I mean LibreOffice...) can do? PDF'ing and packaging the whole wiki would be a nice option. I'm not sure about the size, but it would be compact, portable and easy to access. I know I've seen a script that can take multiple PDFs and merge them into a single PDF with page breaks between them, so I think it's an option worth discussing. Personally, I think keeping the PDFs separate, but in the same location, would be best, as it would allow multi-part packaging of the whole wiki and multi-threaded bulk downloading of the files.

The PDF format would surely add to the size. The current wiki docs package is only 7.5 MB, so no download tricks are necessary.
I was asking about searching the local wiki. How do you do that?

crouse · 2010-11-19 20:08:38

Pierre wrote:

@karol: No, wget is broken. It cannot handle wildcard certs. Disable that check or use something like curl.
On-topic: Also have a look at http://meta.wikimedia.org/wiki/Data_dumps
From my side, I could create such a dump regularly. I could even maintain a simple package so it gets distributed over the mirrors.
Now it's up to you to come up with a working concept. :-)

Personally, I think it would be GREAT to allow users to mirror the Arch Linux MediaWiki.
We would then never suffer the fate of Gentoo's wiki, and when the official wiki is offline,
people would still have the information available.

Setting up a MediaWiki is trivial. All we really need is the database.

Here is how I think it could be done.
--------------------------------------------------------------------------------

mysql> create database archwikibackup;

mysqldump -u root -p archwiki > archwiki.sql
mysql -u root -p archwikibackup < archwiki.sql

mysql> use archwikibackup;

## Now we are going to remove any information that people should NOT have:
## Just substitute the word "Replaced" where passwords are etc....

UPDATE user SET user_email = 'Replaced' WHERE user_email LIKE '%';
UPDATE user SET user_password = 'Replaced' WHERE user_password LIKE '%';
UPDATE user SET user_newpassword = 'Replaced' WHERE user_newpassword LIKE '%';
UPDATE user SET user_options = 'Replaced' WHERE user_options LIKE '%';
UPDATE user SET user_touched = 'Replaced' WHERE user_touched LIKE '%';
UPDATE user SET user_token = 'Replaced' WHERE user_token LIKE '%';
UPDATE user SET user_email_authenticated = 'Replaced' WHERE user_email_authenticated LIKE '%';
UPDATE user SET user_email_token = 'Replaced' WHERE user_email_token LIKE '%';
UPDATE user SET user_email_token_expires = 'Replaced' WHERE user_email_token_expires LIKE '%';

## Dump out the modified database
mysqldump -u root -p archwikibackup > archwikibackup.sql

## gzip the database to reduce file size
## This can reduce the size of the file by up to 80%

gzip -9 archwikibackup.sql

NOW..... anyone could have the wiki database and could set up a wiki.
Download the MediaWiki software from http://www.mediawiki.org/wiki/Download
Download the archwikibackup.sql.gz file

gunzip archwikibackup.sql.gz
mysql -u root -p YourDataBase < archwikibackup.sql

Once gzipped, the SQL data should be a fairly small file.

I would be happy to test this at archlinux.us or archlinux.me if you want to do this.
All of the above could of course be scripted into a nightly dump, so anyone who wanted to mirror
the wiki could. Perhaps a day behind, but they could. I could mirror the SQL file so others could
download it from my servers as well, if bandwidth is an issue.
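
The nightly job described above could be sketched roughly like this. It is only an illustration under assumptions: the `run` helper, the `DRY_RUN` switch and the `sanitize.sql` file (holding the UPDATE lines) are made-up names, and credentials are assumed to come from a `~/.my.cnf` option file so that cron does not hit a password prompt. By default it only prints the steps.

```shell
#!/bin/sh
# Sketch of a nightly sanitized dump. All names are placeholders, not the
# real server setup. DRY_RUN=1 (the default) only prints each step.
SRC_DB="${SRC_DB:-archwiki}"
TMP_DB="${TMP_DB:-archwikibackup}"
OUT="${OUT:-archwikibackup.sql}"
SANITIZE_SQL="${SANITIZE_SQL:-sanitize.sql}"   # hypothetical file with the UPDATE lines
DRY_RUN="${DRY_RUN:-1}"

# Print a step, and execute it only when DRY_RUN is 0.
run() {
    printf '%s\n' "+ $1"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$1"
    fi
}

# Copy the wiki into the scratch database, scrub it there, then dump and compress.
nightly_dump() {
    run "mysql -e 'CREATE DATABASE IF NOT EXISTS $TMP_DB'" &&
    run "mysqldump $SRC_DB | mysql $TMP_DB" &&
    run "mysql $TMP_DB < $SANITIZE_SQL" &&
    run "mysqldump $TMP_DB | gzip -9 > $OUT.gz"
}

nightly_dump
```

Running it as-is just lists the four steps. Setting DRY_RUN=0 in the cron entry would execute them for real.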

CPUnltd · 2010-12-12 06:05:18

That would be cool, but I'm not sure how much work it would require...

Is there a way to bundle the wiki with a very lightweight browser like dillo? Sort of an all-in-one package complete with its own browser?

Also (this would be A LOT OF WORK, but VERY cool and EXTREMELY useful), what about creating an ncurses copy of the wiki that could be accessed with something like lynx or another text browser (or just be run from the CLI on its own)? Again, I know it would be extensive work to remake the whole wiki in such a way, but it would give those with a fresh (no-GUI) install access to the wiki itself. Having a package would make it accessible offline, especially if the person could just grab the pkg.tar.xz and install it with whatever deps are needed to get functionality similar to the standard wiki (at least navigation, if not search as well) in a CLI-friendly form. What do you guys think?

crouse · 2010-12-12 23:55:29

I'd rather have a database dump, done as I stated above. Not much work to do that really, it could be automated, then set as a cron job on the server.
This way there could be several other sources for the wiki if the official version went offline for any reason, and none of them would have user password information or any other data that shouldn't be let out..... that's what those UPDATE lines were for in my above post.

I'm not sure any of this is at the top of the wiki maintainers' list of things to do though... and that's fine. I just wanted to point out that simple solutions exist that aren't that complicated if they do want to make it available, and that I am more than willing to host the dump if bandwidth is an issue.

If nothing else, I would definitely like to mirror this on archlinux.us/wiki for the times when the official wiki is down.

All of this I've described assumes they have put the wiki in its own database, which I am assuming they did.

Pierre · 2010-12-13 00:16:12

For public dumps I'd prefer the method I described earlier like in https://bbs.archlinux.org/viewtopic.php … 08#p804608

A complete compressed sql dump is 570MB atm with caches and temp data removed. It's also not as easy as you think. The DB is rather big and your script will not only block the db for some time but also fill up the table space quite soon (hint: creating and dropping a table does not free the space used)

crouse · 2010-12-14 19:21:05

Hmmm, well, I don't see why the database would need to be 570MB compressed. That would be over a gig uncompressed, I'm guessing..... there must be a lot of images in it..... I'd have to do some research, but it should be possible to dump everything except the images etc. that are not a primary concern (i.e. your method doesn't download those either, as far as I can tell). The size would then go down considerably.

I'm not quite sure what you're referring to when you say that "creating and dropping a table does not free the space used" ... not that it matters. In the solution I proposed above, I actually suggested creating a "temp" database, modifying it, then dumping and compressing it. You would dump the original database, load it into the new database, modify, then dump and compress ........ all of the modifications happen in the newly created database, not the original.

I'm guessing the dump of a rather large database is I/O bound and that's why it's slow... it could be dumped to a ramdisk if you have the RAM available, which would limit the downtime of the database while it dumps.

If you are dumping InnoDB tables (I'm guessing you are), you can use the --single-transaction option, and your database isn't locked at all. From the mysqldump documentation for --lock-tables:

Lock all tables before dumping them. The tables are locked with READ
LOCAL to allow concurrent inserts in the case of MyISAM tables. For
transactional tables such as InnoDB and BDB, --single-transaction is
a much better option, because it does not need to lock the tables at
all.

If you're NOT using InnoDB tables... you can use "--lock-tables=false".
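
To make the two cases concrete, here is a small sketch that picks the option from the storage engine and prints the resulting command without running it. `dump_lock_option` and `dump_command` are made-up helper names, and the database name is only the example used earlier in this thread.

```shell
#!/bin/sh
# Pick the mysqldump locking option from the storage engine name.
# Sketch only: these helpers are illustrative, not part of MySQL.
dump_lock_option() {
    case "$1" in
        InnoDB|innodb) echo "--single-transaction" ;;   # consistent snapshot, no table locks
        *)             echo "--lock-tables=false" ;;    # skip locking for other engines
    esac
}

# Print (not run) the dump command for a database ($1) and engine ($2).
dump_command() {
    echo "mysqldump $(dump_lock_option "$2") -u root -p $1"
}

dump_command archwiki InnoDB
dump_command archwiki MyISAM
```

The first call prints the --single-transaction variant and the second prints the --lock-tables=false variant.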

The dump wouldn't have to be "perfect" ... i.e., if someone is in the process of changing something, the worst that would happen is that your backup wiki db would have a corrupt entry... pretty unlikely, though.

My method is "easy" ..... I've done it. Other than the database size, I really see no issues. Dumping and compressing my 384MB database takes about 4 minutes. Again, I'd have to dig into the wiki db and see where the uploaded images etc. are stored ... but that table could be excluded, making the database MUCH MUCH smaller as well.

I did look at your proposed "solution" ....... that would be much harder to implement than my method.
But of course that all depends on your perspective, doesn't it?

-----------
EDIT:

After getting home, I took the time to look through the wiki for "files" ... there are fewer than 50 files that I could see, the largest of which is only 281KB.
This would mean that the wiki database is almost entirely text. Text compresses by approx 75%, which would mean the wiki database dump is roughly 2 GB of text ??????

It appears that there are roughly 4,500 wiki entries. If the compressed dump is 570MB and the uncompressed one is close to 2 gigs, the average wiki entry would hold about half a million characters, which would of course include every revision of the document. I'm not sure how MediaWiki stores each entry.

No matter what, that is a lot of data.
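
The estimate above can be checked with a few lines of shell arithmetic. The 75% ratio and the 4,500 entries are the rough guesses from this post, not measured values, and the variable names are just for illustration.

```shell
#!/bin/sh
# Back-of-the-envelope check of the numbers above (integer arithmetic).
compressed_mb=570     # Pierre's figure for the compressed dump
ratio_pct=75          # assumed compression ratio for text
entries=4500          # rough count of wiki entries

# uncompressed = compressed / (1 - ratio)
uncompressed_mb=$(( compressed_mb * 100 / (100 - ratio_pct) ))
# average size per entry in KB, taking 1 MB = 1000 KB
per_entry_kb=$(( uncompressed_mb * 1000 / entries ))

echo "uncompressed: ${uncompressed_mb} MB"
echo "per entry: ${per_entry_kb} KB"
```

That works out to 2280 MB uncompressed and about 506 KB per entry, so the half-million-characters figure is in the right ballpark.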

Last edited by crouse (2010-12-15 01:23:35)

irtigor · 2011-01-21 03:35:54

A ZIM file would be nice: http://www.openzim.org/Main_Page

CPUnltd · 2011-01-21 07:18:12

Wow, I never knew about that.... I actually REALLY like that idea! I plan to look through it to see how to make such a file, and I'll look around the AUR to see if there's already a package for the reader (and hopefully the tools to make ZIM files).

Well, that was quick... Kiwix (the ZIM file reader) is already in the AUR and zim (the editor) is in community... so now the question is: who is willing/able to take up the task of making the wiki available offline via openZIM?

Last edited by CPUnltd (2011-01-21 07:22:07)

keenerd · 2011-03-05 15:55:30

Check out arch-wiki-lite in the AUR. It is a replacement for arch-wiki-docs.

The main selling point is the size: only 9MB compared to the 83MB of arch-wiki-docs. It also has a built-in search (15x faster than grep -i) and a built-in viewer, both usable from the console.
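
For comparison, the plain grep -i baseline against the HTML pages of arch-wiki-docs looks roughly like this. The directory is an assumption (adjust it to wherever the pages are installed on your system) and `wiki_grep` is a made-up helper name. This is not how arch-wiki-lite searches, just the thing it is being measured against.

```shell
#!/bin/sh
# Baseline search: case-insensitive grep over a tree of HTML pages.
# WIKI_DIR is an assumed install location, not a guaranteed path.
WIKI_DIR="${WIKI_DIR:-/usr/share/doc/arch-wiki/html}"

# List the files under a directory (default: $WIKI_DIR) that contain the term.
wiki_grep() {
    grep -r -i -l -- "$1" "${2:-$WIKI_DIR}"
}
```

Something like `wiki_grep wpa_supplicant` then prints the matching page files, one per line.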

Prebuilt package: http://kmkeen.com/tmp/arch-wiki-lite-20 … pkg.tar.xz

Last edited by keenerd (2011-03-05 22:45:27)

CPUnltd · 2011-03-05 16:04:12

pretty cool concept... definitely interesting...

EDIT: Should I be getting these messages during the build process?

warning, empty file: 00008568.html
warning, empty file: 00009835.html
warning, empty file: 00010673.html
warning, empty file: 00006712.html
warning, empty file: 00010409.html
warning, empty file: 00005681.html

Last edited by CPUnltd (2011-03-05 16:06:03)

keenerd · 2011-03-05 16:36:46

Yeah. I am relieved that someone else is getting those warnings. It is a bug in arch-wiki-docs and not the death cries of my hard drive.

CPUnltd · 2011-03-05 16:40:45

HAHA... I'd hope it's not a death cry of a hard drive... I got an SSD that's pretty new on this system, that would be a horrible thing.

MoonSwan · 2011-08-19 21:32:29

Installed this with "pacaur" and it worked right off. Thank you Mr Keenerd for the wiki!

quozlqr · 2012-11-27 02:15:15

Hi. I run Debian and wanted to thank you all. Arch's wiki is a very valuable resource for the whole Linux community, no matter what distro. irtigor proposed converting it to a ZIM file. If anyone is willing to work on this I would be really grateful, as I use Kiwix (http://kiwix.org) to read wiki dumps offline (mostly Wikipedia, in several languages), and a full offline copy of Arch's wiki is something I've been looking for for a long time.

Listing it on Kiwix's list of available downloads would be OK, too.

PythonShell · 2014-03-03 15:40:00

It is awful that my network cannot access wiki.archlinux.org, even though I can install any package I want. It is awesome that this topic solves the problem.
The Arch Linux wiki saves me a lot of time, great thanks.

karol · 2014-03-03 15:46:05

If you prefer arch-wiki-lite, it's in the repos now: https://www.archlinux.org/packages/comm … wiki-lite/

ewaller · 2014-03-03 15:57:15

This thread has become rather long in the tooth. I am going to close it as much has changed in the 4+ years since it started. If there is a need for further discussion, please start a new thread.

Thanks.
