#1 2014-12-10 19:12:55

maggie
Member
Registered: 2011-02-12
Posts: 255

Program to download and archive an entire website

I am looking for a program that will automatically download all pages within a website, including pictures and text. I have been doing this by hand, simply saving each page, but there must be a smarter way.

Offline

#2 2014-12-10 19:16:21

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,643
Website

Re: Program to download and archive an entire website

..did you even bother googling this?


CPU-optimized Linux-ck packages @ Repo-ck • AUR packages • Zsh and other configs

Offline

#3 2014-12-10 19:26:49

ugjka
Member
From: Latvia
Registered: 2014-04-01
Posts: 1,862
Website

Re: Program to download and archive an entire website

How do these kinds of programs deal with JavaScript-heavy or stream-based websites that don't have actual HTML content?


https://ugjka.net
paru > yay | vesktop > discord
pacman -S spotify-launcher
mount /dev/disk/by-...

Offline

#4 2014-12-10 19:28:22

drcouzelis
Member
From: Connecticut, USA
Registered: 2009-11-09
Posts: 4,092
Website

Re: Program to download and archive an entire website

It can be hard to search when you don't really know what they're called. I know them as "web crawlers". The only one I've used is httrack. It's both easy and difficult to use. Thanks to this stupid age of "Web 2.0", deciding how much of the website the program should download can be tricky. hmm
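A minimal httrack invocation looks something like this (example.com and the output directory are placeholders):

httrack "http://example.com/" -O ~/website-mirror

By default it mirrors the site into that directory and rewrites the links so the copy can be browsed locally.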

EDIT: I bet there are convenient Firefox addons to do this as well. wink

Last edited by drcouzelis (2014-12-10 19:28:57)

Offline

#5 2014-12-10 19:35:29

maggie
Member
Registered: 2011-02-12
Posts: 255

Re: Program to download and archive an entire website

I found some like wget and httrack thanks to graysky's link. I just didn't know what to call them! I cannot get either of those to work though.

wget -r -Nc -mk URL

That only downloaded the first page, not any of the 1, 2, 3 page links inside the main page.

httrack URL -W -O "~/website" -%v

That did the same as wget.

Mirror launched on Wed, 10 Dec 2014 14:30:49 by HTTrack Website Copier/3.48-19 [XR&CO'2014]
Bytes saved:         210,64KiB                 Links scanned: 12/13 (+0)
Time:                10s                       Files written: 10
Transfer rate:       13,33KiB/s (21,19KiB/s)   Files updated: 0
Active connections:  1                         Errors:        0

Maybe the website has a robots.txt that breaks this?
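If robots.txt is the culprit, both tools can be told to ignore it. A sketch, with example.com as a placeholder (only on a site you're allowed to mirror):

wget -r -k -p -e robots=off http://example.com/

httrack http://example.com/ -O ~/website -s0

For wget, -e robots=off disables robots.txt handling; for httrack, -s0 means never follow robots.txt rules.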

Offline

#6 2014-12-10 19:57:39

maggie
Member
Registered: 2011-02-12
Posts: 255

Re: Program to download and archive an entire website

I notice now that the website uses something in the URL to bring up pages 2, 3, and so on:

http://url.com?2

and

http://url.com?3

and so on. I think that is messing up the programs.
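If the crawlers won't follow those query-string links, a crude workaround is to fetch each page explicitly in a shell loop. A sketch assuming ten pages, with url.com kept as the placeholder from above:

for i in $(seq 2 10); do
    wget -p -k "http://url.com?$i"
done

wget keeps the query string in the saved file name (e.g. index.html?2), so each page lands in its own file.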

Offline

#7 2014-12-10 20:14:07

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Program to download and archive an entire website

Offline

#8 2014-12-11 07:36:55

archer108
Member
From: Zurich Switzerland
Registered: 2007-08-28
Posts: 59

Re: Program to download and archive an entire website

wget can do it for you. Like this one, for google.com:

nohup wget --limit-rate=2000k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla google.com & tail -f nohup.out
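A few notes on those flags: -r recurses through links, -p also grabs page requisites such as images and CSS, -E saves pages with an .html extension, -e robots=off ignores robots.txt, -U mozilla sends a browser-like user agent, --random-wait spaces out requests, and --limit-rate caps the bandwidth used. The leading nohup is what creates the nohup.out file that the tail -f at the end follows. Note that wget warns that --no-clobber is ignored when --convert-links is also given.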

Offline
