I am looking for a program that will automatically download all the pages within a website, including pictures and text. I have been doing this by hand, simply saving each page, but there must be a smarter way.
Offline
..did you even bother googling this?
CPU-optimized Linux-ck packages @ Repo-ck • AUR packages • Zsh and other configs
Offline
How do these kinds of programs deal with 'JavaScript-heavy'/'stream-based' websites that don't have actual HTML content?
https://ugjka.net
paru > yay | vesktop > discord
pacman -S spotify-launcher
mount /dev/disk/by-...
Offline
It can be hard to search when you don't really know what they're called. I know them as "web crawlers". The only one I've used is httrack. It's both easy and difficult to use. Thanks to this stupid age of "Web 2.0", deciding how much of the website the program should download can be tricky.
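For reference, a basic httrack mirror might look something like this (just a sketch; example.com, the output directory, and the depth limit are placeholders):
# mirror example.com into ~/mirror, ignore robots.txt, limit recursion depth to 3
httrack "http://example.com/" -O ~/mirror -s0 -r3 "+*.example.com/*" -%v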
EDIT: I bet there are convenient Firefox addons to do this as well.
Last edited by drcouzelis (2014-12-10 19:28:57)
Offline
I found some like wget and httrack thanks to graysky's link. I just didn't know what to call them! I cannot get either of those to work though.
wget -r -Nc -mk URL
That only downloaded the first page but not any of the 1, 2, 3 page links inside the main page.
httrack URL -W -O "~/website" -%v
That did the same as wget.
Mirror launched on Wed, 10 Dec 2014 14:30:49 by HTTrack Website Copier/3.48-19 [XR&CO'2014]
Bytes saved: 210,64KiB Links scanned: 12/13 (+0)
Time: 10s Files written: 10
Transfer rate: 13,33KiB/s (21,19KiB/s) Files updated: 0
Active connections: 1 Errors: 0
Maybe the website has some robots.txt that breaks this?
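If robots.txt is the issue, wget can be told to ignore it and to send a browser-like user agent; something along these lines might work (a sketch, with URL as a placeholder):
# recurse with no depth limit, grab page requisites, rewrite links for offline
# viewing, fix file extensions, and ignore robots.txt
wget -r -l inf -p -k -E -e robots=off -U "Mozilla/5.0" URL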
Offline
I notice now that the website uses something in the URL to bring up page 2, 3, and so on:
http://url.com?2
and
http://url.com?3
and so on. I think that is messing up the programs.
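If the pagination really is just ?2, ?3, and so on, one workaround is to feed those URLs to wget explicitly in a small loop (a sketch; url.com and the page count are placeholders):
# fetch pages 1 through 20 of a site paginated via ?N query strings
for n in $(seq 1 20); do
    wget -p -k -E "http://url.com?${n}"
done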
Offline
wget can do it for you, like this one for google.com:
nohup wget --limit-rate=2000k --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla google.com & tail -f nohup.out
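For what it's worth, the same command with each option spelled out (google.com is just the example target from above; pick your own rate limit and user agent):
# --limit-rate=2000k : throttle the download speed
# --no-clobber       : skip files that already exist locally
# --convert-links    : rewrite links so the copy can be browsed offline
# --random-wait      : randomize the delay between requests
# -r -p -E           : recurse, fetch page requisites, fix file extensions
# -e robots=off      : ignore robots.txt
# -U mozilla         : send a browser-like user agent
nohup wget --limit-rate=2000k --no-clobber --convert-links --random-wait \
    -r -p -E -e robots=off -U mozilla google.com &
tail -f nohup.out    # follow the log while the mirror runs in the background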
Offline