#1 2015-11-29 17:11:56

publicus
Member
Registered: 2014-01-01
Posts: 129

[RESOLVED] Using wget to download website, but links are not being converted

Hello,

I'm using wget to download rosettacode.org.  This is what my syntax looks like:

$ wget -U Mozilla --wait=10 --limit-rate=20K --convert-links -p -m rosettacode.org

The files get downloaded to /home/publicus/Downloads/downloaded_websites/rosettacode.org.  When I open up the index.html, I find that all of the links point to /wiki/SomeArticle.html.  What I don't understand is why the links are not pointing to /home/publicus/Downloads/downloaded_websites/rosettacode.org/wiki/SomeArticle.html.  I really don't want to fix this by hand or write a script to fix it.  Is there a command line argument that I've overlooked?

Last edited by publicus (2015-12-11 02:07:48)


#2 2015-11-30 11:17:05

mhoogland
Member
From: Amsterdam
Registered: 2015-08-09
Posts: 11

Re: [RESOLVED] Using wget to download website, but links are not being converted

Try "wget -x -P foo -U Mozilla --wait=10 --limit-rate=20K --convert-links -p -m rosettacode.org"


#3 2015-11-30 21:57:09

teckk
Member
Registered: 2013-02-21
Posts: 578

Re: [RESOLVED] Using wget to download website, but links are not being converted

rosettacode.org redirects to http://rosettacode.org/wiki/Rosetta_Code

Are you trying to get all of the linked pages on that front page?
I think that you need wget -r with a maximum depth. Or do you want just the hyperlinks?

Just links

lynx -dump -listonly http://rosettacode.org/wiki/Rosetta_Code | grep wiki > file.txt

If you want to save each page to disk, something like this will work:

#! /usr/bin/env bash

cd "$HOME/somewhere" || exit 1

url="http://rosettacode.org/wiki/"

list="Sieve_of_Eratosthenes
Parsing/RPN_to_infix_conversion
Write_language_name_in_3D_ASCII
Polymorphism
Longest_common_subsequence
Rosetta_Code:Add_a_Language
Help:Adding_a_new_programming_example
Rosetta_Code:Village_Pump/Suggest_a_language
Rosetta_Code:Village_Pump/Suggest_a_programming_task
Rosetta_Code:Add_a_Task
Rosetta_Code:Village_Pump/RC_extraction_Tool_and_Task
Rosetta_Code:Village_Pump/Sort_popular_pump_pages
Rosetta_Code:Village_Pump/Phix_geshi_file
Rosetta_Code:Village_Pump/Syntax_highlighting
Rosetta_Code:Village_Pump/SMW_Examples_by_language_and_concept"

#Examples - pick one tool per output format; as written, each later
#command overwrites the previous one's output file.
num=1
for i in $list; do
    #wkhtmltopdf: fetch each page and convert it to .pdf
    wkhtmltopdf "${url}${i}" "${num}.pdf"
    #lynx: fetch each page and convert it to .pdf
    lynx -dump -nolist -nomargins "${url}${i}" | a2ps -B --borders=0 --columns=1 -o - | ps2pdf - "${num}.pdf"

    #lynx: dump each page to .txt
    lynx -dump "${url}${i}" > "${num}.txt"
    #w3m: dump each page to .txt
    w3m -dump -T text/html "${url}${i}" > "${num}.txt"
    #html2text: dump each page to .txt
    html2text "${url}${i}" > "${num}.txt"

    #curl: dump each page to .html
    curl -A "Mozilla/5.0" -Ls "${url}${i}" -o "${num}.html"
    #wget: dump each page to .html
    wget -U "Mozilla/5.0" "${url}${i}" -O "${num}.html"
    #lynx: dump each page to .html
    lynx -dump -source "${url}${i}" > "${num}.html"

    #Pause between downloads so the site doesn't ban you
    sleep 60
    num=$((num + 1))
done
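The hand-typed list above could also be generated by scraping the hrefs out of the front page. A minimal sketch, assuming MediaWiki-style /wiki/ links; the sample tags below stand in for real page HTML:

```shell
#Extract wiki page names from HTML: grep pulls out the href="/wiki/..."
#attributes, sed strips the surrounding text, sort -u deduplicates.
#The printf lines are sample input standing in for a fetched page.
printf '%s\n' '<a href="/wiki/Polymorphism">' '<a href="/wiki/Sieve_of_Eratosthenes">' \
    | grep -o 'href="/wiki/[^"]*"' \
    | sed 's|href="/wiki/||; s|"$||' \
    | sort -u
```

Feed a real page in with curl or wget -O - instead of printf, and redirect the result to a file to build the list.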

If you want to mirror the whole site then you'll need to recursively download to the depth that you wish.


#4 2015-12-03 03:06:19

publicus
Member
Registered: 2014-01-01
Posts: 129

Re: [RESOLVED] Using wget to download website, but links are not being converted

mhoogland wrote:

Try "wget -x -P foo -U Mozilla --wait=10 --limit-rate=20K --convert-links -p -m rosettacode.org"

The only difference I noticed is that the pages were downloaded into a foo directory.

That did not work.


#5 2015-12-03 03:14:32

publicus
Member
Registered: 2014-01-01
Posts: 129

Re: [RESOLVED] Using wget to download website, but links are not being converted

teckk wrote:

rosettacode.org redirects to http://rosettacode.org/wiki/Rosetta_Code

Are you trying to get all of the linked pages on that front page?
I think that you need wget -r with a maximum depth. Or do you want just the hyperlinks?

Just links

lynx -dump -listonly http://rosettacode.org/wiki/Rosetta_Code | grep wiki > file.txt

If you want to save each page to disk something like this.

[script snipped; see post #3]

If you want to mirror the whole site then you'll need to recursively download to the depth that you wish.

The command in the OP already downloads the entire site.  The problem is that when I open the downloaded index.html and mouse over, for example, "Add a Language", I get this link: file:///wiki/Rosetta_Code:Add_a_Language

The problem with this is that the directory /wiki does not exist, so none of the links work from that point on.  What I'd love to see is the links referencing the directory path that the files are actually stored in.  That is where I'm stuck.


#6 2015-12-03 15:47:45

teckk
Member
Registered: 2013-02-21
Posts: 578

Re: [RESOLVED] Using wget to download website, but links are not being converted

There are probably several ways to do that:

wget -k http://rosettacode.org/wiki/Rosetta_Code

That gets the index page with all of the links pointing to their correct URLs.
I then did:

sed -i 's/http:\/\/rosettacode.org\/wiki\//file:\/\/\/download\/dir\//g' Rosetta_Code

Rosetta_Code now has all the links pointing to the download dir you specify.

Before:

head Rosetta_Code -n 116 | tail -n 1
<td><a href="http://rosettacode.org/wiki/Rosetta_Code:Village_Pump/Suggest_a_programming_task" title="Rosetta Code:Village Pump/Suggest a programming task">Village Pump/Suggest a programming task</a></td>

After:

head Rosetta_Code -n 116 | tail -n 1
<td><a href="file:///download/dir/Rosetta_Code:Village_Pump/Suggest_a_programming_task" title="Rosetta Code:Village Pump/Suggest a programming task">Village Pump/Suggest a programming task</a></td>

Change file:\/\/\/download\/dir to wherever you want it.
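Side note: sed accepts any character as the delimiter after s, so the same substitution can be written without escaping every slash. A sketch on a sample anchor tag; file:///download/dir is still a placeholder:

```shell
#Same substitution with | as the sed delimiter, so the slashes in the
#URL and in the file:/// path need no backslash escaping.
echo '<a href="http://rosettacode.org/wiki/Polymorphism">' \
    | sed 's|http://rosettacode.org/wiki/|file:///download/dir/|g'
#prints <a href="file:///download/dir/Polymorphism">
```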


#7 2015-12-09 01:32:33

publicus
Member
Registered: 2014-01-01
Posts: 129

Re: [RESOLVED] Using wget to download website, but links are not being converted

teckk wrote:

There are probably several ways to do that:

wget -k http://rosettacode.org/wiki/Rosetta_Code

Gets the index with all of the links pointing to their correct URL.
I then did:

sed -i 's/http:\/\/rosettacode.org\/wiki\//file:\/\/\/download\/dir\//g' Rosetta_Code

Rosetta_code now has all the links pointing to the download dir you specify.

Before:

head Rosetta_Code -n 116 | tail -n 1
<td><a href="http://rosettacode.org/wiki/Rosetta_Code:Village_Pump/Suggest_a_programming_task" title="Rosetta Code:Village Pump/Suggest a programming task">Village Pump/Suggest a programming task</a></td>

After:

head Rosetta_Code -n 116 | tail -n 1
<td><a href="file:///download/dir/Rosetta_Code:Village_Pump/Suggest_a_programming_task" title="Rosetta Code:Village Pump/Suggest a programming task">Village Pump/Suggest a programming task</a></td>

Change file:\/\/\/download\/dir to wherever you want it.

OK, I think I'm following you.  One question: in your second code snippet, what is "Rosetta_Code" supposed to be?  Is it the directory storing the downloaded data?  If so, when I tried it, this is the output I got:

$ sed -i 's/http:\/\/rosettacode.org\/wiki\//file:\/\/\/home\/publicus\/Downloads\/downloaded_websites\/rosettacode.org2\//g' rosettacode.org2
sed: couldn't edit rosettacode.org2: not a regular file

I'm not really familiar with sed.  I'm going to look at some examples in tutorials a little bit later.


#8 2015-12-09 15:51:48

teckk
Member
Registered: 2013-02-21
Posts: 578

Re: [RESOLVED] Using wget to download website, but links are not being converted

One question, in your second code snippet, what is "Rosetta_Code" supposed to be?

Maybe I should have done that with better syntax.

wget -k http://rosettacode.org/wiki/Rosetta_Code -O myfile

Then use sed or awk on myfile to replace the http:// paths with file:/// paths.

You could do that with one line.

Examples:

wget http://rosettacode.org/wiki/Rosetta_Code -O -| sed 's/\/wiki\//file:\/\/\/download\/dir\//g' > myfile2

Or

curl http://rosettacode.org/wiki/Rosetta_Code -o - | sed 's/\/wiki\//file:\/\/\/download\/dir\//g' > myfile3

Open myfile2 with a web browser. You'll see that all of the links point to a local directory.
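awk, mentioned above as an alternative, can do the same replacement with gsub. A sketch on a sample table cell; file:///download/dir is a placeholder:

```shell
#awk alternative to the sed commands above: gsub replaces every /wiki/
#prefix with the local file:/// path.
echo '<td><a href="/wiki/Polymorphism">Polymorphism</a></td>' \
    | awk '{ gsub(/\/wiki\//, "file:///download/dir/"); print }'
#prints <td><a href="file:///download/dir/Polymorphism">Polymorphism</a></td>
```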

Or just get all of the links that you want from the page and save them however you wish.
Example:

lynx -dump -listonly http://rosettacode.org/wiki/Rosetta_Code | grep "/wiki"

Or

lynx -dump -list_inline http://rosettacode.org/wiki/Rosetta_Code | grep "/wiki"

Or download the pages you want and convert them to small .pdf files, so that you can archive them.

Example for a small simple .pdf:

wkhtmltopdf --no-background --no-images --disable-javascript --disable-plugins \
http://rosettacode.org/wiki/Rosetta_Code:Village_Pump/Suggest_a_programming_task output1.pdf

Make a list of all the links that you want and loop over it.  Better to sleep between iterations so that the site doesn't ban you.

The examples in post 3 give you different options.

If you need more help try:
http://mywiki.wooledge.org/BashFAQ/
http://www.commandlinefu.com/commands/browse
https://bbs.archlinux.org/viewforum.php?id=33
http://stackoverflow.com/

Read the man pages for
wget
curl
lynx
sed
awk

To make a .pdf of a man page

man -t wget | ps2pdf - wget_man.pdf


#9 2015-12-09 18:27:33

Slithery
Administrator
From: Norfolk, UK
Registered: 2013-12-01
Posts: 5,776

Re: [RESOLVED] Using wget to download website, but links are not being converted

If you're not set on using wget then take a look at httrack.

It's as easy as...

httrack rosettacode.org



#10 2015-12-11 02:07:23

publicus
Member
Registered: 2014-01-01
Posts: 129

Re: [RESOLVED] Using wget to download website, but links are not being converted

slithery wrote:

If you're not set on using wget then take a look at httrack.

It's as easy as...

httrack rosettacode.org

Holy smokes!  I just started downloading the site using httrack (not finished yet), but all of the links are set correctly and it seems to be working flawlessly.

Thanks to everyone that has helped me, but httrack is the silver bullet in this case.

