Hello,
I'm using wget in order to download rosettacode.org. This is what my syntax looks like:
$ wget -U Mozilla --wait=10 --limit-rate=20K --convert-links -p -m rosettacode.org
The file gets downloaded to /home/publicus/Downloads/downloaded_websites/rosettacode.org. When I open up the index.html, I find that all of the links point to /wiki/SomeArticle.html. What I don't understand is why the links are not pointing to /home/publicus/Downloads/downloaded_websites/rosettacode.org/wiki/SomeArticle.html. I really don't want to fix this by hand or write a script to fix it. Is there a command-line argument that I've overlooked?
Last edited by publicus (2015-12-11 02:07:48)
Offline
Try "wget -x -P foo -U Mozilla --wait=10 --limit-rate=20K --convert-links -p -m rosettacode.org"
Offline
rosettacode.org redirects to http://rosettacode.org/wiki/Rosetta_Code
Are you trying to get all of the linked pages on that front page?
I think that you need wget -r with a maximum depth. Or do you want just the hyperlinks?
Just links
lynx -dump -listonly http://rosettacode.org/wiki/Rosetta_Code | grep wiki > file.txt
If you want to save each page to disk, something like this:
#! /usr/bin/env bash
cd "$HOME/somewhere" || exit
url="http://rosettacode.org/wiki/"
list="Sieve_of_Eratosthenes
Parsing/RPN_to_infix_conversion
Write_language_name_in_3D_ASCII
Polymorphism
Longest_common_subsequence
Rosetta_Code:Add_a_Language
Help:Adding_a_new_programming_example
Rosetta_Code:Village_Pump/Suggest_a_language
Rosetta_Code:Village_Pump/Suggest_a_programming_task
Rosetta_Code:Add_a_Task
Rosetta_Code:Village_Pump/RC_extraction_Tool_and_Task
Rosetta_Code:Village_Pump/Sort_popular_pump_pages
Rosetta_Code:Village_Pump/Phix_geshi_file
Rosetta_Code:Village_Pump/Syntax_highlighting
Rosetta_Code:Village_Pump/SMW_Examples_by_language_and_concept"
#Examples:
num=1
for i in $list; do
#wkhtml get each page and convert to .pdf
wkhtmltopdf "${url}${i}" ${num}.pdf
#lynx get each page and convert to .pdf
lynx -dump -nolist -nomargins "${url}${i}" | a2ps -B --borders=0 --columns=1 -o - | ps2pdf - ${num}.pdf
#lynx dump each page to .txt
lynx -dump "${url}${i}" > ${num}.txt
#w3m dump each page to .txt
w3m -dump -T text/html "${url}${i}" > ${num}.txt
#html2text dump each page to .txt
html2text "${url}${i}" > ${num}.txt
#curl dump each page to .html
curl -A "Mozilla/5.0" -Ls "${url}${i}" -o - > ${num}.html
#wget dump each page to .html
wget -U "Mozilla/5.0" "${url}${i}" -O - > ${num}.html
#lynx dump each page to .html
lynx -dump -source "${url}${i}" > ${num}.html
#Pause between downloads
sleep 60
num=$(($num + 1))
done
If you want to mirror the whole site then you'll need to recursively download to the depth that you wish.
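As a sketch of that recursive approach (the depth value of 2 and this exact flag combination are assumptions, not something tested against the site), the command can be assembled like this. It is only printed, not executed, so the sketch is safe to run anywhere:

```shell
# Sketch: a depth-limited recursive mirror. depth=2 is an arbitrary
# assumption; raise it to follow links further.
depth=2
cmd=(wget -r --level="$depth" --convert-links -p -U Mozilla \
     --wait=10 --limit-rate=20K http://rosettacode.org/wiki/Rosetta_Code)
# Print the command instead of running it, so nothing is downloaded here.
printf '%s ' "${cmd[@]}"; echo
```

Drop the printf and run `"${cmd[@]}"` directly once the flags look right for your case.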
Offline
Try "wget -x -P foo -U Mozilla --wait=10 --limit-rate=20K --convert-links -p -m rosettacode.org"
The only difference I noticed is that the web page was downloaded into a foo directory.
That did not work.
Offline
The example in the OP already downloads the entire site. The problem is that when I open the downloaded index.html and mouse over -- for example -- "Add a Language", I get this link: file:///wiki/Rosetta_Code:Add_a_Language
The directory /wiki does not exist, so none of the links work from that point on. What I'd love is for each link to reference the directory path where the file is actually stored. That is where I'm stuck.
Offline
There are probably several ways to do that:
wget -k http://rosettacode.org/wiki/Rosetta_Code
This gets the index with all of the links pointing to their correct URL.
I then did:
sed -i 's/http:\/\/rosettacode.org\/wiki\//file:\/\/\/download\/dir\//g' Rosetta_Code
Rosetta_Code now has all the links pointing to the download dir you specify.
Before:
head Rosetta_Code -n 116 | tail -n 1
<td><a href="http://rosettacode.org/wiki/Rosetta_Code:Village_Pump/Suggest_a_programming_task" title="Rosetta Code:Village Pump/Suggest a programming task">Village Pump/Suggest a programming task</a></td>
After:
head Rosetta_Code -n 116 | tail -n 1
<td><a href="file:///download/dir/Rosetta_Code:Village_Pump/Suggest_a_programming_task" title="Rosetta Code:Village Pump/Suggest a programming task">Village Pump/Suggest a programming task</a></td>
Change file:\/\/\/download\/dir to wherever you want it.
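Extending that sed idea to every page in a mirror (the directory name `mirror` and the sample file below are made up for illustration), one pass with find can rewrite all the files at once; using `|` as the sed delimiter avoids escaping every slash:

```shell
# Build a tiny fake mirror so the rewrite can be demonstrated offline.
mkdir -p mirror/wiki
printf '<a href="http://rosettacode.org/wiki/Polymorphism">Polymorphism</a>\n' \
    > mirror/wiki/index.html
# Rewrite the absolute site URLs to local file:// paths in every .html file.
# Uses GNU sed's in-place -i flag.
base='file:///download/dir'
find mirror -name '*.html' -exec sed -i "s|http://rosettacode.org/wiki/|${base}/|g" {} +
cat mirror/wiki/index.html
# -> <a href="file:///download/dir/Polymorphism">Polymorphism</a>
```

Point `base` at your real download directory instead of the placeholder used here.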
Offline
Ok, I think I'm following you. One question, in your second code snippet, what is "Rosetta_Code" supposed to be? Is it the directory storing the downloaded data? If so, when I tried this, this is the output that I got:
$ sed -i 's/http:\/\/rosettacode.org\/wiki\//file:\/\/\/home\/publicus\/Downloads\/downloaded_websites\/rosettacode.org2\//g' rosettacode.org2
sed: couldn't edit rosettacode.org2: not a regular file
I'm not really familiar with sed. I'm going to look at some examples in tutorials a little bit later.
Offline
One question, in your second code snippet, what is "Rosetta_Code" supposed to be?
Maybe I should have done that with better syntax.
wget -k http://rosettacode.org/wiki/Rosetta_Code -O myfile
Then use sed or awk on myfile to replace the http:// paths with file:/// paths.
You could do that with one line.
Examples:
wget http://rosettacode.org/wiki/Rosetta_Code -O - | sed 's/\/wiki\//file:\/\/\/download\/dir\//g' > myfile2
Or
curl http://rosettacode.org/wiki/Rosetta_Code -o - | sed 's/\/wiki\//file:\/\/\/download\/dir\//g' > myfile3
Open myfile2 with a web browser. You'll see that all of the links point to a local directory.
Or just get all of the links that you are wanting from the page and save them how you wish
Example:
lynx -dump -listonly http://rosettacode.org/wiki/Rosetta_Code | grep "/wiki"
Or
lynx -dump -list_inline http://rosettacode.org/wiki/Rosetta_Code | grep "/wiki"
Or download the pages you want and convert them to small .pdf files, so that you can archive them.
Example for a small simple .pdf:
wkhtmltopdf --no-background --no-images --disable-javascript --disable-plugins \
http://rosettacode.org/wiki/Rosetta_Code:Village_Pump/Suggest_a_programming_task output1.pdf
Make a list of all the links that you want and loop over the list. Better to sleep between loops so that the site doesn't ban you.
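That list-and-loop pattern can be sketched like this; `fetch` here is a placeholder for whichever downloader you pick (wget, curl, wkhtmltopdf), so the loop itself runs offline, and the one-second pause stands in for a politer delay against the live site:

```shell
# fetch is a stand-in: swap in wget/curl/wkhtmltopdf for real use.
fetch() { printf 'would fetch %s\n' "$1" > "$2"; }
url="http://rosettacode.org/wiki/"
list="Polymorphism Longest_common_subsequence"
num=1
for page in $list; do
    fetch "${url}${page}" "${num}.txt"
    sleep 1        # use a much longer pause (e.g. 60s) against a live site
    num=$((num + 1))
done
```

Each page lands in a numbered file (1.txt, 2.txt, ...), the same naming scheme the script in post 3 uses.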
The examples in post 3 give you different options.
If you need more help try:
http://mywiki.wooledge.org/BashFAQ/
http://www.commandlinefu.com/commands/browse
https://bbs.archlinux.org/viewforum.php?id=33
http://stackoverflow.com/
Read the man pages for
wget
curl
lynx
sed
awk
To make a .pdf of a man page
man -t wget | ps2pdf - wget_man.pdf
Offline
If you're not set on using wget then take a look at httrack.
It's as easy as...
httrack rosettacode.org
Offline
If you're not set on using wget then take a look at httrack.
It's as easy as...
httrack rosettacode.org
Holy smokes! I just started downloading the site using httrack (not finished yet), but all of the links are set correctly and it seems to be working flawlessly.
Thanks to everyone that has helped me, but httrack is the silver bullet in this case.
Offline