Hello.
I tried recursive downloading with wget, and the first time it worked well. Then I stopped the process and tried again, and now it refuses to work. For example:
$ wget -r https://en.wikipedia.org
--2013-04-25 20:37:59-- https://en.wikipedia.org/
Resolving en.wikipedia.org... [...]
Connecting to en.wikipedia.org|[...]... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://en.wikipedia.org/wiki/Main_Page [following]
--2013-04-25 20:38:00-- https://en.wikipedia.org/wiki/Main_Page
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 63653 (62K) [text/html]
Saving to: ‘en.wikipedia.org/index.html’
100%[=========================================================================================================>] 63 653 289KB/s in 0,2s
2013-04-25 20:38:00 (289 KB/s) - ‘en.wikipedia.org/index.html’ saved [63653/63653]
Loading robots.txt; please ignore errors.
--2013-04-25 20:38:00-- https://en.wikipedia.org/robots.txt
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 28244 (28K) [text/plain]
Saving to: ‘en.wikipedia.org/robots.txt’
100%[=========================================================================================================>] 28 244 --.-K/s in 0,07s
2013-04-25 20:38:00 (379 KB/s) - ‘en.wikipedia.org/robots.txt’ saved [28244/28244]
FINISHED --2013-04-25 20:38:00--
Total wall clock time: 0,8s
Downloaded: 2 files, 90K in 0,3s (312 KB/s)
That was it. The first time it worked and started downloading files recursively, but now it stops right there. Why?
It says the default depth is 5, but maybe something is wonky... try specifying "--recursive --level=N" where N is the depth of directories you desire...
Maybe you should specify what the directory structure looks like, so that an analysis can be made.
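For example, something like this (the 5 here just spells out what should already be the default, use whatever depth you actually want):
$ wget --recursive --level=5 https://en.wikipedia.org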
Edit: Never mind, read the robots.txt file it downloaded for you. It contains this:
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /
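If you still want to mirror it, you'd have to work around that, e.g. by honoring the --wait hint from that comment and telling wget not to apply robots.txt. Something along these lines (untested, and the 2-second wait and the user-agent string are just example values):
$ wget -r --wait=2 -e robots=off -U "Mozilla/5.0" https://en.wikipedia.org/wiki/Main_Page
-e robots=off makes wget ignore the robots.txt it just downloaded, and -U/--user-agent changes the identification string it is being blocked on.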
Last edited by WonderWoofy (2013-04-25 16:51:48)
OK, what about gnu.org? Its robots.txt says:
# robots.txt for http://www.gnu.org/
User-agent: *
Crawl-delay: 4
Disallow: /private/
User-agent: *
Crawl-delay: 4
Disallow: /savannah-checkouts/
So I guess nothing should prevent the download... but all I get is this:
$ wget -r -nd -A jpg,jpeg http://gnu.org
--2013-04-25 21:37:36-- http://gnu.org/
Resolving gnu.org... [...]
Connecting to gnu.org|[...]|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.gnu.org/ [following]
--2013-04-25 21:37:36-- http://www.gnu.org/
Resolving www.gnu.org... [...]
Reusing existing connection to gnu.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
[ <=> ] 18 506 56,6KB/s in 0,3s
2013-04-25 21:37:36 (56,6 KB/s) - ‘index.html’ saved [18506]
Removing index.html since it should be rejected.
FINISHED --2013-04-25 21:37:36--
Total wall clock time: 0,8s
Downloaded: 1 files, 18K in 0,3s (56,6 KB/s)
For some reason I did have successful downloads from gnu.org earlier, but it stops working after one or more attempts.
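Could the 301 redirect from gnu.org to www.gnu.org have something to do with it? As far as I understand, -r does not follow links onto a different host by default, so maybe starting from www.gnu.org directly would behave differently, e.g.:
$ wget -r -nd -A jpg,jpeg http://www.gnu.org/
That's just a guess, though.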