#1 2013-04-25 16:44:19

Mr. Alex
Member
Registered: 2010-08-26
Posts: 623

I could download recursively with wget and now I can't...

Hello.

I tried recursive downloading with wget and it worked well the first time. Then I stopped the process and tried again, and now it refuses to work. For example:

$ wget -r https://en.wikipedia.org         
--2013-04-25 20:37:59--  https://en.wikipedia.org/
Resolving en.wikipedia.org... [...]
Connecting to en.wikipedia.org|[...]... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://en.wikipedia.org/wiki/Main_Page [following]
--2013-04-25 20:38:00--  https://en.wikipedia.org/wiki/Main_Page
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 63653 (62K) [text/html]
Saving to: ‘en.wikipedia.org/index.html’

100%[=========================================================================================================>] 63 653       289KB/s   in 0,2s   

2013-04-25 20:38:00 (289 KB/s) - ‘en.wikipedia.org/index.html’ saved [63653/63653]

Loading robots.txt; please ignore errors.
--2013-04-25 20:38:00--  https://en.wikipedia.org/robots.txt
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 28244 (28K) [text/plain]
Saving to: ‘en.wikipedia.org/robots.txt’

100%[=========================================================================================================>] 28 244      --.-K/s   in 0,07s   

2013-04-25 20:38:00 (379 KB/s) - ‘en.wikipedia.org/robots.txt’ saved [28244/28244]

FINISHED --2013-04-25 20:38:00--
Total wall clock time: 0,8s
Downloaded: 2 files, 90K in 0,3s (312 KB/s)

That was it. The first time, it worked and started downloading files recursively; now it stops right there. Why?


#2 2013-04-25 16:50:56

WonderWoofy
Member
From: Los Gatos, CA
Registered: 2012-05-19
Posts: 8,412

Re: I could download recursively with wget and now I can't...

The man page says the default depth is 5, but maybe something is wonky... try specifying "--recursive --level=N", where N is the recursion depth you want.

Maybe you should specify what the directory structure looks like, so that an analysis can be made.

Edit: Never mind; read the robots.txt file it downloaded for you. It contains this:

en.wikipedia.org/robots.txt wrote:

# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /
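You can check a saved robots.txt for a rule like that yourself before crawling. A rough sketch, using only the excerpt above inlined as a string (it handles just this simple record shape, not the full robots exclusion grammar):

```shell
#!/bin/sh
# Check whether a robots.txt excerpt blocks the "wget" user agent entirely.
robots='# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /'

# After a "User-agent: wget" line, look for "Disallow: /" in that record;
# any other "User-agent:" line ends the record.
blocked=$(printf '%s\n' "$robots" | awk '
    tolower($0) ~ /^user-agent: wget/ { in_wget = 1; next }
    /^[Uu]ser-agent:/                 { in_wget = 0 }
    in_wget && $0 == "Disallow: /"    { print "blocked"; exit }
')
echo "${blocked:-allowed}"
```

This prints "blocked" for the excerpt above. Note that wget itself offers `--wait=N` (the delay the comment asks for) and `-e robots=off` (ignore robots.txt entirely) if you decide the rule doesn't apply to your use.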

Last edited by WonderWoofy (2013-04-25 16:51:48)


#3 2013-04-25 17:42:31

Mr. Alex
Member
Registered: 2010-08-26
Posts: 623

Re: I could download recursively with wget and now I can't...

OK, what about gnu.org? Its robots.txt says:

# robots.txt for http://www.gnu.org/

User-agent: *
Crawl-delay: 4
Disallow: /private/

User-agent: *
Crawl-delay: 4
Disallow: /savannah-checkouts/

So I guess nothing should prevent the download... but all I get is:

$ wget -r -nd -A jpg,jpeg http://gnu.org         
--2013-04-25 21:37:36--  http://gnu.org/
Resolving gnu.org... [...]
Connecting to gnu.org|[...]|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.gnu.org/ [following]
--2013-04-25 21:37:36--  http://www.gnu.org/
Resolving www.gnu.org... [...]
Reusing existing connection to gnu.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

    [  <=>                                                                                                     ] 18 506      56,6KB/s   in 0,3s   

2013-04-25 21:37:36 (56,6 KB/s) - ‘index.html’ saved [18506]

Removing index.html since it should be rejected.

FINISHED --2013-04-25 21:37:36--
Total wall clock time: 0,8s
Downloaded: 1 files, 18K in 0,3s (56,6 KB/s)

For some reason I did get successful downloads from gnu.org at first, but it stops working after one or more attempts.
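Since the quoted robots.txt asks for Crawl-delay: 4, a politer variant of the command above would add --wait=4. A sketch under those assumptions (all flags are standard wget options spelled in long form; --level=2 is an arbitrary example depth, and this is not a confirmed fix for the stall). The command is only echoed here so nothing hits the network:

```shell
#!/bin/sh
# Long-option spelling of "wget -r -nd -A jpg,jpeg", plus the 4-second
# delay that gnu.org's robots.txt requests. --level=2 is an example depth.
cmd="wget --recursive --level=2 --wait=4 --no-directories --accept jpg,jpeg http://www.gnu.org/"
echo "$cmd"
# Run "$cmd" directly once you are happy with the flags.
```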

