#1 2012-09-15 09:08:59

rodyaj
Member
Registered: 2009-10-13
Posts: 54

wget as a web spider

I've been trying to test a list of URLs for dead hosts. Although there are existing programs to do this, such as linkchecker, I want to make something faster and specific only to checking for dead root domains. I'm not concerned with recursing into the links; I only need to check if the domains are dead.

This is what I use:

cat url-list | parallel -j 100 "wget --spider --header='User-Agent: Mozilla/5.0' -a logfile {}"

It appears to work ok, although the parallel jobs seem to hang near the end. Am I implementing parallel incorrectly?
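
One thing I may try (the specific values here are guesses, not a confirmed fix) is capping wget's timeouts and retries so a slow or unresponsive host can't keep a job open, and asking parallel to record each job's exit status so I can see which ones hang:

cat url-list | parallel -j 100 --joblog jobs.log "wget --spider --timeout=10 --tries=1 --header='User-Agent: Mozilla/5.0' -a logfile {}"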

I grep the log looking for errors such as:

"Unable to resolve host address"

cat logfile | grep -i "unable to resolve host address" | grep -C 1 www. | grep -o "www.*" | sed -e "s/www.//" -e "s/.$//"

"Name or service not known"

cat logfile | grep -i "Name or service not known" | grep -C 1 www. | grep -o "www.*"| sed -e 's/|.*\|\.\.\..*//' -e 's/www.//'

"No address associated with hostname"

cat logfile | grep -i "No address associated with hostname." | grep -C 1 www. | grep -o "www.*"| sed -e 's/|.*\|\.\.\..*//' -e 's/www.//'
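
For reference, a combined pass over the log (the same three messages in one grep, with the dots escaped so they only match literal dots; this assumes the hostname appears on the matching line, as in the pipelines above) might look like:

grep -Ei 'unable to resolve host address|name or service not known|no address associated with hostname' logfile | grep -o 'www\..*' | sed -e 's/^www\.//' -e 's/[^[:alnum:]]*$//'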

I'm unsure, however, which of these errors actually corresponds to a domain that is completely dead. To reiterate: dead domains are all I am concerned with. Sadly, when I manually test the links that wget reports as dead in a browser, a large majority of them are actually working, which makes my efforts useless so far. I'm wondering if wget is just unreliable as a spider and whether I should switch to something else, or perhaps I just need to tweak timeout values and other parameters. Any ideas?

Last edited by rodyaj (2012-09-15 09:12:43)

Offline

#2 2012-09-16 05:46:26

rockin turtle
Member
From: Montana, USA
Registered: 2009-10-22
Posts: 227

Re: wget as a web spider

The failures you show appear to be DNS failures (hostname lookup failures), which can happen if your name server has a problem. It doesn't necessarily mean the site is gone.

You could probably use 'host' instead of 'wget' to see if the lookups would be successful.
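
Something along these lines might do it (an untested sketch; it assumes each line of url-list is either a bare hostname or a URL whose scheme and path can be stripped off first):

while read -r url; do
    h=${url#*://}    # drop the scheme, if any
    h=${h%%/*}       # drop any path
    host "$h" >/dev/null 2>&1 || echo "$h"
done < url-list > dead-hosts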

I haven't used parallel so I can't help you with that.

Offline

#3 2012-09-16 12:39:39

rodyaj
Member
Registered: 2009-10-13
Posts: 54

Re: wget as a web spider

Thanks rockin turtle. Looks exactly like what I need.

Offline
