I've been trying to test a list of URLs for dead hosts. Although there are existing programs to do this, such as linkchecker, I want to make something faster and specific only to checking for dead root domains. I'm not concerned with recursing into the links; I only need to check if the domains are dead.
This is what I use:
cat url-list | parallel -j 100 "wget --spider --header='User-Agent: Mozilla/5.0' -a logfile {}"
It appears to work OK, although the parallel jobs seem to hang near the end. Am I using parallel incorrectly?
I grep the log looking for errors such as:
"Unable to resolve host address"
cat logfile | grep -i "unable to resolve host address" | grep -C 1 www. | grep -o "www.*" | sed -e "s/www.//" -e "s/.$//"
"Name or service not known"
cat logfile | grep -i "Name or service not known" | grep -C 1 www. | grep -o "www.*"| sed -e 's/|.*\|\.\.\..*//' -e 's/www.//'
"No address associated with hostname"
cat logfile | grep -i "No address associated with hostname." | grep -C 1 www. | grep -o "www.*"| sed -e 's/|.*\|\.\.\..*//' -e 's/www.//'
I'm unsure, however, which of these errors corresponds to a domain that is actually completely dead. To reiterate: dead domains are all I am concerned with. Unfortunately, when I manually test in a browser the links that wget reports as dead, the large majority of them are actually working, which makes my efforts useless so far. I'm wondering if wget is just unreliable as a spider and whether I should switch to something else. Or perhaps I just need to adjust the timeout values and other parameters. Any ideas?
Last edited by rodyaj (2012-09-15 09:12:43)
The failures you show appear to be DNS failures (host name lookup failures), which can happen if your name server has a problem. They don't necessarily mean the site is gone.
You could probably use 'host' instead of 'wget' to see whether the lookups would succeed.
I haven't used parallel so I can't help you with that.
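A rough sketch of the 'host' idea, in case it helps. It assumes the list has one URL per line and that 'host' exits non-zero when a lookup fails. The function names and the RESOLVER variable are made up for illustration (RESOLVER only exists so the lookup command can be swapped out). Treat a failed lookup as "possibly dead" rather than proof, since a flaky name server gives the same result.

```shell
#!/bin/sh
# Sketch only: url_to_host, resolves, check_urls and RESOLVER are
# hypothetical names, not part of wget, host, or parallel.

# Reduce a URL to its bare host name: drop the scheme, then the
# path/query/fragment, then any user:pass@ prefix, then a :port suffix.
url_to_host() {
    printf '%s\n' "$1" | sed \
        -e 's|^[A-Za-z][A-Za-z0-9+.-]*://||' \
        -e 's|[/?#].*$||' \
        -e 's|^.*@||' \
        -e 's|:[0-9]*$||'
}

# Succeed if the name resolves. Uses 'host' unless RESOLVER names
# another command (assumption: the command exits non-zero on failure).
resolves() {
    ${RESOLVER:-host} "$1" >/dev/null 2>&1
}

# Read URLs on stdin, print "ok <host>" or "dead <host>" for each.
check_urls() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        h=$(url_to_host "$url")
        if resolves "$h"; then
            printf 'ok %s\n' "$h"
        else
            printf 'dead %s\n' "$h"
        fi
    done
}
```

After sourcing the file, something like `check_urls < url-list | grep '^dead '` would list the names that failed to resolve. Re-running the failures later, or against a different name server, should weed out temporary DNS problems.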
Thanks rockin turtle. Looks exactly like what I need.