So I'm doing some SEO work. I've been tasked with checking a couple hundred thousand backlinks for quality, and I found myself spending a lot of time navigating to sites that no longer exist. I figured I would make a bash script that checks whether the links are active first. The problem is my script is slower than evolution. I'm no bash guru or anything, so I figured I would see if there are some optimizations you folks can think of. Here is what I am working with:
#!/bin/bash
while read line
do
    # pull page source and grep for domain
    curl -s "$line" | grep "example.com"
    if [[ $? -eq 0 ]]
    then
        echo \"$line\",\"link active\" >> csv.csv
    else
        echo \"$line\",\"REMOVED\" >> csv.csv
    fi
done < <(cat links.txt)
Can you guys think of another way of doing this that might be quicker? I realize the bottleneck is curl (as well as the speed of the remote server/dns servers) and that there isn't really a way around that. Is there another tool or technique I could use within my script to speed up this process?
I will still have to go through the active links one by one and analyze them by hand, but I don't think there is a good way of doing that programmatically that wouldn't consume more time than doing it by hand (if it's even possible).
Thanks
Offline
I had a pretty severe cringe from reading "< <(cat links.txt)", which can simply be "< links.txt". You're cat'ing a file to produce the output of a process, then using process substitution to make the output of that process look like a file.
But - the nails on the chalkboard effect aside - this would not speed up the script. Bash just seems the wrong tool for this job.
At the very least, if it must be bash, you'd probably want to multi-task it. Make a function to grab and scan the website content, then loop through the input, calling and backgrounding that function for each entry. This will spawn *many* bash processes, which could actually do more harm than good, but in almost any other language such a multi-threaded approach would be helpful.
Last edited by Trilby (2013-04-12 23:09:20)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
I would do this in Python or something very similar with threading support.
Create n worker threads that read from a central input queue, query the URL, and append the results to a central output queue, where n is chosen as a function of performance (I'd start with 100).
Load the input queue with URLs from a file, then just read the output queue until it's done and update the file as results come in.
If you just want to check link status then you could use HEAD requests to avoid retrieving the full HTML body, which would also speed things up.
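Something along these lines would do the HEAD-only check with urllib (an untested sketch; the URL is just a placeholder):
#!/usr/bin/env python3
# Minimal HEAD-request sketch: ask the server for headers only, never the body.
from urllib.request import Request, urlopen
from urllib.error import URLError

url = 'http://example.com/some-page'
try:
    with urlopen(Request(url, method='HEAD')) as response:
        print(url, response.status)
except URLError as e:
    print(url, 'failed:', e.reason)
Keep in mind that a few servers don't handle HEAD well, so the occasional fallback to a normal GET may be needed.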
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Sorry about the nails on the chalkboard. In a previous iteration of the script I was using awk to grab fields from a different csv, until I realized that it was probably just better to go with a regular text file. It would appear I inadvertently defeated the purpose of doing that.
Can you recommend a language for the job? I've got some experience with C and C++, but I've never really worked with multi-threading; I've just made some simple 2D games. I may not be able to pull it off with my current skill level, but it could be a fun weekend project. It would be interesting to see if I could rig up 4-5 VPSes to split up the work... Now I'm just dreaming.
Offline
I would do this in Python or something very similar with threading support.
Create n worker threads that read from a central input queue, query the URL, and append the results to a central output queue, where n is chosen as a function of performance (I'd start with 100).
Load the input queue with URLs from a file, then just read the output queue until it's done and update the file as results come in.
This is a really great idea. I've been looking for an excuse to learn some python.
If you just want to check link status then you could use HEAD requests to avoid retrieving the full HTML body, which would also speed things up.
HEAD requests wouldn't work in my case, as I'm checking the page source for the presence of a link to our site. If one is detected, I have to manually check each site for quality to find out whether I need to request that the link be removed.
Thanks a lot for this response. Very helpful.
Offline
Since starting this thread my VPS has burned through ~4000 of them (out of my current batch of 60K). If I leave it going, hopefully it will be done by Monday. Maybe that will give me enough time to write something that isn't so slow.
I'm taking a look at some Python tutorials now. Hopefully I can get something going with it. We'll see...
EDIT: Sorry about that... I probably should have added those comments as edits.
Last edited by instantaphex (2013-04-12 23:49:30)
Offline
Here's a working example of what I proposed above:
#!/usr/bin/env python3

from threading import Thread
from queue import Queue
from urllib.request import Request, urlopen
from urllib.error import URLError
import csv

NUMBER_OF_WORKERS = 100
GOOD_STATUS = 'link active'
BAD_STATUS = 'REMOVED'
INPUT_FILE = 'links.txt'
OUTPUT_FILE = 'csv.csv'
DOMAIN = 'example.com'

def worker(input_queue, output_queue):
    while True:
        url = input_queue.get()
        # None indicates the end of the input queue.
        if url is None:
            input_queue.task_done()
            input_queue.put(None)
            break
        # request = Request(url, method='HEAD')
        request = Request(url)
        try:
            with urlopen(request) as u:
                if DOMAIN in u.read().decode():
                    status = GOOD_STATUS
                else:
                    status = BAD_STATUS
        except URLError:
            status = BAD_STATUS
        output_queue.put((url, status))
        input_queue.task_done()

input_queue = Queue()
output_queue = Queue()

# Start the daemonized worker threads.
for i in range(NUMBER_OF_WORKERS):
    t = Thread(target=worker, args=(input_queue, output_queue))
    t.daemon = True
    t.start()

# Load the input queue with URLs from the input file.
number_of_urls = 0
with open(INPUT_FILE, 'r') as f:
    for line in f:
        input_queue.put(line.strip())
        number_of_urls += 1

# Indicate the end.
input_queue.put(None)

# Write one CSV row per URL as the results come in.
with open(OUTPUT_FILE, 'a') as f:
    c = csv.writer(f, delimiter=',', quotechar='"')
    for i in range(number_of_urls):
        url, status = output_queue.get()
        print('{}\n {}'.format(url, status))
        c.writerow((url, status))
        output_queue.task_done()

# Remove the final None so that input_queue.join() can complete.
input_queue.get()
input_queue.task_done()

input_queue.join()
output_queue.join()
Note that the script will append to the output file rather than overwrite it. Change the 'a' to 'w' if you want to truncate it first.
You may also want to distinguish between bad links on retrieved pages and URL errors due to failed retrievals (e.g. to check them again later). Just add a failure message in the URLError exception block.
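For example, the try/except in the worker could be fleshed out like this (just a sketch using the same names as the script above; FAILED_STATUS is a hypothetical extra label, not part of the original):
from urllib.error import HTTPError, URLError

FAILED_STATUS = 'RETRIEVAL FAILED'   # hypothetical label for pages that could not be fetched

try:
    with urlopen(request) as u:
        if DOMAIN in u.read().decode():
            status = GOOD_STATUS
        else:
            status = BAD_STATUS
except HTTPError as e:
    # The server answered, but with an error code (404, 500, ...).
    status = '{} (HTTP {})'.format(FAILED_STATUS, e.code)
except URLError as e:
    # DNS failure, refused connection, etc. -- these may be worth rechecking later.
    status = '{} ({})'.format(FAILED_STATUS, e.reason)
That makes it easy to grep the output CSV later for rows that should be rechecked.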
You can do the following as learning exercises:
use the argparse module to make the hard-coded variables at the top configurable
check for specific HTTP status codes and other errors in the URLError block and handle them accordingly, e.g. by requeuing URLs at the end of the input queue when you encounter transient errors.
Ask if you have any questions.
Last edited by Xyne (2013-04-15 21:21:53)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Wow... I can't thank you enough! This is going to be a huge help! While you wrote that whole script, I only got as far as a Python script that grabs a URL and prints the source to STDOUT. I'm definitely going to dig into this and see what I can learn.
Offline
I recently started a Python project called lnkckr to help me find broken links in my blog posts. It can also extract links from an HTML file or a URL, or read a plain text file with a list of URLs. It's threaded, follows 3xx redirects and checks the final URL, and also checks whether the #fragment actually exists in the HTML, which I found to be as bad a problem as 404 links, so I added that check.
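The general idea behind the fragment check is just to fetch the page and look for an element whose id (or name) matches the fragment. A minimal sketch of that idea (this is not lnkckr's actual code; the function name, URL, and crude regex are only illustrative):
#!/usr/bin/env python3
# Rough sketch of a #fragment check: fetch the page and look for a matching
# id= or name= attribute in the HTML.
import re
from urllib.request import urlopen

def fragment_exists(url_with_fragment):
    url, _, fragment = url_with_fragment.partition('#')
    if not fragment:
        return True
    html = urlopen(url).read().decode(errors='replace')
    pattern = r'''(?:id|name)\s*=\s*["']{}["']'''.format(re.escape(fragment))
    return re.search(pattern, html) is not None

print(fragment_exists('http://example.com/docs#install'))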
The same repo also includes an old Bash script, linkckr.sh, similar to yours, but a little faster because it only asks for the headers and sets a timeout. There are quite a few slow servers out there that take forever to respond for whatever reason.
Anyway, hope this helps.
Offline
I'm getting a ton of these errors:
Exception in thread Thread-56:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/threading.py", line 639
, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/threading.py", line 596
, in run
self._target(*self._args, **self._kwargs)
File "./test.py", line 31, in worker
if DOMAIN in u.read():
TypeError: Type str doesn't support the buffer API
Any ideas?
Offline
I quickly converted the script from checking HEAD requests to checking the content before I posted it and completely forgot about encoding. "urlopen" returns a handle to the stream. Reading from the stream returns bytestrings (strings of raw bytes) instead of regular character strings, but it has to be the same type as the query string. I have updated my code above to decode the bytestrings. Note that you could go the other way and convert the query string to a bytestring. This would likely be a little faster because it would not have to decode the page content, but the bottleneck is almost certainly the bandwidth so it is probably negligible.
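For example (sketch only, using the names from the script above; domain_bytes is just a name I made up here):
# Encode the query string once and compare raw bytes, so the page never needs decoding.
domain_bytes = DOMAIN.encode()

with urlopen(request) as u:
    if domain_bytes in u.read():   # bytes-in-bytes membership test
        status = GOOD_STATUS
    else:
        status = BAD_STATUS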
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
I read somewhere that urlopen may be returning utf-8 encoded source so I changed line 31 from:
if DOMAIN in u.read():
to:
if DOMAIN in u.read().decode("utf-8")
and now some of the URLs are spitting out this error:
Exception in thread Thread-68:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/threading.py", line 639, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/threading.py", line 596, in run
self._target(*self._args, **self._kwargs)
File "./test.py", line 31, in worker
if DOMAIN in u.read().decode("utf-8"):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 7042: invalid start byte
It looks like only some of them need to be decoded. How can I detect which need to be decoded and which don't?
EDIT: So I've now typecast u.read() to a string:
if DOMAIN in str(u.read()):
It seems to be working alright at this point. I'll update with the progress.
Last edited by instantaphex (2013-04-15 22:10:30)
Offline
I just saw that you posted this explanation. I was getting the same error with u.read().decode() as I was with u.read().decode("utf-8"). I suppose it decodes as utf-8 by default? Either way, neither quite worked. Can you see any potential issues with typecasting to a string?
Offline
I know it's been a while, but I've found myself in need of this yet again. I've modified Xyne's script a bit to work more consistently with my data. The problem I'm running into now is that urllib doesn't accept IRIs. No surprise there, I suppose, as it's urllib and not irilib. Does anyone know of any libraries that will convert an IRI to a URI? Here is the code I am working with:
#!/usr/bin/env python3

from threading import Thread
from queue import Queue
from urllib.request import Request, urlopen
from urllib.error import URLError
import csv
import sys
import argparse

parser = argparse.ArgumentParser(description='Check list of URLs for existence of link in html')
parser.add_argument('-d', '--domain', help='The domain you would like to search for a link to', required=True)
parser.add_argument('-i', '--input', help='Text file with list of URLs to check', required=True)
parser.add_argument('-o', '--output', help='Name of csv to output results to', required=True)
parser.add_argument('-v', '--verbose', help='Display URLs and statuses in the terminal', required=False, action='store_true')

ARGS = vars(parser.parse_args())
INFILE = ARGS['input']
OUTFILE = ARGS['output']
DOMAIN = ARGS['domain']
REMOVED = 'REMOVED'
EXISTS = 'EXISTS'
NUMBER_OF_WORKERS = 50

# Workers
def worker(input_queue, output_queue):
    while True:
        url = input_queue.get()
        if url is None:
            input_queue.task_done()
            input_queue.put(None)
            break
        request = Request(url)
        try:
            response = urlopen(request)
            html = str(response.read())
            if DOMAIN in html:
                status = EXISTS
            else:
                status = REMOVED
        except URLError:
            status = REMOVED
        output_queue.put((url, status))
        input_queue.task_done()

# Queues
input_queue = Queue()
output_queue = Queue()

# Create threads
for i in range(NUMBER_OF_WORKERS):
    t = Thread(target=worker, args=(input_queue, output_queue))
    t.daemon = True
    t.start()

# Populate input queue
number_of_urls = 0
with open(INFILE, 'r') as f:
    for line in f:
        input_queue.put(line.strip())
        number_of_urls += 1
input_queue.put(None)

# Write URL and status to csv file
with open(OUTFILE, 'a') as f:
    c = csv.writer(f, delimiter=',', quotechar='"')
    for i in range(number_of_urls):
        url, status = output_queue.get()
        if ARGS['verbose']:
            print('{}\n {}'.format(url, status))
        c.writerow((url, status))
        output_queue.task_done()

input_queue.get()
input_queue.task_done()
input_queue.join()
output_queue.join()
The problem seems to be when I use urlopen
response = urlopen(request)
with a URL like http://www.rafo.co.il/בר-פאלי-p1
urlopen fails with an error like this:
Exception in thread Thread-19:
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/threading.py", line 639, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/threading.py", line 596, in run
self._target(*self._args, **self._kwargs)
File "./linkcheck.py", line 35, in worker
response = urlopen(request)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 473, in open
response = self._open(req, data)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 491, in _open
'_open', req)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 451, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 1272, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/urllib/request.py", line 1252, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 1049, in request
self._send_request(method, url, body, headers)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 1077, in _send_request
self.putrequest(method, url, **skips)
File "/usr/local/Cellar/python3/3.3.0/Frameworks/Python.framework/Versions/3.3/lib/python3.3/http/client.py", line 941, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-8: ordinal not in range(128)
I'm not too familiar with how character encoding works, so I'm not sure where to start. What would be a quick and dirty way (if one exists) to get URLs like this to play nicely with Python's urlopen?
Offline
I'm on my way out right now so I can't test anything, but I would check that the input file is in UTF-8 and then check what's actually being read and queued here:
with open(INFILE, 'r') as f:
    for line in f:
        input_queue.put(line.strip())
        number_of_urls += 1
You may need to play with the "encode" and "decode" methods. I'll try to remember to take a look at this later when I have some time. It would be useful if you could post an example input file as well. A direct link to the file would be best so that others can check the actual encoding. You don't need to include all of the URLs, just a few that cause the error.
I haven't tested this in the script itself, but percent-encoding the URL avoids the UnicodeEncodeError with the example URL that you have given:
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-

import urllib.request
import urllib.parse

def url_sanitize(url):
    # Percent-encode each component of the parsed URL, then reassemble it.
    parsed = urllib.parse.urlparse(url)
    return urllib.parse.urlunparse(urllib.parse.quote(x) for x in parsed)

url = 'http://www.rafo.co.il/בר-פאלי-p1'
with urllib.request.urlopen(url_sanitize(url)) as f:
    print(f.read().decode())
Last edited by Xyne (2013-11-17 16:37:36)
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Offline
Hey, thanks man. You've really got me interested in Python. Rarely have I seen so few lines of code do so much. That urlparse/urlunparse combination is going to be a huge help. Passing a generator expression into urlunparse() had me scratching my head for a second, but that's pretty cool.
Offline