
#1 2011-05-18 02:08:25

synorgy
Member
From: $HOME
Registered: 2005-07-11
Posts: 272
Website

Python, ThreadPools and Reactors - Oh My!

Okay - so I'm writing a web scraper in Python, using httplib2 and lxml (yes, I know I could be using scrapy - let's move past that...). The scraper has about 15,000 pages to parse into approximately 400,000 items. I've got the code that parses the items running almost instantaneously, but the portion that downloads the pages from the server is still extremely slow, and I'd like to overcome that through concurrency. However, I can't rely on EVERY page needing to be parsed EVERY time. I tried a single ThreadPool (like multiprocessing.Pool, but backed by threads - which should be fine since this is an I/O-bound process), but I couldn't think of a graceful (or working) way of getting ALL of the threads to stop once the item being processed was older than the date of the last indexed item.

Right now, I'm working on a method using two instances of ThreadPool - one to download each page, and another to parse the pages. A simplified code example is:

#! /usr/bin/env python2

import httplib2
from Queue import PriorityQueue
from multiprocessing.pool import ThreadPool
from lxml.html import fromstring

pages = range(1000)
page_queue = PriorityQueue(1000)

url = "http://www.google.com"

def get_page(page):
    #Grabs google.com
    h = httplib2.Http(".cache")
    resp, content = h.request(url, "GET")
    tree = fromstring(str(content), base_url=url)
    page_queue.put((page, tree))

def parse_page():
    page_num, page = page_queue.get()
    print "Parsing page #" + str(page_num)
    print page_queue.qsize()
    #do more stuff with the page here
    page_queue.task_done()

if __name__ == "__main__":
    # Intended flow: collect_pool fills page_queue with downloaded pages
    # while parse_pool drains the queue and parses them.
    collect_pool = ThreadPool()
    collect_pool.map_async(get_page, pages)
    collect_pool.close()

    parse_pool = ThreadPool()
    parse_pool.apply_async(parse_page)
    parse_pool.close()

    parse_pool.join()
    collect_pool.join()
    page_queue.join()

Running this code, however, doesn't do what I expect - which is to fire off two thread pools, one populating a queue and another pulling from it to parse. It begins the collect_pool and runs through it, and only then begins the parse_pool (I assume - I've not let the code run long enough to get to the parse_pool; the point is that the collect_pool is all that seems to be running). I'm fairly sure I've messed something up with the order of the calls to join(), but I can't for the life of me figure out what order they're supposed to be in.

My question is essentially this: am I barking up the right tree here? And if so, what the hell am I doing wrong? If I'm not, what would your suggestions be?


"Unix is basically a simple operating system, but you have to be a genius to understand the simplicity." (Dennis Ritchie)

Offline

#2 2011-05-20 00:22:16

synorgy
Member
From: $HOME
Registered: 2005-07-11
Posts: 272
Website

Re: Python, ThreadPools and Reactors - Oh My!

Well, I've discovered part of my problem: map_async blocks until it's finished processing. That means I need to slightly restructure things, but it doesn't answer whether this is a dumb way to do things.
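Roughly the restructuring I have in mind is sketched below (untested, and it builds on the code from my first post - get_page, pages and page_queue are the same objects; parse_pages, NUM_PARSERS and SENTINEL are just names I'm making up here): the parse workers loop on the queue until they see a sentinel, and the pools get joined in download -> sentinel -> parse order.

NUM_PARSERS = 4
SENTINEL = (float("inf"), None)          # sorts after every real page number

def parse_pages():
    # Worker loop: keep pulling pages until a sentinel says we're done.
    while True:
        page_num, tree = page_queue.get()
        if tree is None:                 # sentinel: no more pages are coming
            break
        print "Parsing page #" + str(page_num)
        #do more stuff with tree here

if __name__ == "__main__":
    collect_pool = ThreadPool()
    for page in pages:
        collect_pool.apply_async(get_page, (page,))
    collect_pool.close()

    parse_pool = ThreadPool(NUM_PARSERS)
    for _ in range(NUM_PARSERS):
        parse_pool.apply_async(parse_pages)
    parse_pool.close()

    collect_pool.join()                  # every download has finished here
    for _ in range(NUM_PARSERS):
        page_queue.put(SENTINEL)         # one sentinel per parse worker
    parse_pool.join()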


"Unix is basically a simple operating system, but you have to be a genius to understand the simplicity." (Dennis Ritchie)

Offline

#3 2011-05-20 01:46:15

cactus
Taco Eater
From: t͈̫̹ͨa͖͕͎̱͈ͨ͆ć̥̖̝o̫̫̼s͈̭̱̞͍̃!̰
Registered: 2004-05-25
Posts: 4,622
Website

Re: Python, ThreadPools and Reactors - Oh My!

Using threads in Python isn't really going to help you much.
My recommendation is to either use multiprocessing, use separate 'duty-specific' instances (a controller, separately spawned worker instances, and a job queue), or look at something like gevent (async/event-based I/O) so you can fetch pages without blocking.
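For the gevent route, the general shape would be something like the sketch below (rough and untested - the URLs are placeholders, and it assumes gevent's monkey-patching so urllib2 yields to other greenlets while waiting on the network):

#!/usr/bin/env python2
from gevent import monkey
monkey.patch_all()                  # patch sockets before importing urllib2

import urllib2
from gevent.pool import Pool

def fetch(url):
    body = urllib2.urlopen(url).read()
    # parse the page / push it onto a queue here
    return url, len(body)

urls = ["http://example.com/page/%d" % i for i in range(1, 101)]  # placeholders

pool = Pool(20)                     # at most 20 requests in flight at once
for url, size in pool.map(fetch, urls):
    print url, size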


"Be conservative in what you send; be liberal in what you accept." -- Postel's Law
"tacos" -- Cactus' Law
"t̥͍͎̪̪͗a̴̻̩͈͚ͨc̠o̩̙͈ͫͅs͙͎̙͊ ͔͇̫̜t͎̳̀a̜̞̗ͩc̗͍͚o̲̯̿s̖̣̤̙͌ ̖̜̈ț̰̫͓ạ̪͖̳c̲͎͕̰̯̃̈o͉ͅs̪ͪ ̜̻̖̜͕" -- -̖͚̫̙̓-̺̠͇ͤ̃ ̜̪̜ͯZ͔̗̭̞ͪA̝͈̙͖̩L͉̠̺͓G̙̞̦͖O̳̗͍

Offline

#4 2011-05-20 20:52:10

aspidites
Member
Registered: 2011-03-11
Posts: 30

Re: Python, ThreadPools and Reactors - Oh My!

I'm not an expert at this by any means, just curious: Rather than process each 'task' separately (which does make sense, btw), how does the script perform if you run the tasks together, but call them asynchronously?

That is, instead of the three join() calls, run the two tasks sequentially against a single thread pool:

pool = Pool(processes=4)
pool.apply_async(get_page, [pages])
pool.apply_async(parse_page)

Or does the second apply_async call not wait for the first to finish before executing? I'm not currently able to test this.

Again, this is more so my curiosity speaking, not a sound piece of advice.

Last edited by aspidites (2011-05-20 20:52:50)


coder formerly known as EnvoyRising

Offline

#5 2011-05-21 21:50:12

synorgy
Member
From: $HOME
Registered: 2005-07-11
Posts: 272
Website

Re: Python, ThreadPools and Reactors - Oh My!

1) apply_async doesn't block at all (you'd have to call pool.join() later to wait for it to process all of the pages)
2) apply_async only takes one set of arguments - which is why I'm using map_async (which, oddly enough, DOES block until processing is done)
3) I can't do that, because I want to STOP processing EVERYTHING when I encounter a page that meets a set criterion (a date older than my last-updated date) - that way I don't process everything.
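To make point 3 concrete, the kind of cut-off I mean is a shared flag that every worker checks - illustrative sketch only; stop_event, last_updated, page_date and process_index_page are made-up names:

import threading
from datetime import datetime, timedelta

stop_event = threading.Event()                      # shared "stop everything" flag
last_updated = datetime.now() - timedelta(days=7)   # placeholder cut-off date

def process_index_page(page_num, page_date, tree):
    # Bail out early once any worker has seen an item older than the cut-off.
    if stop_event.is_set():
        return
    if page_date < last_updated:
        stop_event.set()                            # tell every other worker to stop
        return
    # ...normal parsing of tree would go here...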

WRT threads not being right for this: personally, I believe that threads are well suited to this task. The whole process is I/O-bound, meaning we spend most of our time waiting outside the interpreter, so the GIL never really comes into play. The processing I need to do on the items is extremely minimal and finishes nearly instantaneously, which argues further against the GIL becoming the problem. Processes feel like too much overhead (since they each get their own stack etc.), and a reactor-based system makes sense but looks too difficult for me to want to implement, and in this situation it makes no more sense to me than threads do.


What I'm considering trying now: aspidites' idea of reusing the same thread pool. I'm not sure I can use it, because the network requests take so much time - even concurrently. (I DO have 15000 or so requests to make >_>)

...and just using scrapy.
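(If I do fall back to scrapy, the minimal shape would be something like the sketch below - untested, written against the old BaseSpider API, with a placeholder class name and placeholder start_urls:)

from scrapy.spider import BaseSpider

class IndexSpider(BaseSpider):
    name = "index"
    start_urls = ["http://example.com/index/%d" % i for i in range(1, 11)]

    def parse(self, response):
        # scrapy downloads start_urls concurrently and calls parse() with
        # each response; the item extraction would go here.
        self.log("fetched %s (%d bytes)" % (response.url, len(response.body)))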


"Unix is basically a simple operating system, but you have to be a genius to understand the simplicity." (Dennis Ritchie)

Offline

#6 2011-09-09 02:51:11

sederek
Member
Registered: 2010-11-25
Posts: 10

Re: Python, ThreadPools and Reactors - Oh My!

Hello friend, this may be a bit late but I have a suggestion. Since waiting for HTTP responses makes your application I/O bound, you have to use that time wisely. I have tried several solutions - threads, processes - and finally came upon Twisted! I think a workflow like this should make your application robust:
                               
                                 
Module using Twisted --> 1st Queue --> Module to process received responses --> 2nd Queue

Twisted is an async library which has basic HTTP client capabilities and the ability to limit concurrent requests. You can feed a queue object with the received responses while, at the same time, worker threads consume responses from that queue, process them, and send the final results to a 2nd queue, which you output results from.
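A very small sketch of the fetching side is below (not from my actual code - the URLs are placeholders and the queue/worker half is left out); getPage hands each response to a callback, and the DeferredSemaphore caps how many requests are in flight at once:

#!/usr/bin/env python2
from twisted.internet import reactor, defer
from twisted.web.client import getPage

urls = ["http://example.com/page/%d" % i for i in range(1, 101)]  # placeholders
semaphore = defer.DeferredSemaphore(10)       # at most 10 concurrent requests

def handle_response(content, url):
    # This is where the response would be pushed onto the 1st queue
    # for the worker threads to consume.
    print url, len(content)

deferreds = []
for url in urls:
    d = semaphore.run(getPage, url)           # getPage returns a Deferred
    d.addCallback(handle_response, url)
    deferreds.append(d)

defer.DeferredList(deferreds).addBoth(lambda _: reactor.stop())
reactor.run()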

Like I said, I tried httplib2, the native urllib, pycurl, and curl called via the subprocess module - with Twisted I now have a rock-solid request/response system, and since my application didn't need much processing of the received data, I handled it in the callback Twisted fires when a request completes.

Last edited by sederek (2011-09-09 02:52:09)

Offline
