Parsing HTML from search engines for valid search results

xen · 2008-11-19 13:59:06

Hey fellow archers!

Basically, I have 4 large text buffers full of HTML, each from a different search engine (Google, Yahoo, Ask and AltaVista respectively). The buffers were filled using sockets and HTTP requests, but you can check the contents of these buffers by doing a search on any of the engines and viewing the page source; it's the same thing!

For an assignment which is almost entirely finished, I need to extract valid search results from these buffers. This essentially means getting rid of the unneeded faff. This is easier said than done, because none of the engines present their results as valid HTML. Yahoo is particularly difficult, as the results are nested in a load of messy JavaScript.

The language of implementation is C. This is actually a fairly small part of a larger assignment and not massively important, but I MUST get it working for my own self-worth.

- I must parse the HTML itself, rather than XML or any special API provided by the search engine, because I have to demonstrate sound knowledge of communicating via HTTP (already done!).

Does anyone have any suggestions on how this might be achieved? I got it pretty much working for Google, only to find this approach is no good for Yahoo due to the messy presentation. I was essentially using strstr to look for occurrences of certain tags that I knew were unique to Google's results.
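For readers following along, the strstr approach described above might look roughly like the sketch below. This is not the code from the assignment: the function name next_result and the marker string are hypothetical, and it assumes each result is introduced by a fixed marker followed by an anchor with a double-quoted href.

```c
#include <stddef.h>
#include <string.h>

/*
 * Find the next occurrence of `marker` in `buf`, then copy the value of the
 * first double-quoted href that follows it into `out` (truncating if needed).
 * Returns a pointer just past the extracted URL so the caller can keep
 * scanning from there, or NULL when no further result is found.
 *
 * `marker` is a hypothetical engine-specific string chosen by the caller;
 * nothing here knows about any real engine's markup.
 */
const char *next_result(const char *buf, const char *marker,
                        char *out, size_t outsz)
{
    static const char href[] = "href=\"";
    const char *p, *end;
    size_t len;

    if (outsz == 0)
        return NULL;

    /* Locate the tag that introduces a result. */
    p = strstr(buf, marker);
    if (p == NULL)
        return NULL;

    /* Locate the first href attribute after the marker. */
    p = strstr(p + strlen(marker), href);
    if (p == NULL)
        return NULL;
    p += sizeof href - 1;

    /* The URL runs up to the closing double quote. */
    end = strchr(p, '"');
    if (end == NULL)
        return NULL;

    len = (size_t)(end - p);
    if (len >= outsz)
        len = outsz - 1;    /* truncate rather than overflow */
    memcpy(out, p, len);
    out[len] = '\0';

    return end + 1;
}
```

Calling it in a loop, feeding the returned pointer back in as buf, walks through every result in a buffer. The weakness is the one described above: it depends entirely on a marker that is unique and stable, so it breaks as soon as the results are unquoted, reordered or generated by script.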

Thanks in advance guys!

xen · 2008-11-19 20:04:26

I have sorted this problem using two libraries: tidylib and libxml2.

1. Convert HTML to XHTML with tidylib.
2. Parse XHTML using (badly) documented libxml2 HTML routines.
3. Iterate the elements and process the links accordingly.

I would like to share the code for this, but it will have to wait until my assignment has been graded.
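The code itself was never posted in this thread. As a stand-in for step 3 only, here is a minimal sketch of one way the links could be "processed accordingly" once they have been pulled out of the parsed document. It assumes, and this is purely an assumption, that a valid result is an absolute http(s) URL whose host is neither the search engine's own host nor a subdomain of it. The function name is_external_result is hypothetical, and the comparison is case-sensitive for simplicity.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if `url` looks like an external search result, 0 otherwise.
 *
 * Assumed rule (not taken from the thread): keep absolute http(s) URLs whose
 * host is not `engine_host` (e.g. "google.com") or one of its subdomains.
 */
int is_external_result(const char *url, const char *engine_host)
{
    const char *host;
    size_t host_len, engine_len;

    /* Only absolute http(s) links qualify; this drops relative links,
       javascript: pseudo-links, mailto: and so on. */
    if (strncmp(url, "http://", 7) == 0)
        host = url + 7;
    else if (strncmp(url, "https://", 8) == 0)
        host = url + 8;
    else
        return 0;

    /* The host part ends at the first path, port, query or fragment char. */
    host_len = strcspn(host, "/:?#");
    if (host_len == 0)
        return 0;

    engine_len = strlen(engine_host);
    if (host_len < engine_len)
        return 1;

    /* Compare the tail of the host with the engine's host name. */
    if (strncmp(host + host_len - engine_len, engine_host, engine_len) != 0)
        return 1;

    /* Tail matches: reject an exact match or a true subdomain, but keep
       look-alikes such as "notgoogle.com". */
    if (host_len == engine_len || host[host_len - engine_len - 1] == '.')
        return 0;

    return 1;
}
```

With a filter like this, the loop over elements only has to collect every href and keep the ones for which the function returns 1, which removes most navigation, cache and preference links without any engine-specific tag matching.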

poopship21 · 2009-07-27 00:21:24

this kind of thing would be a useful tool in its own right. i have often wished for a simple console-based interface to web search that one could invoke like "locate <keyword> -l" and that would be amenable to piping. the issues that you encountered are an obstacle to simple web searches even with a full-fledged text-based browser, as so much of the content is layout-dependent, inevitably leading to breakage or at least messy navigation of results pages.

Arch Linux
