
#1 2011-10-08 23:27:51

hueck
Member
Registered: 2011-09-15
Posts: 12

Python 3 - Parse search output

Hey guys,

I know there are already plenty of AUR helpers out there, but I am doing this just for fun and to learn something ;)

My problem is that I don't know how to use the following output of my method.

Method

import sys
import urllib.request

def showSearchResults(keyword):
    # Fetch the AUR search page and print it line by line (each line is a bytes object)
    f = urllib.request.urlopen('http://aur.archlinux.org/packages.php?O=0&K='+keyword+'&do_Search=Los')
    for line in f:
        print(line)

searchPattern = sys.argv[1]
print('you were searching for: ' + searchPattern)
showSearchResults(searchPattern)

Result with Groovy

you were searching for: groovy
b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
b' "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
b'<html xmlns="http://www.w3.org/1999/xhtml"\n'
b'\txml:lang="en" lang="en">\n'
b'  <head>\n'
b'    <title>AUR (en) - Search Criteria: groovy</title>\n'
b"\t<link rel='stylesheet' type='text/css' href='css/fonts.css' />\n"
b"\t<link rel='stylesheet' type='text/css' href='css/containers.css' />\n"
b"\t<link rel='stylesheet' type='text/css' href='css/arch.css' />\n"
b"\t<link rel='stylesheet' type='text/css' href='css/archnavbar/archnavbar.css' />\n"
b"\t<link rel='shortcut icon' href='images/favicon.ico' />\n"
b"\t<link rel='alternate' type='application/rss+xml' title='Newest Packages RSS' href='rss.php' />\n"
b'\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n'
b'  </head>\n'
b'\t<body>\n'
b'\t\t<div id="archnavbar" class="anb-aur">\n'
b'\t\t\t<div id="archnavbarlogo"><h1><a href="/" title="Return to the main page">Arch Linux</a></h1></div>\n'
b'\t\t\t<div id="archnavbarmenu">\n'
b'\t\t\t\t<ul id="archnavbarlist">\n'
b'\t\t\t\t\t<li id="anb-home"><a href="http://www.archlinux.org/" title="Arch news, packages, projects and more">Home</a></li>\n'
b'\t\t\t\t\t<li id="anb-packages"><a href="http://www.archlinux.org/packages/" title="Arch Package Database">Packages</a></li>\n'
b'\t\t\t\t\t<li id="anb-forums"><a href="https://bbs.archlinux.org/" title="Community forums">Forums</a></li>\n'
b'\t\t\t\t\t<li id="anb-wiki"><a href="https://wiki.archlinux.org/" title="Community documentation">Wiki</a></li>\n'
b'\t\t\t\t\t<li id="anb-bugs"><a href="https://bugs.archlinux.org/" title="Report and track bugs">Bugs</a></li>\n'
b'\t\t\t\t\t<li id="anb-aur"><a href="https://aur.archlinux.org/" title="Arch Linux User Repository">AUR</a></li>\n'
b'\t\t\t\t\t<li id="anb-download"><a href="http://www.archlinux.org/download/" title="Get Arch Linux">Download</a></li>\n'
b'\t\t\t\t</ul>\n'
b'\t\t\t</div>\n'
b'\t\t</div><!-- #archnavbar -->\n'
b'\n'
b'\t\t<div id="archdev-navbar">\n'
b'\t\t\t<ul>\n'
b'\t\t\t\t<li><a href="index.php">AUR Home</a></li>\n'
b'\t\t\t\t<li><a href="account.php">Accounts</a></li>\n'
b'\t\t\t\t<li><a href="packages.php">Packages</a></li>\n'
b'\t\t\t\t<li><a href="http://bugs.archlinux.org/index.php?tasks=all&amp;project=2">Bugs</a></li>\n'
b'\t\t\t\t<li><a href="http://archlinux.org/mailman/listinfo/aur-general">Discussion</a></li>\n'
b'\t\t\t\t\t\t\t</ul>\n'
b'\t\t</div><!-- #archdev-navbar -->\n'
b'\n'
b'\t\t<div id="login_bar" class="pgbox">\n'
b"<span class='error'>\n"
b'\tHTTP login is disabled. Please <a href="https://aur.archlinux.org/packages.php?O=0&amp;K=groovy&amp;do_Search=Los">switch to HTTPs</a> if you want to login.</span>\n'
b'</div>\n'
b'\n'
b'\t<div id="lang_sub">\n'
b'<a href="/packages.php?setlang=ca" title="Catal\xc3\xa0">ca</a>\n'
b'<a href="/packages.php?setlang=cs" title="\xc4\x8desky">cs</a>\n'
b'<a href="/packages.php?setlang=da" title="Dansk">da</a>\n'
b'<a href="/packages.php?setlang=de" title="Deutsch">de</a>\n'
b'<a href="/packages.php?setlang=en" title="English">en</a>\n'
b'<a href="/packages.php?setlang=el" title="\xce\x95\xce\xbb\xce\xbb\xce\xb7\xce\xbd\xce\xb9\xce\xba\xce\xac">el</a>\n'
b'<a href="/packages.php?setlang=es" title="Espa\xc3\xb1ol">es</a>\n'
b'<a href="/packages.php?setlang=fi" title="Finnish">fi</a>\n'
b'<a href="/packages.php?setlang=fr" title="Fran\xc3\xa7ais">fr</a>\n'
b'<a href="/packages.php?setlang=he" title="\xd7\xa2\xd7\x91\xd7\xa8\xd7\x99\xd7\xaa">he</a>\n'
b'<a href="/packages.php?setlang=hr" title="Hrvatski">hr</a>\n'
b'<a href="/packages.php?setlang=hu" title="Magyar">hu</a>\n'
b'<a href="/packages.php?setlang=it" title="Italiano">it</a>\n'
b'<a href="/packages.php?setlang=nb_NO" title="Norsk">nb_no</a>\n'
b'<a href="/packages.php?setlang=nl" title="Dutch">nl</a>\n'
b'<a href="/packages.php?setlang=pl" title="Polski">pl</a>\n'
b'<a href="/packages.php?setlang=pt" title="Portugu\xc3\xaas">pt</a>\n'
b'<a href="/packages.php?setlang=pt_BR" title="Portugu\xc3\xaas (Brasil)">pt_br</a>\n'
b'<a href="/packages.php?setlang=ro" title="Rom\xc3\xa2n\xc4\x83">ro</a>\n'
b'<a href="/packages.php?setlang=ru" title="\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9">ru</a>\n'
b'<a href="/packages.php?setlang=sr" title="Srpski">sr</a>\n'
b'<a href="/packages.php?setlang=tr" title="T\xc3\xbcrk\xc3\xa7e">tr</a>\n'
b'<a href="/packages.php?setlang=uk" title="\xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd1\x97\xd0\xbd\xd1\x81\xd1\x8c\xd0\xba\xd0\xb0">uk</a>\n'
b'<a href="/packages.php?setlang=zh_CN" title="\xe7\xae\x80\xe4\xbd\x93\xe4\xb8\xad\xe6\x96\x87">zh_cn</a>\n'
b'\t</div>\n'
b'\t<!-- Start of main content -->\n'
b'\n'
b'\n'
b'\n'
b'\n'
b'\n'
b"<div class='pgbox'>\n"
b"<form action='packages.php' method='get'>\n"
b"<div class='pgboxtitle'>\n"
b"\t<span class='f3'>Search Criteria</span>\n"
b"\t<input type='hidden' name='O' value='0' />\n"
b'\t<input type=\'text\' name=\'K\' size=\'30\' value="groovy" maxlength=\'35\' />\n'
b"\t<input type='submit' style='min-width:80px' class='button' name='do_Search' value='Go' />\n"
b'\t\t<a href="?O=0&amp;K=groovy&amp;do_Search=Los&amp;PP=50&amp;detail=1">Advanced</a>\n'
b'</div>\n'
b'\n'
b'\t\t\t</form>\n'
b'</div>\n'
b"\t<form action='packages.php?O=0&amp;K=groovy&amp;do_Search=Los' method='post'>\n"
b'\t\t<div class="pgbox">\n'
b'\t\t\t<div class="pgboxtitle">\n'
b"\t\t\t\t<span class='f3'>Package Listing</span>\n"
b'\t\t\t</div>\n'
b'\n'
b'\n'
b'\n'
b'\n'
b"<table width='100%' cellspacing='0' cellpadding='2'>\n"
b'<tr>\n'
b'\t\n'
b"\t<th style='border-bottom: #666 1px solid; vertical-align: bottom'><span class='f2'>\n"
b"\t\t<a href='?O=0&amp;K=groovy&amp;do_Search=Los&amp;PP=50&amp;SB=c&amp;SO=d'>Category</a>\n"
b'\t</span></th>\n'
b"\t<th style='border-bottom: #666 1px solid; vertical-align: bottom; text-align: center;'><span class='f2'>\n"
b"\t\t<a href='?O=0&amp;K=groovy&amp;do_Search=Los&amp;PP=50&amp;SB=n&amp;SO=d'>Name</a>\n"
b'\t</span></th>\n'
b"\t<th style='border-bottom: #666 1px solid; vertical-align: bottom'><span class='f2'>\n"
b"\t\t<a href='?O=0&amp;K=groovy&amp;do_Search=Los&amp;PP=50&amp;SB=v&amp;SO=d'>Votes</a>\n"
b'\t</span></th>\n'
b'\n'
b"\t\t<th style='border-bottom: #666 1px solid; vertical-align: bottom; text-align: center;'><span class='f2'>Description</span></th>\n"
b"\t<th style='border-bottom: #666 1px solid; vertical-align: bottom'><span class='f2'>\n"
b"\t\t<a href='?O=0&amp;K=groovy&amp;do_Search=Los&amp;PP=50&amp;SB=m&amp;SO=d'>Maintainer</a>\n"
b'\t</span></th>\n'
b'</tr>\n'
b'\n'
b'<tr>\n'
b"\t\t<td class='data1'><span class='f5'><span class='blue'>devel</span></span></td>\n"
b"\t<td class='data1'><span class='f4'><a href='packages.php?ID=43399'><span class='black'>gant 1.9.3-1</span></a></span></td>\n"
b'\t<td class=\'data1\' style="text-align: right"><span class=\'f5\'><span class=\'blue\'>1</span></span></td>\n'
b"\t\t<td class='data1'><span class='f4'><span class='blue'>\n"
b'\tA Groovy-based build system that uses Ant tasks, but no XML</span></span></td>\n'
b"\t<td class='data1'><span class='f5'><span class='blue'>\n"
b"\t\t<a href='packages.php?K=szym&amp;SeB=m'>szym</a>\n"
b'\t\t</span></span></td>\n'
b'</tr>\n'
b'<tr>\n'
b"\t\t<td class='data2'><span class='f5'><span class='blue'>devel</span></span></td>\n"
b"\t<td class='data2'><span class='f4'><a href='packages.php?ID=12645'><span class='black'>grails 1.3.7-3</span></a></span></td>\n"
b'\t<td class=\'data2\' style="text-align: right"><span class=\'f5\'><span class=\'blue\'>73</span></span></td>\n'
b"\t\t<td class='data2'><span class='f4'><span class='blue'>\n"
b'\tGroovy on rails</span></span></td>\n'
b"\t<td class='data2'><span class='f5'><span class='blue'>\n"
b"\t\t<a href='packages.php?K=trontonic&amp;SeB=m'>trontonic</a>\n"
b'\t\t</span></span></td>\n'
b'</tr>\n'
b'<tr>\n'
b"\t\t<td class='data1'><span class='f5'><span class='blue'>devel</span></span></td>\n"
b"\t<td class='data1'><span class='f4'><a href='packages.php?ID=294'><span class='black'>groovy 1.8.2-1</span></a></span></td>\n"
b'\t<td class=\'data1\' style="text-align: right"><span class=\'f5\'><span class=\'blue\'>135</span></span></td>\n'
b"\t\t<td class='data1'><span class='f4'><span class='blue'>\n"
b'\tGroovy is a Java based scripting language, similar to Python, Ruby and Smalltalk</span></span></td>\n'
b"\t<td class='data1'><span class='f5'><span class='blue'>\n"
b"\t\t<a href='packages.php?K=Musikolo&amp;SeB=m'>Musikolo</a>\n"
b'\t\t</span></span></td>\n'
b'</tr>\n'
b'<tr>\n'
b"\t\t<td class='data2'><span class='f5'><span class='blue'>devel</span></span></td>\n"
b"\t<td class='data2'><span class='f4'><a href='packages.php?ID=31909'><span class='black'>groovy-docs 1.8.2-1</span></a></span></td>\n"
b'\t<td class=\'data2\' style="text-align: right"><span class=\'f5\'><span class=\'blue\'>6</span></span></td>\n'
b"\t\t<td class='data2'><span class='f4'><span class='blue'>\n"
b'\tDocumentation for the Groovy programming language.</span></span></td>\n'
b"\t<td class='data2'><span class='f5'><span class='blue'>\n"
b"\t\t<a href='packages.php?K=bruce&amp;SeB=m'>bruce</a>\n"
b'\t\t</span></span></td>\n'
b'</tr>\n'
b'<tr>\n'
b"\t\t<td class='outofdate'><span class='f5'><span class='blue'>none</span></span></td>\n"
b"\t<td class='outofdate'><span class='f4'><a href='packages.php?ID=47672'><span class='black'>groovyserv 0.6-1</span></a></span></td>\n"
b'\t<td class=\'outofdate\' style="text-align: right"><span class=\'f5\'><span class=\'blue\'>1</span></span></td>\n'
b"\t\t<td class='outofdate'><span class='f4'><span class='blue'>\n"
b'\tGroovyServ makes Groovy&#039;s startup time much faster, by pre-invoking Groovy as a server.</span></span></td>\n'
b"\t<td class='outofdate'><span class='f5'><span class='blue'>\n"
b"\t\t<span style='color: blue; font-style: italic;'>orphan</span>\n"
b'\t\t</span></span></td>\n'
b'</tr>\n'
b'\n'
b'\t</table>\n'
b'</div> <!-- .pgbox ??! -->\n'
b'\n'
b'\n'
b'\t\t<div class="pgbox pkg_search_results_footer">\n'
b'\t\t\t<div class="legend_and_actions">\n'
b'\t\t\t\t<div class="legend">\n'
b"\t\t\t\t\t<span class='f3'>Legend</span>\n"
b'\t\t\t\t\t<span class="outofdate">Out of Date</span>\n'
b'\t\t\t\t</div>\n'
b'\t\t\t\t\t\t\t</div> <!-- .legend_and_actions -->\n'
b'\t\t\t<div class="page_links">\n'
b'\t\t\t\t<div class="f4 blue">\n'
b'\t\t\t\t\tShowing results 1 - 5 of 5\t\t\t\t</div>\n'
b'\t\t\t\t<div class="page_nav">\n'
b'\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<span class="page_sel">1</span>\n'
b'\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</div>\n'
b'\t\t\t</div> <!-- .page_links -->\n'
b'\t\t</div> <!-- .pgbox .pkg_search_results_footer -->\n'
b'\t</form>\n'
b'\n'
b'\t<!-- End of main content -->\n'
b'<div class="pgbox version">v1.9.0</div>\t</body>\n'
b'</html>\n'

Now I have no clue how to proceed. I am only interested in the results (in the groovy example, the 5 shown here: http://aur.archlinux.org/packages.php?O … earch=Los)

Any ideas or clues? Maybe with some regexp?

Thanks in advance

Last edited by hueck (2011-10-08 23:28:51)


#2 2011-10-09 00:22:17

Nisstyre56
Member
From: Canada
Registered: 2010-03-25
Posts: 85

Re: Python 3 - Parse search output

Do NOT use regular expressions for this. They won't work. Never use regular expressions to parse html or xml (or any other syntax) unless you're using them to write some kind of tokenizer.

What you want is lxml: http://lxml.de/index.html

You can read more about lexers and parsers here to get a general overview of what lxml does: http://en.wikipedia.org/wiki/Lexical_analysis

Also it might be helpful to do a bit of research on xml, and understand how it works.
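To illustrate the parser-instead-of-regex idea, here is a minimal sketch. It uses html.parser from the standard library rather than lxml, only so that it runs without installing anything. The class name PackageNameParser and the choice to look for <span class='black'> are assumptions based on the output posted above, not part of any AUR interface.

```python
from html.parser import HTMLParser


class PackageNameParser(HTMLParser):
    """Collect the text of every <span class='black'> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_name = False  # True while inside a <span class='black'>
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == 'span' and ('class', 'black') in attrs:
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())


# Two table cells copied from the search output in the first post.
sample = (
    "<td class='data1'><span class='f4'><a href='packages.php?ID=43399'>"
    "<span class='black'>gant 1.9.3-1</span></a></span></td>"
    "<td class='data2'><span class='f4'><a href='packages.php?ID=12645'>"
    "<span class='black'>grails 1.3.7-3</span></a></span></td>"
)

parser = PackageNameParser()
parser.feed(sample)
parser.close()
print(parser.names)
```

The parser tracks tag nesting for you, which is the part a regular expression cannot do reliably once the markup changes slightly.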

Last edited by Nisstyre56 (2011-10-09 00:26:25)


In Zen they say: If something is boring after two minutes, try it for four. If still boring, try it for eight, sixteen, thirty-two, and so on. Eventually one discovers that it's not boring at all but very interesting.
~ John Cage


#3 2011-10-09 06:30:03

rockin turtle
Member
From: Montana, USA
Registered: 2009-10-22
Posts: 227

Re: Python 3 - Parse search output

Try this: (see http://docs.python.org/py3k/library/xml … ttree.html)

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import sys
import xml.etree.ElementTree
import urllib.request

search_url = ''.join(['http://aur.archlinux.org/packages.php?O=0&K=', sys.argv[1], '&do_Search=Los'])

# Decode each line of the response from bytes to str
dstr = [str(line, 'utf-8') for line in urllib.request.urlopen(search_url)]

# The page is XHTML, so it can be parsed as XML
element = xml.etree.ElementTree.fromstringlist(dstr)

oldv = ''
line = ''
# Cells of one table row share a class (data1, data2, outofdate),
# so a change of class is treated as the start of a new row
for e in element.iter('{http://www.w3.org/1999/xhtml}td'):
	v = e.get('class')
	if v != oldv:
		if len(line) > 0: print(line)
		line = ''
	for s in e.itertext():
		s = s.strip()
		if len(line) == 0:
			line = s
		else:
			line = ' '.join([line, s])
	oldv = v
print(line)

Last edited by rockin turtle (2011-10-09 06:32:21)


#4 2011-10-09 09:42:13

hueck
Member
Registered: 2011-09-15
Posts: 12

Re: Python 3 - Parse search output

Thanks, I am going to do some research and try it out.


#5 2011-10-15 13:22:43

Mr.Elendig
#archlinux@freenode channel op
From: The intertubes
Registered: 2004-11-07
Posts: 4,094

Re: Python 3 - Parse search output

Use the API instead of parsing the page, seriously.

[oh@Alice][~]% curl "https://aur.archlinux.org/rpc.php?type=search&arg=groovy" 2>/dev/null | json_pp 
{
   "type" : "search",
   "results" : [
      {
         "URL" : "http://groovy.codehaus.org",
         "ID" : "294",
         "FirstSubmitted" : "1113416437",
         "Maintainer" : "Musikolo",
         "OutOfDate" : "0",
         "CategoryID" : "3",
         "License" : "BSD/Apache style licence",
         "URLPath" : "/packages/gr/groovy/groovy.tar.gz",
         "NumVotes" : "135",
         "Version" : "1.8.2-1",
         "Name" : "groovy",
         "Description" : "Groovy is a Java based scripting language, similar to Python, Ruby and Smalltalk",
         "LastModified" : "1315523364"
      },
      {
         "URL" : "http://grails.org/",
         "ID" : "12645",
         "FirstSubmitted" : "1187846781",
         "Maintainer" : "trontonic",
         "OutOfDate" : "0",
         "CategoryID" : "3",
         "License" : "Apache",
         "URLPath" : "/packages/gr/grails/grails.tar.gz",
         "NumVotes" : "73",
         "Version" : "1.3.7-3",
         "Name" : "grails",
         "Description" : "Groovy on rails",
         "LastModified" : "1305713109"
      },
      {
         "URL" : "http://groovy.codehaus.org",
         "ID" : "31909",
         "FirstSubmitted" : "1257899383",
         "Maintainer" : "bruce",
         "OutOfDate" : "0",
         "CategoryID" : "3",
         "License" : "APACHE",
         "URLPath" : "/packages/gr/groovy-docs/groovy-docs.tar.gz",
         "NumVotes" : "6",
         "Version" : "1.8.2-1",
         "Name" : "groovy-docs",
         "Description" : "Documentation for the Groovy programming language.",
         "LastModified" : "1315709403"
      },
      {
         "URL" : "http://gant.codehaus.org",
         "ID" : "43399",
         "FirstSubmitted" : "1289426860",
         "Maintainer" : "szym",
         "OutOfDate" : "0",
         "CategoryID" : "3",
         "License" : "APACHE",
         "URLPath" : "/packages/ga/gant/gant.tar.gz",
         "NumVotes" : "1",
         "Version" : "1.9.3-1",
         "Name" : "gant",
         "Description" : "A Groovy-based build system that uses Ant tasks, but no XML",
         "LastModified" : "1289426860"
      },
      {
         "URL" : "http://kobo.github.com/groovyserv/index.html",
         "ID" : "47672",
         "FirstSubmitted" : "1300877392",
         "Maintainer" : null,
         "OutOfDate" : "1",
         "CategoryID" : "1",
         "License" : "apache-ant",
         "URLPath" : "/packages/gr/groovyserv/groovyserv.tar.gz",
         "NumVotes" : "1",
         "Version" : "0.6-1",
         "Name" : "groovyserv",
         "Description" : "GroovyServ makes Groovy's startup time much faster, by pre-invoking Groovy as a server.",
         "LastModified" : "1300877392"
      }
   ]
}
[oh@Alice][~]% 
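
The same request can be made from Python 3 with only the standard library. This is a rough sketch, assuming the rpc.php endpoint and the field names still match the reply shown above. The function names search_aur and format_results are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

RPC_URL = 'https://aur.archlinux.org/rpc.php'  # endpoint taken from the curl call above


def search_aur(keyword):
    """Query the RPC interface and return the decoded JSON reply as a dict."""
    query = urllib.parse.urlencode({'type': 'search', 'arg': keyword})
    with urllib.request.urlopen(RPC_URL + '?' + query) as response:
        return json.loads(response.read().decode('utf-8'))


def format_results(reply):
    """Build one 'name version (votes) - description' line per package."""
    lines = []
    for pkg in reply['results']:
        lines.append('{Name} {Version} ({NumVotes} votes) - {Description}'.format(**pkg))
    return lines


# Offline demonstration with a trimmed copy of the reply shown above,
# so the example runs without network access.
sample_reply = json.loads('''
{
   "type" : "search",
   "results" : [
      {"Name" : "grails", "Version" : "1.3.7-3", "NumVotes" : "73",
       "Description" : "Groovy on rails"}
   ]
}
''')
for line in format_results(sample_reply):
    print(line)
```

For a live query you would call format_results(search_aur(sys.argv[1])) instead of using the sample. urlencode also takes care of escaping the keyword.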

Last edited by Mr.Elendig (2011-10-15 13:26:58)


Evil #archlinux@libera.chat channel op and general support dude.
. files on github, Screenshots, Random pics and the rest
