
#1 2015-11-23 09:57:00

kopiersperre
Member
Registered: 2011-03-22
Posts: 48

Parsing HTML webpages

I would like to parse a bunch of HTML5 pages, but I'm a programming noob. How do I extract the title and the authors from

<data><article ><h1 >$title</h1>
<a class="authorpopup" ... >$Autor1</a><a class="authorpopup" ... >$Autor2</a>   <span class="tajaxpopup" id="titleAuthorInfo"></span>

? Would you advise me to do it with PHP?

Greetings,
kopiersperre


#2 2015-11-23 10:44:57

ukhippo
Member
From: Non-paged pool
Registered: 2014-02-21
Posts: 366

Re: Parsing HTML webpages

PHP? No, lol.

Googling (well, using DuckDuckGo) turns up lots of HTML parsers in a variety of languages.
The second hit was https://en.m.wikipedia.org/wiki/Compari … ML_parsers which would be a good place to start your research...


#3 2015-11-23 11:35:14

kopiersperre
Member
Registered: 2011-03-22
Posts: 48

Re: Parsing HTML webpages

When I try to open an HTTPS domain with urllib in Python 3, I get

urllib.error.URLError: <urlopen error [SSL: WRONG_CIPHER_RETURNED] wrong cipher returned (_ssl.c:646)>


#4 2015-11-23 12:43:52

txtsd
Member
Registered: 2014-10-02
Posts: 96

Re: Parsing HTML webpages

Don't use urllib unless you're fairly good with python and networking in general. Use requests.
For parsing html/xml, you want to look at BeautifulSoup and lxml. There are a few others, but BS4 is relatively easy, and lxml is relatively powerful.
You could also use regex, but it's a world of hurt.
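A minimal sketch with BS4, assuming the markup from the first post (the title and author strings here are placeholders; in practice you'd get the HTML from `requests.get(url).text` rather than a literal string):

```python
from bs4 import BeautifulSoup

# The markup from the first post, with placeholder values filled in.
html = """<data><article><h1>Some Title</h1>
<a class="authorpopup">Author One</a><a class="authorpopup">Author Two</a>
<span class="tajaxpopup" id="titleAuthorInfo"></span></article></data>"""

soup = BeautifulSoup(html, "html.parser")
titles = [h1.get_text(strip=True) for h1 in soup.find_all("h1")]
authors = [a.get_text(strip=True)
           for a in soup.find_all("a", class_="authorpopup")]
print(titles)   # ['Some Title']
print(authors)  # ['Author One', 'Author Two']
```

Note `class_` with the trailing underscore: `class` is a Python keyword, so BS4 uses `class_` as the keyword argument for matching CSS classes.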

Last edited by txtsd (2015-11-23 12:44:42)


[CPU] AMD Ryzen 5 2400G
[iGPU] AMD RX Vega 11
[Kernel] linux-zen
[sway] [zsh] Arch user since [2014-09-01 02:09]


#5 2015-11-23 14:51:11

ukhippo
Member
From: Non-paged pool
Registered: 2014-02-21
Posts: 366

Re: Parsing HTML webpages

kopiersperre wrote:

When I try to open an HTTPS domain with urllib in Python 3, I get

urllib.error.URLError: <urlopen error [SSL: WRONG_CIPHER_RETURNED] wrong cipher returned (_ssl.c:646)>

That's typically caused by the server claiming to support at most TLS 1.0 and then using newer ciphers defined in TLS 1.2 because the client said it supported TLS 1.2.
If you can't fix the server, you'll need to either make the client only support TLS 1.0 [1] or use a different client/library that is tolerant of these buggy servers.

[1] for Python/urllib you need to pass an ssl.SSLContext to the urlopen call with its protocol set to ssl.PROTOCOL_TLSv1
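Roughly like this (a sketch, not tested against a real buggy server; the URL would be whatever site you're scraping):

```python
import ssl
import urllib.request

# Build a context that only offers TLS 1.0 -- good enough to work around
# this one broken server, but weaker than Python's default settings, so
# don't use it for everything.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)

def fetch(url):
    # Pass the restricted context on the urlopen call.
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.read()
```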


#6 2015-11-23 17:36:53

kopiersperre
Member
Registered: 2011-03-22
Posts: 48

Re: Parsing HTML webpages

Thanks. Is anyone here willing to answer my questions about BeautifulSoup?


#7 2015-11-24 12:51:57

Yggdrasil
Member
Registered: 2009-07-21
Posts: 37

Re: Parsing HTML webpages

Just ask your question and you'll see if someone can answer it.

Another way to do this is using XPath and XSLT. To do this in a shell script: convert the HTML to XML with tidy and then use xmlstarlet to query for the nodes you are interested in. That's probably harder than traversing a parse tree if you don't have prior knowledge of XPath. It also depends on what you want to do with the extracted data afterwards. Just mentioning it here as an alternative.

A very simple example; requires xmlstarlet:

#!/bin/bash

echo '<data><article ><h1 >$title</h1> <a class="authorpopup" >$Autor1</a><a class="authorpopup" >$Autor2</a>   <span class="tajaxpopup" id="titleAuthorInfo"></span> </article><article><h1>title2</h1><a class="authorpopup">foobar</a></article></data>' \
 | xml sel -T\
	-t -m '//article' \
		-o 'title: ' -v './h1' -n \
		-o 'authors:' -n \
		-m 'a[@class="authorpopup"]' \
			-o ' + ' -v '.' -n

