
#1 2015-11-23 09:57:00

kopiersperre
Member
Registered: 2011-03-22
Posts: 48

Parsing HTML webpages

I would like to parse a bunch of HTML5 pages, but I'm a programming noob. How do I extract the title and the authors from

<data><article ><h1 >$title</h1>
<a class="authorpopup" ... >$Autor1</a><a class="authorpopup" ... >$Autor2</a>   <span class="tajaxpopup" id="titleAuthorInfo"></span>

? Would you advise me to do it with PHP?

Greetings,
kopiersperre


#2 2015-11-23 10:44:57

ukhippo
Member
From: Non-paged pool
Registered: 2014-02-21
Posts: 366

Re: Parsing HTML webpages

PHP? No, lol.

Googling (well, using DuckDuckGo) turns up lots of HTML parsers in a variety of languages.
The second hit was https://en.m.wikipedia.org/wiki/Compari … ML_parsers which would be a good place to start your research...


#3 2015-11-23 11:35:14

kopiersperre
Member
Registered: 2011-03-22
Posts: 48

Re: Parsing HTML webpages

When I try to open an HTTPS domain with urllib in Python 3, I get

urllib.error.URLError: <urlopen error [SSL: WRONG_CIPHER_RETURNED] wrong cipher returned (_ssl.c:646)>


#4 2015-11-23 12:43:52

txtsd
Member
Registered: 2014-10-02
Posts: 96

Re: Parsing HTML webpages

Don't use urllib unless you're fairly good with python and networking in general. Use requests.
For parsing html/xml, you want to look at BeautifulSoup and lxml. There are a few others, but BS4 is relatively easy, and lxml is relatively powerful.
You could also use regex, but it's a world of hurt.
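A minimal sketch with BS4, assuming the markup from the first post (the title and author strings here are placeholders; in practice you'd get the HTML from `requests.get(url).text` rather than a literal string):

```python
from bs4 import BeautifulSoup

# The markup from the first post, with placeholder values filled in.
html = """<data><article><h1>Some Title</h1>
<a class="authorpopup">Author One</a><a class="authorpopup">Author Two</a>
<span class="tajaxpopup" id="titleAuthorInfo"></span></article></data>"""

soup = BeautifulSoup(html, "html.parser")
titles = [h1.get_text(strip=True) for h1 in soup.find_all("h1")]
authors = [a.get_text(strip=True)
           for a in soup.find_all("a", class_="authorpopup")]
print(titles)   # ['Some Title']
print(authors)  # ['Author One', 'Author Two']
```

Note `class_` with the trailing underscore: `class` is a Python keyword, so BS4 uses `class_` as the keyword argument for matching CSS classes.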

Last edited by txtsd (2015-11-23 12:44:42)


[CPU] AMD Ryzen 5 2400G
[iGPU] AMD RX Vega 11
[Kernel] linux-zen
[sway] [zsh] Arch user since [2014-09-01 02:09]


#5 2015-11-23 14:51:11

ukhippo
Member
From: Non-paged pool
Registered: 2014-02-21
Posts: 366

Re: Parsing HTML webpages

kopiersperre wrote:

When I try to open an HTTPS domain with urllib in Python 3, I get

urllib.error.URLError: <urlopen error [SSL: WRONG_CIPHER_RETURNED] wrong cipher returned (_ssl.c:646)>

That's typically caused by the server claiming to support at most TLS 1.0 and then using newer ciphers defined in TLS 1.2 because the client said it supported TLS 1.2.
If you can't fix the server, you'll need to either make the client only support TLS 1.0 [1] or use a different client/library that is tolerant of these buggy servers.

[1] for Python/urllib you need to pass an ssl.SSLContext to the urlopen call with its protocol set to ssl.PROTOCOL_TLSv1
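Roughly like this (a sketch, not tested against a real buggy server; the URL would be whatever site you're scraping):

```python
import ssl
import urllib.request

# Build a context that only offers TLS 1.0 -- good enough to work around
# this one broken server, but weaker than Python's default settings, so
# don't use it for everything.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)

def fetch(url):
    # Pass the restricted context on the urlopen call.
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.read()
```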


#6 2015-11-23 17:36:53

kopiersperre
Member
Registered: 2011-03-22
Posts: 48

Re: Parsing HTML webpages

Thanks. Is anyone here willing to answer my questions about BeautifulSoup?


#7 2015-11-24 12:51:57

Yggdrasil
Member
Registered: 2009-07-21
Posts: 37

Re: Parsing HTML webpages

Just ask your question and you'll see if someone can answer it.

Another way to do this is using XPath and XSLT. To do this in a shell script: convert the HTML to XML with tidy and then use xmlstarlet to query for the nodes you are interested in. That's probably harder than traversing a parse tree if you don't have prior knowledge of XPath. It also depends on what you want to do with the extracted data afterwards. Just mentioning it here as an alternative.

A very simple example; requires xmlstarlet:

#!/bin/bash

echo '<data><article ><h1 >$title</h1> <a class="authorpopup" >$Autor1</a><a class="authorpopup" >$Autor2</a>   <span class="tajaxpopup" id="titleAuthorInfo"></span> </article><article><h1>title2</h1><a class="authorpopup">foobar</a></article></data>' \
 | xml sel -T\
	-t -m '//article' \
		-o 'title: ' -v './h1' -n \
		-o 'authors:' -n \
		-m 'a[@class="authorpopup"]' \
			-o ' + ' -v '.' -n

