I would like to parse a bunch of HTML5 pages, but I'm a programming noob. How do I extract the title and the authors from markup like this?
<data><article><h1>$title</h1>
<a class="authorpopup" ... >$Autor1</a><a class="authorpopup" ... >$Autor2</a> <span class="tajaxpopup" id="titleAuthorInfo"></span>
Would you advise me to do it with PHP?
Greetings,
kopiersperre
PHP? No, lol.
Googling (well, using DuckDuckGo) turns up lots of HTML parsers in a variety of languages.
The second hit was https://en.m.wikipedia.org/wiki/Compari … ML_parsers, which would be a good place to start your research...
When I try to open an HTTPS site with urllib in Python 3, I get
urllib.error.URLError: <urlopen error [SSL: WRONG_CIPHER_RETURNED] wrong cipher returned (_ssl.c:646)>
Don't use urllib unless you're fairly good with Python and networking in general. Use requests.
For parsing HTML/XML, you want to look at BeautifulSoup and lxml. There are a few others, but BS4 is relatively easy, and lxml is relatively powerful.
You could also use regex, but it's a world of hurt.
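A minimal sketch of that combo, using the markup from the first post (the authorpopup class comes from there; the URL is a placeholder):
import requests
from bs4 import BeautifulSoup  # packages: requests, beautifulsoup4

# Placeholder URL; point this at one of your real pages.
resp = requests.get("https://example.com/article.html")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for article in soup.find_all("article"):
    # <h1> holds the title, <a class="authorpopup"> the authors.
    title = article.h1.get_text(strip=True) if article.h1 else None
    authors = [a.get_text(strip=True)
               for a in article.find_all("a", class_="authorpopup")]
    print(title, authors)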
Last edited by txtsd (2015-11-23 12:44:42)
When I try to open an HTTPS site with urllib in Python 3, I get
urllib.error.URLError: <urlopen error [SSL: WRONG_CIPHER_RETURNED] wrong cipher returned (_ssl.c:646)>
That's typically caused by the server claiming to support at most TLS 1.0 and then using newer ciphers defined in TLS 1.2 because the client said it supported TLS 1.2.
If you can't fix the server, you'll need to either make the client only support TLS 1.0 [1] or use a different client/library that is tolerant of these buggy servers.
[1] For Python/urllib you need to pass an ssl.SSLContext to the urlopen call, with its protocol set to ssl.PROTOCOL_TLSv1.
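A minimal sketch of [1] (placeholder URL; assumes Python >= 3.4.3, where urlopen accepts a context argument):
import ssl
import urllib.request

# Restrict the client to TLS 1.0 so the broken server never gets the
# chance to pick a TLS 1.2 cipher. Placeholder URL.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
with urllib.request.urlopen("https://example.com/", context=ctx) as resp:
    html = resp.read()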
Thanks. Is anyone here willing to answer my questions about BeautifulSoup?
Just ask your question and you'll see if someone can answer it.
Another way to do this is with XPath and XSLT. To do it in a shell script: convert the HTML to XML with tidy, then use xmlstarlet to query for the nodes you are interested in. That's probably harder than traversing a parse tree if you don't have prior knowledge of XPath, and it also depends on what you want to do with the extracted data afterwards. Just mentioning it here as an alternative.
A very simple example (requires xmlstarlet):
#!/bin/bash
# The sample input is already well-formed XML, so the tidy step is skipped here.
echo '<data><article><h1>$title</h1> <a class="authorpopup">$Autor1</a><a class="authorpopup">$Autor2</a> <span class="tajaxpopup" id="titleAuthorInfo"></span> </article><article><h1>title2</h1><a class="authorpopup">foobar</a></article></data>' \
| xml sel -T \
  -t -m '//article' \
    -o 'title: ' -v './h1' -n \
    -o 'authors:' -n \
    -m 'a[@class="authorpopup"]' \
      -o ' + ' -v '.' -n
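For a real page you'd put tidy in front of the pipe, roughly like this (untested sketch: it assumes tidy-html5, since old tidy doesn't know HTML5 tags like <article>, and page.html is a placeholder; note that tidy's XHTML output lives in the XHTML namespace, hence the -N mapping and the x: prefixes):
tidy -q -asxml --show-warnings no page.html \
| xml sel -T -N x='http://www.w3.org/1999/xhtml' \
  -t -m '//x:article' \
    -o 'title: ' -v './x:h1' -n \
    -o 'authors:' -n \
    -m 'x:a[@class="authorpopup"]' \
      -o ' + ' -v '.' -n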