#1 2012-12-29 08:30:47

instantaphex
Member
Registered: 2012-07-07
Posts: 67

[SOLVED] Parsing HTML in bash script

So this is what I'm trying to do.  I've created a bash function that lets me play episodes of TNG using the following syntax:

tng 5 25

This would play season 5, episode 25. I was thinking of ways to extend the script and thought of adding episode summaries.  I want to use curl to fetch the HTML from IMDb that contains the descriptions.  I've tried parsing with grep and inserting the results into an array, but every space creates a new element, which won't work.  I've also thought of piping stdout to w3m and using text only, but I can't seem to solve this at the moment.  Any suggestions?

Last edited by instantaphex (2012-12-29 19:22:56)

Offline

#2 2012-12-29 08:44:24

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,354

Re: [SOLVED] Parsing HTML in bash script

You may want to change your title, since what you're looking for is ways to properly parse html and the title does not indicate that at all.


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the people who matters attention on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#3 2012-12-29 18:01:20

instantaphex
Member
Registered: 2012-07-07
Posts: 67

Re: [SOLVED] Parsing HTML in bash script

Thanks for the tip.  Changed.

Offline

#4 2012-12-29 18:29:28

instantaphex
Member
Registered: 2012-07-07
Posts: 67

Re: [SOLVED] Parsing HTML in bash script

So I've been trying out a couple of things and have made a little progress.  Here is what I have so far:

curl -s "http://www.imdb.com/title/tt0092455/episodes?season=1" | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /' | sed 's/<\/div>/\n/'

Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden.  I'm not yet sure how to do that.  I don't have much experience in bash scripting.  Anyone have any ideas of how I might go about this?

Offline

#5 2012-12-29 19:00:38

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963
Website

Re: [SOLVED] Parsing HTML in bash script

instantaphex wrote:

Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden.  I'm not yet sure how to do that.  I don't have much experience in bash scripting.  Anyone have any ideas of how I might go about this?

#!/bin/bash
paragraphs=()
# Populate the array using a while loop and process substitution. Piping
# directly into the loop would run it in a subshell, so the array would
# still be empty afterwards.
# read -r keeps backslashes in the text from being treated as escapes.
while read -r line
do
  paragraphs+=("$line")
done < <(curl -s "http://www.imdb.com/title/tt0092455/episodes?season=1" | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /;s/<\/div>/\n/')

echo first paragraph
echo "${paragraphs[0]}"
echo
echo fifth paragraph
echo "${paragraphs[4]}"
echo
echo all paragraphs
for paragraph in "${paragraphs[@]}"
do
  echo "$paragraph"
done

edit: note that I have combined the separate sed commands into a single one using semicolons to separate the substitutions.

Last edited by Xyne (2012-12-29 19:01:36)


My Arch Linux Stuff | Forum Etiquette | Community Ethos - Arch is not for everyone

Offline

#6 2012-12-29 19:22:14

instantaphex
Member
Registered: 2012-07-07
Posts: 67

Re: [SOLVED] Parsing HTML in bash script

I actually came up with a solution a bit ago that seems to work.  I would however like to understand your solution.  Here is mine:

#!/bin/bash

OIFS="$IFS"
IFS=$'|'
s=$(curl -s "http://www.imdb.com/title/tt0092455/episodes?season=1" | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">//' | sed 's/<\/div>/\|/')
array=($s)
IFS="$OIFS"

echo -e "Season 1 episode 1\n"
echo "${array[0]}"

Can you explain to me how this while loop works?  Is its last line,

done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /;s/<\/div>/\n/')

feeding lines of text into the $line variable?  As I said before I'm not too familiar with bash syntax.

Offline

#7 2012-12-30 01:30:49

Xyne
Administrator/PM
Registered: 2008-08-03
Posts: 6,963
Website

Re: [SOLVED] Parsing HTML in bash script

It's one of the more obscure features of bash, called process substitution. If you wrap a command with "<(" and ")", bash replaces the expression with the path of a named pipe (or a /dev/fd entry) from which the output of the command can be read. For example,

cat <(curl https://www.archlinux.org)

would do the equivalent of

mkfifo tmp.fifo
curl https://www.archlinux.org > tmp.fifo &
cat tmp.fifo
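
A quick way to check this is to echo the expression itself and look at what bash hands to the command. The following is only a small sketch; the variable names are made up for illustration, and the exact path printed depends on the system (often something under /dev/fd), so no particular value is assumed.

```shell
#!/bin/bash
# Capture what bash substitutes for the <( ... ) expression: a path.
path_seen=$(echo <(true))
echo "bash passed this path: $path_seen"

# Reading from that path yields the output of the wrapped command.
contents=$(cat <(printf 'hello\n'))
echo "$contents"
```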

The additional "<" (with a space) feeds this "input file" into the while loop, just as "< foo.txt" would feed the contents of the file "foo.txt" into STDIN for some command. For example

while read line
do
  echo "$line"
done < foo.txt

would loop over each line of "foo.txt". Replacing "foo.txt" with "<(curl https://www.archlinux.org)" would loop over the content of "https://www.archlinux.org" instead.
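
The reason for using process substitution rather than a plain pipe can be seen in a small side-by-side test. This is a sketch under assumptions: fetch_lines is a hypothetical stand-in for the curl | grep | sed pipeline, and the behaviour shown is that of bash with its default settings (the lastpipe option is off).

```shell
#!/bin/bash
# Hypothetical stand-in for the curl | grep | sed pipeline: emits three lines.
fetch_lines() {
  printf '%s\n' "first summary" "second summary" "third summary"
}

# Piping into the loop: the loop runs in a subshell, so the array
# modified there is discarded when the pipeline finishes.
piped=()
fetch_lines | while IFS= read -r line
do
  piped+=("$line")
done

# Process substitution: the loop runs in the current shell,
# so the array keeps its contents.
substituted=()
while IFS= read -r line
do
  substituted+=("$line")
done < <(fetch_lines)

echo "piped: ${#piped[@]} elements"
echo "process substitution: ${#substituted[@]} elements"
```

With default bash settings the first count is 0 and the second is 3, which is why the script in this thread redirects from "<( ... )" instead of piping.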

At first I did the same thing as you did in your previous post (temporarily change IFS and then load the content directly into an array), but I then concluded that it is a bad idea. Unexpected characters in the description will break that code: a literal "|" in the text splits one description in two, and because the expansion is unquoted, glob characters such as "*" can be expanded into file names. Feeding unsanitized remote text through unquoted expansions also carries some security risk, however unlikely it may be for anything malicious to make its way into the text. I believe that my suggestion is safer and more robust.
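
One way the IFS approach can go wrong is sketched below. Everything here is an assumption made for illustration: the input strings are invented, a newline is used as the separator instead of "|" so that both approaches can read the same data, and a scratch directory with two files is created so that the glob has something to match.

```shell
#!/bin/bash
# Hypothetical input: three descriptions, one of which is just "*".
descriptions=$'first description\n*\nthird description'

# Work in a scratch directory containing two files.
scratch=$(mktemp -d)
touch "$scratch/a.txt" "$scratch/b.txt"
cd "$scratch" || exit 1

# IFS approach: the unquoted expansion is word-split on newlines,
# and each resulting word is then also subject to pathname expansion.
OIFS="$IFS"
IFS=$'\n'
split=($descriptions)
IFS="$OIFS"

# Loop approach: read stores each line as-is, with no pathname expansion.
looped=()
while IFS= read -r line
do
  looped+=("$line")
done <<< "$descriptions"

echo "IFS split: ${#split[@]} elements"
echo "read loop: ${#looped[@]} elements"

# Clean up the scratch directory.
cd / && rm -r "$scratch"
```

Here the IFS version ends up with 4 elements because "*" is replaced by the two file names, while the loop keeps the 3 original lines.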

Again, note that you can replace

sed 's/<div class=\"item_description\" itemprop=\"description\">//' | sed 's/<\/div>/\|/'

with

sed 's/<div class=\"item_description\" itemprop=\"description\">//;s/<\/div>/\|/'

If you want to buy 2 things from the shop, there's no reason to walk there and back again twice to buy each thing separately. ;)
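
That the two forms give the same result can be checked on a single line of input. This is a minimal sketch: the sample line is invented to resemble the markup matched in this thread, and the quoting inside the sed expressions is simplified slightly.

```shell
#!/bin/bash
# Hypothetical sample line shaped like the markup matched in this thread.
html='<div class="item_description" itemprop="description">Sample summary.</div>'

# Two sed processes, one substitution each.
two=$(printf '%s\n' "$html" \
  | sed 's/<div class="item_description" itemprop="description">//' \
  | sed 's/<\/div>/|/')

# One sed process, both substitutions separated by a semicolon.
one=$(printf '%s\n' "$html" \
  | sed 's/<div class="item_description" itemprop="description">//;s/<\/div>/|/')

echo "$two"
echo "$one"
```

Both commands print "Sample summary.|"; the second simply starts one process fewer.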

Last edited by Xyne (2012-12-30 01:34:17)


Offline

#8 2012-12-30 05:45:00

Nisstyre56
Member
From: Canada
Registered: 2010-03-25
Posts: 85

Re: [SOLVED] Parsing HTML in bash script

Note that you can't actually parse HTML with grep.


In Zen they say: If something is boring after two minutes, try it for four. If still boring, try it for eight, sixteen, thirty-two, and so on. Eventually one discovers that it's not boring at all but very interesting.
~ John Cage

Offline
