[SOLVED] Parsing HTML in bash script

instantaphex · 2012-12-29 08:30:47

So this is what I'm trying to do. I've created a bash function that allows me to play episodes of TNG by using the following syntax:

tng 5 25

This would play season 5 episode 25. I was thinking of ways to extend the script and I thought of adding episode summaries. I want to use curl to fetch the HTML from IMDb that contains the descriptions. I've tried parsing it with grep and inserting the results into an array, but every space creates a new element, which won't work. I've also thought of piping stdout to w3m and using the text-only output, but I can't seem to solve this at the moment. Any suggestions?

Last edited by instantaphex (2012-12-29 19:22:56)

ngoonee · 2012-12-29 08:44:24

You may want to change your title, since what you're looking for is ways to properly parse html and the title does not indicate that at all.

instantaphex · 2012-12-29 18:01:20

Thanks for the tip. Changed.

instantaphex · 2012-12-29 18:29:28

So I've been trying out a couple of things and have made a little progress. Here is what I have so far:

curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /' | sed 's/<\/div>/\n/'

Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden. I'm not yet sure how to do that. I don't have much experience in bash scripting. Anyone have any ideas of how I might go about this?

Xyne · 2012-12-29 19:00:38

instantaphex wrote:

Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden. I'm not yet sure how to do that. I don't have much experience in bash scripting. Anyone have any ideas of how I might go about this?

#!/bin/bash
paragraphs=()
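# populate the array using a while loop and process substitution
# (piping into the loop would run it in a subshell and the array would be lost)
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line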
do
  paragraphs+=("$line")
done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /;s/<\/div>/\n/')

echo first paragraph
echo "${paragraphs[0]}"
echo
echo fifth paragraph
echo "${paragraphs[4]}"
echo
echo all paragraphs
for paragraph in "${paragraphs[@]}"
do
  echo "$paragraph"
done

edit: note that I have combined the separate sed commands into a single one using semicolons to separate the substitutions.
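
For what it's worth, bash 4 and later also has a builtin, mapfile (also spelled readarray), that loads lines into an array without an explicit loop. A minimal sketch, assuming a made-up sample_html function that stands in for the curl output so it runs offline; the descriptions are invented:

```shell
#!/bin/bash
# sample_html is a hypothetical stand-in for the curl output; the text is made up.
sample_html() {
  printf '%s\n' \
    '<div class="item_description" itemprop="description">First summary.</div>' \
    '<div class="other">ignored</div>' \
    '<div class="item_description" itemprop="description">Second summary, with spaces.</div>'
}

# -t strips the trailing newline from each line before storing it in the array
mapfile -t paragraphs < <(sample_html | grep "item_description" | sed 's/<div class="item_description" itemprop="description">//;s/<\/div>//')

echo "${#paragraphs[@]} paragraphs"
echo "${paragraphs[1]}"
```

Each matching line becomes exactly one array element, spaces included.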

Last edited by Xyne (2012-12-29 19:01:36)

instantaphex · 2012-12-29 19:22:14

I actually came up with a solution a bit ago that seems to work. I would however like to understand your solution. Here is mine:

#!/bin/bash

OIFS="$IFS"
IFS=$'|'
s=$(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">//' | sed 's/<\/div>/\|/')
array=($s)
IFS="$OIFS"

echo -e "Season 1 episode 1\n"
echo "${array[0]}"

Can you explain to me how this while loop works? Is the last line of it

done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /;s/<\/div>/\n/')

feeding lines of text into the $line variable? As I said before I'm not too familiar with bash syntax.

Xyne · 2012-12-30 01:30:49

It's one of the more obscure features of bash, called process substitution. If you wrap a command with "<(" and ")", the expression is replaced with the path to a named pipe (or a /dev/fd entry, depending on the system) from which the output of the command can be read. For example,

cat <(curl https://www.archlinux.org)

would do the equivalent of

mkfifo tmp.fifo
curl https://www.archlinux.org > tmp.fifo &
cat tmp.fifo

The additional "<" (with a space) feeds this "input file" into the while loop, just as "< foo.txt" would feed the contents of the file "foo.txt" into STDIN for some command. For example

while IFS= read -r line
do
  echo "$line"
done < foo.txt

would loop over each line of "foo.txt". Replacing "foo.txt" with "<(curl https://www.archlinux.org)" would loop over the content of "https://www.archlinux.org" instead.
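
As for why piping straight into the loop does not work for filling an array: with bash's default settings, a loop on the receiving end of a pipe runs in a subshell, so anything it assigns is gone once the loop ends. A small sketch with printf standing in for curl (the array names piped and kept are just for illustration):

```shell
#!/bin/bash
# Piping into the loop: the loop body runs in a subshell, so the array is lost afterwards.
piped=()
printf 'one\ntwo\n' | while IFS= read -r line
do
  piped+=("$line")
done
echo "piped: ${#piped[@]} elements"

# Process substitution: the loop runs in the current shell, so the array survives.
kept=()
while IFS= read -r line
do
  kept+=("$line")
done < <(printf 'one\ntwo\n')
echo "kept: ${#kept[@]} elements"
```

Run as a plain script, the first count should be 0 and the second 2.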

At first I did the same thing as you did in your previous post (temporarily change IFS and then load the content directly into an array), but I then concluded that it is a bad idea. Unexpected characters in the description will break that code: a literal "|" splits one description in two, and because $s is expanded unquoted, glob characters such as "*" can be replaced with matching file names. Loading untrusted remote text that way also struck me as a security risk, however unlikely it may be for anything malicious to make its way into the text. I believe that my suggestion is safer.
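
To make the "unexpected characters" problem concrete, here is a contrived sketch. The descriptions are invented, and one of them is deliberately just "*"; the scratch directory and the array names unsafe and safe are only there for the demonstration:

```shell
#!/bin/bash
# Work in a scratch directory so the glob has something to match.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch a.txt b.txt

# Made-up descriptions separated by "|"; the middle one is a bare "*".
s='First summary.|*|Third summary.'

# Unquoted expansion: split on "|", then every word is also glob-expanded.
OIFS="$IFS"
IFS='|'
unsafe=($s)
IFS="$OIFS"
echo "unsafe: ${#unsafe[@]} elements"

# read -a splits on IFS as well, but performs no glob expansion.
IFS='|' read -r -a safe <<< "$s"
echo "safe: ${#safe[@]} elements"

# Clean up the scratch directory.
cd / && rm -r "$scratch"
```

The unquoted version ends up with four elements because "*" turns into a.txt and b.txt, while the read version keeps the three original strings.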

Again, note that you can replace

sed 's/<div class=\"item_description\" itemprop=\"description\">//' | sed 's/<\/div>/\|/'

with

sed 's/<div class=\"item_description\" itemprop=\"description\">//;s/<\/div>/\|/'

If you want to buy 2 things from the shop, there's no reason to walk there and back again twice to buy each thing separately.
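
If you want to convince yourself that the two forms are equivalent without hitting the network, something like the following should do. The sample line is made up and only mimics the markup matched above:

```shell
#!/bin/bash
# Made-up sample line standing in for one line of the page.
line='<div class="item_description" itemprop="description">Some summary.</div>'

# Two sed processes, one substitution each.
two=$(printf '%s\n' "$line" | sed 's/<div class="item_description" itemprop="description">//' | sed 's/<\/div>/|/')

# One sed process, both substitutions separated by a semicolon.
one=$(printf '%s\n' "$line" | sed 's/<div class="item_description" itemprop="description">//;s/<\/div>/|/')

echo "$two"
echo "$one"
```

Both commands print the same text; the second just starts one process fewer.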

Last edited by Xyne (2012-12-30 01:34:17)

Nisstyre56 · 2012-12-30 05:45:00

Note that you can't actually parse HTML with grep.

Arch Linux
