Arch Linux Forums / [SOLVED] Parsing HTML in bash script

Re: [SOLVED] Parsing HTML in bash script

2012-12-30T05:45:00Z

Note that you can't actually parse HTML with grep.
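A small illustration of why, assuming a made-up HTML fragment (the markup below is hypothetical, not IMDb's): grep works line by line, so anything that spans more than one line is cut off.

```shell
#!/bin/bash
# Hypothetical HTML fragment: the second description spans two lines.
html='<div class="item_description">First summary.</div>
<div class="item_description">Second summary,
continued on the next line.</div>'

# grep prints only the lines that contain the pattern, so the
# continuation line of the second description is silently dropped.
matched=$(printf '%s\n' "$html" | grep 'item_description')
printf '%s\n' "$matched"
```

The approach in this thread works only as long as each description sits on a single line of the page source.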

Re: [SOLVED] Parsing HTML in bash script

2012-12-30T01:30:49Z

It's called process substitution, and it's one of the more obscure features of bash. If you wrap a command with "<(" and ")", the expression is replaced with the name of a file (a named pipe or a /dev/fd entry) from which the output of the command can be read. For example,

cat <(curl https://www.archlinux.org)

would do the equivalent of

mkfifo tmp.fifo
curl https://www.archlinux.org > tmp.fifo &
cat tmp.fifo

The additional "<" (with a space) feeds this "input file" into the while loop, just as "< foo.txt" would feed the contents of the file "foo.txt" into STDIN for some command. For example,

while IFS= read -r line
do
  echo "$line"
done < foo.txt

would loop over each line of "foo.txt". Replacing "foo.txt" with "<(curl https://www.archlinux.org)" would loop over each line of the page at "https://www.archlinux.org" instead.
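Here is a minimal sketch of the same construct that you can run without network access. printf stands in for curl, and the variable name count is only for illustration.

```shell
#!/bin/bash
# printf stands in for curl so the example needs no network access.
count=0
while IFS= read -r line
do
  count=$((count + 1))
  echo "line $count: $line"
done < <(printf 'first\nsecond\nthird\n')

# The loop ran in the current shell, so count is still set here.
echo "read $count lines"

# The substitution itself expands to a file name
# (a /dev/fd entry or a FIFO, depending on the system).
echo <(true)
```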

At first I did the same thing as you did in your previous post (temporarily change IFS and then load the content directly into an array), but I then concluded that it is a bad idea. Unexpected characters in the description will break that code. There are some security risks as well, since it could lead to remote code execution, however unlikely it may be for code to make its way into the text. I believe that my suggestion is safer and will handle all possible cases.
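To make the "unexpected characters" point concrete, here is a rough sketch with a made-up string (not real IMDb output). It only demonstrates glob expansion, which is one way the unquoted array assignment can go wrong; the read -a alternative shown is my own suggestion for single-line input, not something from the earlier posts.

```shell
#!/bin/bash
# Hypothetical scraped text: the second field is a lone glob character.
s='First summary|*|Third summary'

# Approach from the earlier post: change IFS and rely on unquoted expansion.
OIFS="$IFS"
IFS='|'
unsafe=($s)   # the "*" field is replaced by file names if the current directory has any
IFS="$OIFS"

# Alternative for a single line: let read do the splitting.
# The text is never subject to glob expansion.
IFS='|' read -r -a safe <<< "$s"

echo "unsafe has ${#unsafe[@]} elements"
echo "safe has ${#safe[@]} elements, second one is: ${safe[1]}"
```

Run it in a directory that contains a few files and the first count will differ from the second.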

Again, note that you can replace

sed 's///' | sed 's/<\/div>/\|/')

with

sed 's///;s/<\/div>/\|/')

If you want to buy 2 things from the shop, there's no reason to walk there and back again twice to buy each thing separately.
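A quick way to convince yourself that the two forms give the same result, using a made-up input line and tag (both are assumptions for illustration only):

```shell
#!/bin/bash
# Hypothetical input line and tag patterns, chosen only for illustration.
input='<p>Some text</p>'

# Two sed processes, one substitution each.
two_processes=$(printf '%s\n' "$input" | sed 's/<p>//' | sed 's/<\/p>/|/')

# One sed process, both substitutions separated by a semicolon.
one_process=$(printf '%s\n' "$input" | sed 's/<p>//;s/<\/p>/|/')

echo "$two_processes"
echo "$one_process"
```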

Re: [SOLVED] Parsing HTML in bash script

2012-12-29T19:22:14Z

I actually came up with a solution a bit ago that seems to work. I would however like to understand your solution. Here is mine:

#!/bin/bash

OIFS="$IFS"
IFS=$'|'
s=$(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's///' | sed 's/<\/div>/\|/')
array=($s)
IFS="$OIFS"

echo -e "Season 1 episode 1\n"
echo "${array[0]}"

Can you explain to me how this while loop works? Is its last line

done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's//\ /;s/<\/div>/\n/')

feeding lines of text into the $line variable? As I said before I'm not too familiar with bash syntax.

Re: [SOLVED] Parsing HTML in bash script

2012-12-29T19:00:38Z

instantaphex wrote:

Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden. I'm not yet sure how to do that. I don't have much experience in bash scripting. Anyone have any ideas of how I might go about this?

#!/bin/bash
paragraphs=()
# populate the array using a while loop and process substitution (direct piping will not work)
while IFS= read -r line
do
  paragraphs+=("$line")
done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's//\ /;s/<\/div>/\n/')

echo first paragraph
echo "${paragraphs[0]}"
echo
echo fifth paragraph
echo "${paragraphs[4]}"
echo
echo all paragraphs
for paragraph in "${paragraphs[@]}"
do
  echo "$paragraph"
done

edit: note that I have combined the separate sed commands into a single one using semicolons to separate the substitutions.
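About the "direct piping will not work" comment: with bash's default settings (the lastpipe option off), each part of a pipeline runs in a subshell, so an array filled inside a piped while loop is empty again once the loop ends. A minimal sketch, with printf standing in for the curl | grep | sed pipeline and array names made up for the comparison:

```shell
#!/bin/bash
# printf stands in for the curl | grep | sed pipeline.
piped=()
printf 'one\ntwo\n' | while IFS= read -r line
do
  piped+=("$line")   # runs in a subshell; the change is lost when the pipeline ends
done

substituted=()
while IFS= read -r line
do
  substituted+=("$line")   # runs in the current shell
done < <(printf 'one\ntwo\n')

echo "piped: ${#piped[@]} elements"
echo "substituted: ${#substituted[@]} elements"
```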

Re: [SOLVED] Parsing HTML in bash script

2012-12-29T18:29:28Z

So I've been trying out a couple of things and have made a little progress. Here is what I have so far:

curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's//\ /' | sed 's/<\/div>/\n/'

Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden. I'm not yet sure how to do that. I don't have much experience in bash scripting. Anyone have any ideas of how I might go about this?

Re: [SOLVED] Parsing HTML in bash script

2012-12-29T18:01:20Z

Thanks for the tip. Changed.

Re: [SOLVED] Parsing HTML in bash script

2012-12-29T08:44:24Z

You may want to change your title, since what you're looking for is ways to properly parse HTML and the title does not indicate that at all.

[SOLVED] Parsing HTML in bash script

2012-12-29T08:30:47Z

So this is what I'm trying to do. I've created a bash function that allows me to play episodes of TNG by using the following syntax:

tng 5 25

This would play season 5 episode 25. I was thinking of ways to extend the script and I thought of adding episode summaries. I want to use curl to extract the HTML from IMDb that contains the descriptions. I've tried parsing with grep and inserting into an array, but every space creates a new element, which won't work. I've also thought of piping stdout to w3m and using text only, but I can't seem to solve this at the moment. Any suggestions?
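The "every space creates a new element" behaviour comes from word splitting on an unquoted command substitution. A rough sketch of the problem and one possible way around it, assuming bash 4 or later for mapfile; the summaries function and its text are made up and stand in for the curl and grep output:

```shell
#!/bin/bash
# Hypothetical episode summaries, one per line.
summaries() {
  printf 'First episode summary.\nSecond episode summary.\n'
}

# Unquoted command substitution splits on every space, tab and newline.
words=( $(summaries) )

# mapfile (bash 4 or later) stores each input line as one element.
mapfile -t lines < <(summaries)

echo "words: ${#words[@]}"
echo "lines: ${#lines[@]}"
echo "${lines[0]}"
```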