So this is what I'm trying to do. I've created a bash function that lets me play episodes of TNG using the following syntax:
tng 5 25
This would play season 5, episode 25. I was thinking of ways to extend the script and thought of adding episode summaries. I want to use curl to fetch the HTML from IMDb that contains the descriptions. I've tried parsing it with grep and inserting the results into an array, but every space creates a new element, which won't work. I've also thought of piping stdout to w3m and using the text-only output, but I can't seem to solve this at the moment. Any suggestions?
Last edited by instantaphex (2012-12-29 19:22:56)
You may want to change your title, since what you're looking for is ways to properly parse html and the title does not indicate that at all.
Thanks for the tip. Changed.
So I've been trying out a couple of things and have made a little progress. Here is what I have so far:
curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /' | sed 's/<\/div>/\n/'
Now I suppose if I could populate each array element with the paragraphs that result by moving on to the next element every time a line break is reached I would be golden. I'm not yet sure how to do that. I don't have much experience in bash scripting. Anyone have any ideas of how I might go about this?
#!/bin/bash
paragraphs=()
# populate the array using a while loop and process substitution
# (piping into the loop would run it in a subshell, so the array would be lost)
# read -r keeps backslashes in the text intact
while read -r line
do
    paragraphs+=("$line")
done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /;s/<\/div>/\n/')
echo first paragraph
echo "${paragraphs[0]}"
echo
echo fifth paragraph
echo "${paragraphs[4]}"
echo
echo all paragraphs
for paragraph in "${paragraphs[@]}"
do
    echo "$paragraph"
done
edit: note that I have combined the separate sed commands into a single one using semicolons to separate the substitutions.
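As an aside, bash 4 and later also ship a builtin, mapfile, that does the same line-by-line array fill without an explicit loop. The following is only a sketch: fetch_descriptions is a made-up stand-in for the curl | grep | sed pipeline so that it runs without network access.

```shell
#!/bin/bash
# Hypothetical stand-in for the curl | grep | sed pipeline:
# prints one description per line.
fetch_descriptions() {
    printf '%s\n' 'First episode summary.' 'Second episode summary.' 'Third episode summary.'
}

# mapfile reads each line of stdin into one array element;
# -t strips the trailing newline from every element.
mapfile -t paragraphs < <(fetch_descriptions)

echo "count: ${#paragraphs[@]}"
echo "second: ${paragraphs[1]}"
```

Unlike read, mapfile does not trim leading or trailing whitespace, so the sed expression may need adjusting if the lines are indented.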
Last edited by Xyne (2012-12-29 19:01:36)
I actually came up with a solution a bit ago that seems to work. I would however like to understand your solution. Here is mine:
#!/bin/bash
OIFS="$IFS"
IFS=$'|'
s=$(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">//' | sed 's/<\/div>/\|/')
array=($s)
IFS="$OIFS"
echo -e "Season 1 episode 1\n"
echo "${array[0]}"
Can you explain to me how this while loop works? Is the last line of it
done < <(curl -s http://www.imdb.com/title/tt0092455/episodes?season=1 | grep "item_description" | sed 's/<div class=\"item_description\" itemprop=\"description\">/\ /;s/<\/div>/\n/')
feeding lines of text into the $line variable? As I said before I'm not too familiar with bash syntax.
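It's one of the more obscure features of bash, called process substitution. If you wrap a command with "<(" and ")", the expression is replaced with the path to a file-like object (a named pipe or a /dev/fd entry, depending on the system) from which the output of the command can be read. For example,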
cat <(curl https://www.archlinux.org)
would do the equivalent of
mkfifo tmp.fifo
curl https://www.archlinux.org > tmp.fifo &
cat tmp.fifo
The additional "<" (with a space) feeds this "input file" into the while loop, just as "< foo.txt" would feed the contents of the file "foo.txt" into STDIN for some command. For example
while read line
do
    echo "$line"
done < foo.txt
would loop over each line of "foo.txt". Replacing "foo.txt" with "<(curl https://www.archlinux.org)" would loop over the content of "https://www.archlinux.org" instead.
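To see why the loop is fed with "< <(...)" rather than a plain pipe, here is a small sketch. produce_lines is a hypothetical stand-in for any command that prints several lines, and the behaviour shown assumes bash's default settings (the lastpipe option is off).

```shell
#!/bin/bash
# Hypothetical stand-in for a command that prints several lines.
produce_lines() {
    printf '%s\n' 'one' 'two' 'three'
}

# Piping: the while loop runs in a subshell, so changes to the
# array are lost when the pipeline finishes.
piped=()
produce_lines | while read -r line; do
    piped+=("$line")
done

# Process substitution: the loop runs in the current shell,
# so the array keeps its elements.
substituted=()
while read -r line; do
    substituted+=("$line")
done < <(produce_lines)

echo "piped: ${#piped[@]} elements"
echo "substituted: ${#substituted[@]} elements"
```

With default settings the first array ends up empty and the second holds all three lines.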
At first I did the same thing as you did in your previous post (temporarily change IFS and then load the content directly into an array), but I then concluded that it is a bad idea. Unexpected characters in the description will break that code: the unquoted expansion is also subject to pathname expansion, so a description containing glob characters such as "*" could be replaced by filenames. There are some security risks as well, as it could lead to remote code execution, however unlikely it may be for code to make its way into the text. I believe that my suggestion is safer and will handle all possible cases.
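A rough illustration of how the IFS approach can go wrong, and one alternative that keeps the "|" separator. Everything here is made up for the demonstration: the scratch directory, the file names and the input string s. It also assumes the input is a single line, since read stops at the first newline.

```shell
#!/bin/bash
# Work in a scratch directory so the glob has known files to match.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch file_a file_b

# Hypothetical input: the first "description" is a lone asterisk.
s='*|second description'

# Unquoted expansion: the string is split on '|', then each word
# undergoes pathname expansion, so '*' becomes the file names.
OIFS="$IFS"
IFS='|'
unsafe=($s)
IFS="$OIFS"

# read -a splits on IFS without performing pathname expansion,
# and the IFS change applies only to this one command.
IFS='|' read -r -a safe <<< "$s"

echo "unsafe: ${unsafe[*]}"
echo "safe:   ${safe[*]}"

# Clean up the scratch directory.
cd "$OLDPWD" && rm -rf "$scratch"
```

The unsafe array picks up file_a and file_b in place of the asterisk, while the safe array keeps the text exactly as it was.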
Again, note that you can replace
sed 's/<div class=\"item_description\" itemprop=\"description\">//' | sed 's/<\/div>/\|/'
with
sed 's/<div class=\"item_description\" itemprop=\"description\">//;s/<\/div>/\|/'
If you want to buy 2 things from the shop, there's no reason to walk there and back again twice to buy each thing separately.
Last edited by Xyne (2012-12-30 01:34:17)
Note that you can't actually parse HTML with grep.