Hello,
I would like to create a small script to parse http://jt.france2.fr/20h/ in order to extract the latest News program available:
i.e.:
mms://a988.v101995.c10199.e.vm.akamaistream.net/7/988/10199/3f97c7e6/ftvigrp.download.akamai.com/10199/cappuccino/production/publication/France_2/Autre/2009/S23/40692_HD_20h_20090602.wmv
etc.
I am not sure how to do that easily (perl, python, bash/awk, or ...?)
Thanks in advance for your help!
Ludo
Offline
i use `lynx -dump $url | grep $searchstring` in many a bash script for this sort of thing (curl or wget could also be used). you could also replace grep with awk if you want to isolate the download_url, send it to `wget -O $filename $download_url`, then call vlc $filename or something... this should get you started. i see some manpages in your future.
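something along these lines, maybe (totally untested, and the grep pattern is just a guess at what the page dump looks like):

#!/bin/bash
# rough sketch: dump the page, grab the first mms link, hand it straight to vlc
# (vlc can open the mms:// stream directly)
url="http://jt.france2.fr/20h/"
download_url=$(lynx -dump "$url" | grep -o 'mms://[^ ]*\.wmv' | head -n 1)
vlc "$download_url"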
good luck
Last edited by brisbin33 (2009-06-02 20:30:05)
//github/
Offline
What about feeding the page to
grep -o 'mms://[^"]\+'
?
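e.g. something like this (untested, and assuming the link actually appears in that page's html):

wget -q -O - "http://jt.france2.fr/20h/" | grep -o 'mms://[^"]\+'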
Offline
It's actually straightforward, but I had to struggle a bit, because the url you gave is wrong. The actual player page can be found at http://jt.france2.fr/player/20h/index-fr.php - this is embedded in an iframe on the front page.
Now, to get the url from this player page with bash and wget, do
$ wget "http://jt.france2.fr/player/20h/index-fr.php" -O - 2>/dev/null | grep -o -E 'mms://.*\.wmv'
mms://a988.v101995.c10199.e.vm.akamaistream.net/7/988/10199/3f97c7e6/ftvigrp.download.akamai.com/10199/cappuccino/production/publication/France_2/Autre/2009/S23/40692_HD_20h_20090602.wmv
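From there you could stuff the url into a variable and hand it to a player, something like (just a sketch, use whatever player you prefer):

url=$(wget "http://jt.france2.fr/player/20h/index-fr.php" -O - 2>/dev/null | grep -o -E 'mms://.*\.wmv')
mplayer "$url"    # or: vlc "$url"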
Offline
Thank you for all the answers and thanks a lot gnud!
Offline
Wow. Slick bash scripting is cool. Hard to read (for me!) though.
Another option is to grab Ruby (or Python, or any other higher-level language you want to learn...) and use a library for parsing HTML. For example, in Ruby you could check out Hpricot or Nokogiri.
Not quite as slick an option, but you'd get to poke around in another language, and if you ever wanted to extend it, using a framework will make your life a lot easier.
You can't really go wrong. Solving problems in an automated fashion with good tools is certainly a joy.
Offline
In this case, Bash is actually (IMHO) simpler.
In two distinct commands, you (1) fetch the html and (2) extract the url.
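Spelled out (using the same url as gnud's example), that would be something like:

wget -q "http://jt.france2.fr/player/20h/index-fr.php" -O page.html
grep -o -E 'mms://.*\.wmv' page.html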
Offline
Hey, this could prove useful to me too...
I want to expand my address book, so I requested a list of my colleagues - that request was denied for privacy reasons. Yet while they deny my request, all the information can be found on a website without any security measures.
So I'm thinking of writing a spider to crawl every http://url/path?id=[1-5000] and save the results in a text file I can parse with PHP (since that's the only language I have experience with) - I might try bash if it is as simple as this...
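Something like this, maybe (just a rough sketch - the url is the placeholder from above, and the sleep is only there so I don't hammer the server):

#!/bin/bash
# fetch every id from 1 to 5000 and save each page to its own file
for id in $(seq 1 5000); do
    wget -q "http://url/path?id=${id}" -O "page_${id}.html"
    sleep 1
done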
Offline