
#1 2009-06-02 20:11:32

Coume
Member
From: UK
Registered: 2008-02-10
Posts: 78

Parse a webpage and return all the URLs mms://***.wmv

Hello,

I would like to create a small script to parse http://jt.france2.fr/20h/ and extract the URL of the latest news program available, e.g.:
mms://a988.v101995.c10199.e.vm.akamaistream.net/7/988/10199/3f97c7e6/ftvigrp.download.akamai.com/10199/cappuccino/production/publication/France_2/Autre/2009/S23/40692_HD_20h_20090602.wmv
etc.

I am not sure how to do that easily (Perl, Python, bash/awk, or something else?)

Thanks in advance for your help!

Ludo


#2 2009-06-02 20:29:29

brisbin33
Member
From: boston, ma
Registered: 2008-07-24
Posts: 1,799

Re: Parse a webpage and return all the URLs mms://***.wmv

I use `lynx -dump $url | grep $searchstring` in many a bash script for this sort of thing (curl or wget could also be used). You could also replace grep with awk if you want to isolate the download_url, then send it to `wget -O $filename $download_url` and call `vlc $filename` or something... This should get you started. I see some manpages in your future.

good luck :)

Last edited by brisbin33 (2009-06-02 20:30:05)


#3 2009-06-02 20:30:26

briest
Member
From: Katowice, PL
Registered: 2006-05-04
Posts: 468

Re: Parse a webpage and return all the URLs mms://***.wmv

What about feeding the page to

grep -o 'mms://[^"]\+'

?
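
To see what that pattern picks up without touching the network, here is a small sketch run against a made-up HTML fragment (the `example.invalid` URLs and the function names `all_mms` and `only_wmv` are assumptions for illustration, not anything from the real page). Note that the pattern matches any mms:// URL up to the closing double quote, so a second grep is needed if only .wmv links are wanted.

```shell
# Made-up HTML fragment for illustration; not taken from the real page.
sample='<a href="mms://example.invalid/clip.wmv">video</a> <a href="mms://example.invalid/radio.asf">audio</a>'

# Print every mms:// URL up to the closing double quote
# (\+ in a basic regex is a GNU grep extension).
all_mms() { grep -o 'mms://[^"]\+'; }

# Keep only the URLs that end in .wmv.
only_wmv() { all_mms | grep '\.wmv$'; }

printf '%s\n' "$sample" | all_mms
printf '%s\n' "$sample" | only_wmv
```

The first call prints both URLs, one per line; the second prints only the .wmv one.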


#4 2009-06-02 20:35:33

gnud
Member
Registered: 2005-11-27
Posts: 182

Re: Parse a webpage and return all the URLs mms://***.wmv

It's actually straightforward, but I had to struggle a bit, because the url you gave is wrong. The actual player page can be found at http://jt.france2.fr/player/20h/index-fr.php - this is embedded in an iframe on the front page.

Now, to get the url from this player page with bash and wget, do

$ wget "http://jt.france2.fr/player/20h/index-fr.php" -O - 2>/dev/null | grep -o -E 'mms://.*\.wmv'
mms://a988.v101995.c10199.e.vm.akamaistream.net/7/988/10199/3f97c7e6/ftvigrp.download.akamai.com/10199/cappuccino/production/publication/France_2/Autre/2009/S23/40692_HD_20h_20090602.wmv
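
One caveat worth sketching: `.*` is greedy, so if a page ever puts two stream URLs on the same line, the pattern above would return everything from the first mms:// to the last .wmv as a single match. The comparison below uses a hypothetical one-line page (the `example.invalid` URLs and the function names `greedy` and `bounded` are made up) and assumes the URLs sit inside double-quoted attributes.

```shell
# Hypothetical one-line page containing two stream URLs.
page='<a href="mms://example.invalid/one.wmv">1</a><a href="mms://example.invalid/two.wmv">2</a>'

# Greedy: .* runs from the first mms:// to the last .wmv on the line.
greedy() { grep -o -E 'mms://.*\.wmv'; }

# Bounded: [^"]* cannot cross a double quote, so each URL is matched separately.
bounded() { grep -o -E 'mms://[^"]*\.wmv'; }

printf '%s\n' "$page" | greedy
printf '%s\n' "$page" | bounded
```

On this sample, `greedy` prints one long line spanning both links, while `bounded` prints the two URLs on separate lines. For the player page as it stood in this thread, both give the same result.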


#5 2009-06-02 22:00:11

Coume
Member
From: UK
Registered: 2008-02-10
Posts: 78

Re: Parse a webpage and return all the URLs mms://***.wmv

Thank you for all the answers and thanks a lot gnud!


#6 2009-06-10 22:36:46

cschep
Member
Registered: 2006-12-02
Posts: 125

Re: Parse a webpage and return all the URLs mms://***.wmv

Wow. Slick bash scripting is cool. Hard to read (for me!) though.

Another option is to grab Ruby (or Python, or any other higher-level language you want to learn...) and use a library for parsing HTML. For example, in Ruby you could check out Hpricot or Nokogiri.

Not quite as slick of an option, but you could poke around another language and if you ever wanted to extend it, using a framework will make your life a lot easier.

You can't really go wrong. Solving problems in an automated fashion with good tools is certainly a joy.


#7 2009-06-10 23:06:34

gnud
Member
Registered: 2005-11-27
Posts: 182

Re: Parse a webpage and return all the URLs mms://***.wmv

In this case, Bash is actually (IMHO) simpler.
In two distinct commands, you (1) fetch the HTML and (2) extract the URL.
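
Spelled out as a script, the two steps might look roughly like this. It is only a sketch: the function names `fetch_page` and `extract_url` are made up here, and the pattern is the one from post #4.

```shell
# Step 1: fetch the HTML and write it to stdout
# (-q silences wget's progress output, -O - sends the page to stdout).
fetch_page() { wget -q -O - "$1"; }

# Step 2: print only the part of each line that looks like an mms:// ... .wmv URL.
extract_url() { grep -o -E 'mms://.*\.wmv'; }

# Usage (needs network access; the page URL is the one from post #4):
# fetch_page "http://jt.france2.fr/player/20h/index-fr.php" | extract_url
```

Keeping the steps separate means the extraction can be tried on a saved copy of the page, e.g. `extract_url < saved.html`, before anything is fetched.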


#8 2009-06-12 13:19:09

zenlord
Member
From: Belgium
Registered: 2006-05-24
Posts: 1,223

Re: Parse a webpage and return all the URLs mms://***.wmv

Hey, this could prove useful to me too...

I want to expand my address book and I have requested a list of my colleagues, but that request was denied for privacy reasons. While they deny my request, all the information can be found on a website without any security measures :/

So I'm thinking of writing a spider to crawl every http://url/path?id=[1-5000] and save the results in a textfile I can parse with PHP (since that's the only language I have experience with) - I might try bash if it is as simple as this...

