#1 2008-10-15 17:28:32

delacruz
Member
From: /home/houston
Registered: 2007-12-09
Posts: 102

grab text from a website

I am looking to collect data from Yahoo. I want to get data from a page that changes from time to time. For example, on this page I would like data like the following:

15.00    CAC.X    4.00    Down 1.32    3.95    4.10    196    31,084
15.00    CAC.X    4.00    Down 1.32    4.00    4.10    196    31,084
15.00    CAC.X    4.00    Down 1.32    4.00    4.10    196    31,084
15.00    CAC.X    4.00    Down 1.32    4.00    4.15    196    31,084

How can I do this? I know Java and Perl and some PHP. Could anyone give me a jump start and tell me what approach I should take, i.e. which program or scripting language? I am not too sure how to tackle this problem.

#2 2008-10-15 17:42:06

rson451
Member
From: Annapolis, MD USA
Registered: 2007-04-15
Posts: 1,233

Re: grab text from a website

The links browser has a -dump option, which may be of use to you. But are you sure they don't offer a feed for this information, so that scripts don't waste their bandwidth?



#3 2008-10-15 17:52:31

Procyon
Member
Registered: 2008-05-07
Posts: 1,819

Re: grab text from a website

1. Get the html source
2. Write something that can interpret the source
3. Wrap it in a while loop and add a timer.

Step 2 is the hard part, but it's not as hard as you might think. The data you want sits outside the tags, except for Up/Down.

So... even regex can do it. Here is a simple script. It isn't optimized, so that it stays easy to read.

# Poll the page once a minute and print the CAC.X rows as plain text.
while true; do
    wget -q -O - 'http://finance.yahoo.com/q/op?s=C&m=2009-01'  | sed -n '/<tr>/s/<tr>/\n/gp' | grep -F 'CAC.X' | sed 's/down_r[^>]*>/>DOWN/' | sed 's/up_g[^>]*>/>UP/' | sed 's/<[^>]*>/ /g' | sed 's/^ *//' | sed 's/  */  /g'
    sleep 60
done
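
For anyone who wants to see what the sed stages do without hitting the site, here is a rough Python sketch of the same idea, run on a made-up row. The function name clean_rows and the sample markup are assumptions for illustration only; the real page's HTML will differ, and this version collapses whitespace to single spaces rather than double.

```python
import re

def clean_rows(html, symbol):
    """Return whitespace-normalised text for each table row mentioning symbol."""
    rows = []
    # Split on <tr> so each table row becomes its own chunk (the first sed).
    for chunk in html.split("<tr>"):
        if symbol not in chunk:  # like grep, but as a literal match
            continue
        # Turn the up/down arrow markup into words before the tags are removed.
        chunk = re.sub(r"down_r[^>]*>", ">DOWN", chunk)
        chunk = re.sub(r"up_g[^>]*>", ">UP", chunk)
        # Replace every remaining tag with a space.
        chunk = re.sub(r"<[^>]*>", " ", chunk)
        # Trim the ends and collapse runs of whitespace.
        rows.append(" ".join(chunk.split()))
    return rows

# Hypothetical markup, invented for this example.
sample = (
    '<table><tr><td>15.00</td><td><a href="/q?s=CAC.X">CAC.X</a></td>'
    '<td>4.00</td><td><img class="down_r" alt="Down"> 1.32</td></tr>'
    '<tr><td>17.50</td><td>CAD.X</td><td>2.10</td></tr></table>'
)
print(clean_rows(sample, "CAC.X"))  # ['15.00 CAC.X 4.00 DOWN 1.32']
```

Like the shell version, this depends entirely on the page layout staying the same, so expect to adjust the patterns if the site changes.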

#4 2008-10-15 17:54:54

phrakture
Arch Overlord
From: behind you
Registered: 2003-10-29
Posts: 7,879

Re: grab text from a website

Personally, I would write a little Python script that uses urllib2 and BeautifulSoup to parse it, find the info, and then transform it to suit your needs.
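
A minimal sketch of that approach, with two substitutions worth flagging: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything, and it parses a made-up HTML string instead of fetching the page. The names RowParser and parse_rows and the sample table are hypothetical. In Python 3, urllib2's job is done by urllib.request.

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and normalise whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def parse_rows(html):
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows

# Hypothetical markup; to use live data, something like
# urllib.request.urlopen(url).read().decode() would supply the html string.
sample = (
    "<table><tr><th>Strike</th><th>Symbol</th><th>Last</th></tr>"
    "<tr><td>15.00</td><td><a href='#'>CAC.X</a></td><td>4.00</td></tr></table>"
)
print([row for row in parse_rows(sample) if "CAC.X" in row])
```

Working with rows as lists of cells makes the "transform it to your needs" step straightforward, since each column can be picked out by index.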

#5 2008-10-15 18:11:56

delacruz
Member
From: /home/houston
Registered: 2007-12-09
Posts: 102

Re: grab text from a website

Thanks, Procyon, your script works perfectly.

#6 2008-10-15 18:58:50

jwcxz
Member
Registered: 2008-09-23
Posts: 239

Re: grab text from a website

phrakture wrote:

Personally, I would write a little Python script that uses urllib2 and BeautifulSoup to parse it, find the info, and then transform it to suit your needs.

I prefer to use Python for text manipulation, too.  Even though it may not be the fastest way if you're trying to process a very large amount of data, the code is much cleaner.



#7 2008-10-15 21:08:06

scrawler
Member
Registered: 2005-06-07
Posts: 318

Re: grab text from a website

I use w3m for everything like this, so I'll mention it even though probably nobody cares.

w3m 'http://finance.yahoo.com/q/op?s=C&m=2009-01' > stuff.txt

then clean the stuff you don't want out of stuff.txt
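
The cleaning step could look something like the sketch below. It assumes the dumped text has one option per line with whitespace-separated columns, as in the rows from the first post; the helper name pick_rows, the header line, and the CAD.X row are made up for illustration.

```python
def pick_rows(dump_text, symbol):
    """Return the whitespace-split fields of every line that mentions symbol."""
    return [
        line.split()
        for line in dump_text.splitlines()
        if symbol in line.split()  # match a whole field, not a substring
    ]

# In practice the text would come from the file: open("stuff.txt").read()
sample = """Strike  Symbol  Last  Chg  Bid  Ask  Vol  Open Int
15.00    CAC.X    4.00    Down 1.32    3.95    4.10    196    31,084
17.50    CAD.X    2.10    Down 0.90    2.05    2.15    50    12,000
"""
for fields in pick_rows(sample, "CAC.X"):
    print(fields)
```

Note that "Down 1.32" splits into two fields, so the bid and ask end up at indexes 5 and 6 in each returned list.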

#8 2008-10-15 22:02:58

delacruz
Member
From: /home/houston
Registered: 2007-12-09
Posts: 102

Re: grab text from a website

scrawler wrote:

I use w3m for everything like this, so I'll mention it even though probably nobody cares.

w3m 'http://finance.yahoo.com/q/op?s=C&m=2009-01' > stuff.txt

then clean the stuff you don't want out of stuff.txt

I like it!

#9 2009-02-28 22:51:18

SamC
Member
From: Calgary
Registered: 2008-05-13
Posts: 611

Re: grab text from a website

This is a Linux forum. I strongly doubt that your ads for Windows software will be appreciated here.

#10 2009-03-01 03:36:29

skottish
Forum Fellow
From: Here
Registered: 2006-06-16
Posts: 7,942

Re: grab text from a website

SamC wrote:

This is a Linux forum. I strongly doubt that your ads for Windows software will be appreciated here.

Good call. If you see garbage like this in the future, please report it. Banning IP addresses like that one is a pleasure.
