#1 2011-11-27 21:44:43

ranger_g
Member
Registered: 2011-11-27
Posts: 5

Bash script help

Hello,
I've got a command which I use to recover text from websites, which I then submit to linguistic analysis (I'm a linguistics researcher :) ). The command might run as follows, for example:
lynx -dump "somewebpage.html" | sed -n '/left-limit/,/right-limit/p' | sed 's/\[.*[.]png\]//g' | sed 's/\[.*[.]gif\]//' | sed 's/\[.*\]//' > newfile
This will take all the text from the webpage between left-limit and right-limit, remove things like [profile.png], [status.gif] and [123], and then print it into "newfile".
Now, what I'd like to be able to do is turn this into a shell script, with a variable for the url, so I could type something like: myscript.sh somewebpage.html and let it work from there. Or even have a shell prompt "Enter url:". But I am completely new to scripting and have no idea how to work with variables like this (despite trying to get my head round it!).
Any help, or pointers in the right direction would be very much appreciated!

Last edited by ranger_g (2011-11-27 21:48:51)

#2 2011-11-27 21:54:27

tomk
Forum Fellow
From: Ireland
Registered: 2004-07-21
Posts: 9,839

Re: Bash script help

Create myscript.sh with a shebang on the first line, and your lynx/sed concoction on the next line. Replace "somewebpage.html" with "$1" and newfile with "$2". Make it executable, then run it as follows:

myscript.sh <url> <output-file>
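A minimal sketch of what that script could look like, assuming the placeholder markers left-limit and right-limit from the first post. The function name clean_dump is made up for illustration; the sed commands are the ones from the original one-liner.

```shell
#!/bin/bash
# myscript.sh: sketch of the two-argument version described above.
# Usage: ./myscript.sh "<url>" <output-file>

# clean_dump (hypothetical name): keep the text between the two markers,
# then strip bracketed image names and link numbers.
clean_dump() {
    sed -n '/left-limit/,/right-limit/p' |
        sed 's/\[.*[.]png\]//g' |
        sed 's/\[.*[.]gif\]//' |
        sed 's/\[.*\]//'
}

# "$1" is the url, "$2" the output file; both are quoted inside the script.
if [ "$#" -eq 2 ]; then
    lynx -dump "$1" | clean_dump > "$2"
else
    echo "usage: $0 <url> <output-file>" >&2
fi
```

Make it executable with chmod +x myscript.sh before running it.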

#3 2011-11-27 22:12:04

ranger_g
Member
Registered: 2011-11-27
Posts: 5

Re: Bash script help

Great. Thanks very much for your quick answer. I tried what you suggested, in various forms, and after a few error messages realised that I'd have to put the url in inverted commas "url", since this is what lynx likes.
I'm also beginning to understand how these variables might be useful!

#4 2011-11-27 22:21:26

tomk
Forum Fellow
From: Ireland
Registered: 2004-07-21
Posts: 9,839

Re: Bash script help

You already had the URL in double quotes in your initial example - and I had also included them in my suggested change. :)

#5 2011-11-28 07:02:35

ranger_g
Member
Registered: 2011-11-27
Posts: 5

Re: Bash script help

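Thanks, yes. I tried it as it was, without quotes on the command line, but got this error:

./myscript.sh: line 2: : Aucun fichier ou dossier de ce type
output-file : commande introuvable.

("Aucun fichier ou dossier de ce type": no such file or directory; "commande introuvable": command not found.)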

I don't seem to be able to use the quotes as part of the variable, i.e. $"1" rather than "$1".
In short, what works:
Quotes around both variables and quotes around the url on the command-line OR
No quotes around either variable and quotes around the command-line url.
It'd be nice to understand what's going on: I imagine that quotes are not available in the syntax, or have to be escaped in some way.

Last edited by ranger_g (2011-11-28 07:04:49)

#6 2011-11-28 07:09:55

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Bash script help

Can you please post the script, the version which returns the error you posted?


BTW, run

LC_ALL=C <command>

to get output in English.
If you enabled C locale, it should work.
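A small sketch of the LC_ALL=C prefix in use. The assignment applies only to the command it prefixes, so messages printed by the calling shell itself stay in the session's language. The wrapper name in_english and the path /nonexistent-path are made up for illustration.

```shell
#!/bin/bash
# in_english (hypothetical wrapper): run one command with the C locale,
# without changing the locale of the shell that calls it.
in_english() {
    LC_ALL=C "$@"
}

# ls fails on purpose here so that it prints an error message;
# '|| true' only ignores its non-zero exit status for the demo.
in_english ls /nonexistent-path 2>&1 || true
```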

#7 2011-11-28 07:28:19

ranger_g
Member
Registered: 2011-11-27
Posts: 5

Re: Bash script help

Thanks again. I'm impressed by the speed of this forum's reactions! Sorry the error messages were in French... I didn't realise.
Here's the command with errors (I am in the right directory):

LC_ALL=C ./myscript.sh http://forum.wordreference.com/showthread.php?t=2035381&s=ebfb3a9927bcc92c8c9f0519fead02f5 output-file
[1] 2238
./myscript.sh: line 2: $2: ambiguous redirect
output-file : commande introuvable 

("Commande introuvable": no such command.)
And here's myscript.sh:

#!/bin/bash
lynx -dump "$1" | sed -n '/Thread:/,/Quick Navigation/p' | sed 's/\[.*[.]png\]//g' | sed 's/\[.*[.]gif\]//' | sed 's/\[.*\]//' > $2

Again, with this command (i.e. quotes):

./myscript.sh "http://forum.wordreference.com/showthread.php?t=2035381&s=ebfb3a9927bcc92c8c9f0519fead02f5" output-file

and $1... $2 or "$1"..."$2" in the script, it all works fine. If that just means I have to type the quotes, that's not a big problem!
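The difference comes down to the & in the URL. Unquoted, the shell reads & as "run what precedes it in the background", so the line is cut there. The script starts as a background job (hence the [1] 2238) with the shortened URL as its only argument, which leaves $2 empty. The rest, s=... output-file, is then run as a separate command called output-file. Quoting the URL on the command line, with double or single quotes, keeps it in one piece. A minimal sketch of both cases follows; print_args and the example.com URL are made up for illustration.

```shell
#!/bin/bash
# print_args (hypothetical helper): print each argument it receives.
print_args() {
    printf 'arg: <%s>\n' "$@"
}

# Quoted: the URL, '&' included, arrives as one argument.
print_args "http://example.com/showthread.php?t=1&s=abc" output-file

# Unquoted, run through eval so that '&' is parsed as on a typed command line:
# print_args goes to the background with the truncated URL only, and
# 's=abc output-file' then fails as a separate command (error hidden here).
eval 'print_args http://example.com/showthread.php?t=1&s=abc output-file' 2>/dev/null || true
wait
```

Inside the script, "$1" and "$2" are still worth quoting, but that cannot repair an argument the shell has already split before the script starts.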

#8 2011-11-28 07:31:20

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: Bash script help

Oops, I type faster than I think. Nothing to see here.

Last edited by karol (2011-11-28 07:34:51)

#9 2011-11-28 13:19:00

ranger_g
Member
Registered: 2011-11-27
Posts: 5

Re: Bash script help

Well, just in case this interests anybody, here's the final script. It might help a fellow linguist ;)

#!/bin/bash
# The next lines ask for the url and for the output file name; these are variables in the command.
# read -r keeps any backslashes in the input as typed.
echo "Please enter the url"
read -r URL
echo "Please enter your output filename"
read -r OUTPUT
# This lynx command takes the text from the webpage, with image links etc.
# "$URL" is quoted so that characters such as ? or spaces are passed to lynx unchanged.
lynx -dump "$URL" |
# These sed commands 1. eliminate all but the text between Thread: and Quick Navigation, which work as markers of left- and right-limits; 2. and 3. remove text of the form [image.png] or [image.gif]; 4. removes the remaining text between square brackets, such as link numbers. Then the output is written to the named file.
sed -n '/Thread:/,/Quick Navigation/p' | sed 's/\[.*[.]png\]//g' | sed 's/\[.*[.]gif\]//' | sed 's/\[.*\]//' > "$OUTPUT"

Of course the sed commands have to be adapted according to whatever the html code is (in particular left- and right-limits). I was aiming at the wordreference.com forums.
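One thing to watch when adapting the patterns: .* is greedy, so \[.*\] matches from the first [ to the last ] on a line. With two markers on one line, the text between them is removed as well ("a [1]foo [2]bar" becomes "a bar"). A sketch of a tighter variant, assuming nothing inside square brackets needs to be kept; the name strip_brackets is made up for illustration.

```shell
#!/bin/bash
# strip_brackets (hypothetical name): [^]]* matches any run of characters
# other than ']', so each [...] marker is removed on its own, and the g flag
# repeats the substitution for every marker on the line.
strip_brackets() {
    sed 's/\[[^]]*\]//g'
}

printf '%s\n' 'See [1]this post and [smile.gif] [2]that one' | strip_brackets
```

Since this one expression also covers the .png and .gif cases, it could stand in for the last three sed commands of the pipeline.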
