Hello,
I've got a command which I use to recover text from websites, which I then submit to linguistic analysis (I'm a linguistics researcher). The command might run as follows, for example:
lynx -dump "somewebpage.html" | sed -n '/left-limit/,/right-limit/p' | sed 's/\[.*[.]png\]//g' | sed 's/\[.*[.]gif\]//' | sed 's/\[.*\]//' > newfile
This will take all the text from the webpage between left-limit and right-limit, remove things like [profile.png], [status.gif] and [123], and then print it into "newfile".
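To see what each stage does without fetching a page, the same sed chain can be run on a few lines of made-up text. This is only a sketch: the sample text is invented and stands in for the output of lynx -dump.

```shell
#!/bin/bash
# Hypothetical sample standing in for the output of: lynx -dump "somewebpage.html"
sample='site header
left-limit
[profile.png] Alice wrote:
Hello [123] world
right-limit
site footer'

# Same stages as the command above: keep the range between the two limits,
# then strip [*.png], [*.gif] and any remaining [...] item.
result=$(printf '%s\n' "$sample" |
  sed -n '/left-limit/,/right-limit/p' |
  sed 's/\[.*[.]png\]//g' |
  sed 's/\[.*[.]gif\]//' |
  sed 's/\[.*\]//')

printf '%s\n' "$result"
```

The limit lines themselves are kept, and each removed item leaves its surrounding spaces behind. Because .* is greedy, a line with several bracketed items loses everything from the first [ to the last ].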
Now, what I'd like to be able to do is to turn this into a shell script, with a variable for the url, so I could type something like: myscript.sh somewebpage.html and let it work from there. Or even have the script prompt "Enter url:". But I am completely new to scripting and have no idea how to work with variables like this (despite trying to get my head round it!).
Any help, or pointers in the right direction would be very much appreciated!
Last edited by ranger_g (2011-11-27 21:48:51)
Create myscript.sh with a shebang on the first line, and your lynx/sed concoction on the next line. Replace "somewebpage.html" with "$1" and newfile with "$2". Make it executable, then run it as follows:
myscript.sh <url> <output-file>
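A minimal sketch of those steps, assuming bash and a scratch directory made with mktemp (the directory and the left-limit/right-limit markers are placeholders, not anything specific to a real site). It writes the script, makes it executable and only checks its syntax, so it does not need lynx to run:

```shell
#!/bin/bash
# Scratch location for the sketch; in practice the script can live anywhere.
workdir=$(mktemp -d)
script="$workdir/myscript.sh"

# The quoted 'EOF' keeps $1 and $2 literal while the file is written.
cat > "$script" <<'EOF'
#!/bin/bash
# $1 = url to fetch, $2 = file to write the cleaned text to
lynx -dump "$1" |
  sed -n '/left-limit/,/right-limit/p' |
  sed 's/\[.*[.]png\]//g' |
  sed 's/\[.*[.]gif\]//' |
  sed 's/\[.*\]//' > "$2"
EOF

chmod +x "$script"                      # make it executable
bash -n "$script" && echo "syntax OK: $script"
```

Inside the script, $1 and $2 are the first and second words typed after the script name.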
Great. Thanks very much for your quick answer. I tried what you suggested, in various forms, and after a few error messages realised that I'd have to put the url in inverted commas "url", since this is what lynx likes.
I'm also beginning to understand how these variables might be useful!
You already had the URL in double quotes in your initial example - and I had also included them in my suggested change.
Thanks, yes. I tried it as it was, without quotes around the url on the command line, but got the error:
./myscript.sh: line 2: : Aucun fichier ou dossier de ce type
output-file : commande introuvable
(In English: "No such file or directory" and "command not found".)
I don't seem to be able to use the quotes as part of the variable, i.e. $"1" rather than "$1".
In short, what works:
Quotes around both variables and quotes around the url on the command-line OR
No quotes around either variable and quotes around the command-line url.
It'd be nice to understand what's going on: I imagine that quotes are not available in the syntax, or have to be escaped in some way.
Last edited by ranger_g (2011-11-28 07:04:49)
Can you please post the script, the version which returns the error you posted?
BTW, run
LC_ALL=C <command>
to get output in English.
If you enabled C locale, it should work.
Thanks again. I'm impressed by the speed of this forum's reactions! Sorry the error messages were in French... I didn't realise.
Here's the command with errors (I am in the right directory):
LC_ALL=C ./myscript.sh http://forum.wordreference.com/showthread.php?t=2035381&s=ebfb3a9927bcc92c8c9f0519fead02f5 output-file
[1] 2238
./myscript.sh: line 2: $2: ambiguous redirect
output-file : commande introuvable ("Commande introuvable": no such command.)
And here's myscript.sh:
#!/bin/bash
lynx -dump "$1" | sed -n '/Thread:/,/Quick Navigation/p' | sed 's/\[.*[.]png\]//g' | sed 's/\[.*[.]gif\]//' | sed 's/\[.*\]//' > $2
Again, with this command (i.e. quotes):
./myscript.sh "http://forum.wordreference.com/showthread.php?t=2035381&s=ebfb3a9927bcc92c8c9f0519fead02f5" output-file
and $1... $2 or "$1"..."$2" in the script, it all works fine. If that just means I have to type the quotes, that's not a big problem!
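The unquoted failure looks like the shell's handling of &: outside quotes, & ends the command and runs it in the background (hence the job line [1] 2238). The script then receives only the part of the url before the &, and $2 is empty, which gives the ambiguous redirect. The rest of the line, s=... output-file, is run as a separate command called output-file, which does not exist. A small sketch of the difference, with a hypothetical count_args helper and an example.com url standing in for the real ones:

```shell
#!/bin/bash
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$# argument(s)"; }

# Quoted: the url and the file name arrive as two separate arguments.
quoted=$(count_args "http://example.com/showthread.php?t=1&s=abc" output-file)
echo "quoted:   $quoted"

# Unquoted: the shell stops reading the command at '&' and backgrounds it,
# so only the text before '&' is passed on. The remainder of the line
# ("s=abc output-file") would run as a second command and is left out here.
unquoted=$(count_args http://example.com/showthread.php?t=1 & wait)
echo "unquoted: $unquoted"
```

So the quotes on the command line are needed for any url containing &, whatever the script itself does with $1 and $2.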
Oops, I type faster than I think. Nothing to see here.
Last edited by karol (2011-11-28 07:34:51)
Well, just in case this interests anybody, here's the final script. It might always help a fellow linguist.
#!/bin/bash
# The next two lines ask for the url and for the output file name; these are variables in the command
echo "Please enter the url"
read -r URL
echo "Please enter your output filename"
read -r OUTPUT
# This lynx command takes the text from the webpage, with image links etc.
lynx -dump "$URL" |
# These sed commands 1. eliminate all but the text between Thread: and Quick Navigation, which work as markers of left- and right-limits; 2. and 3. remove text of the form [image.png] or [image.gif]; 4. removes the digits between square brackets. Then the output is transferred to the named file.
sed -n '/Thread:/,/Quick Navigation/p' | sed 's/\[.*[.]png\]//g' | sed 's/\[.*[.]gif\]//' | sed 's/\[.*\]//' > "$OUTPUT"
Of course the sed commands have to be adapted according to whatever the html code is (in particular left- and right-limits). I was aiming at the wordreference.com forums.
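The two approaches in this thread (arguments on the command line, or prompts) could be combined. The following is a sketch only: ask_if_empty is a hypothetical helper, not part of the script above, and the example values are invented. It uses a value if one was given and otherwise reads a line of input.

```shell
#!/bin/bash
# ask_if_empty is a hypothetical helper: it prints its first argument if that
# is non-empty; otherwise it prompts with its second argument and reads a line.
ask_if_empty() {
    local value=$1
    if [ -z "$value" ]; then
        read -r -p "$2" value
    fi
    printf '%s\n' "$value"
}

# A value given up front is used as-is...
ask_if_empty 'http://example.com/page.html' 'Enter url: '
# ...while an empty one is read from standard input (piped in here for the
# demonstration; in normal use it would be typed at the prompt).
echo 'output-file' | ask_if_empty '' 'Enter output filename: '
```

In the script, URL=$(ask_if_empty "$1" 'Enter url: ') and OUTPUT=$(ask_if_empty "$2" 'Enter output filename: ') would then replace the echo and read lines.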