Hello,
I'm looking for a tool to extract all links from a file. The file contains lines like this:
./archive?before_time=1316002321.html: <img src="http://assets.tumblr.com/images/small_white_checkmark.png"
I found a tool called urlview. It does a good job, except that it opens the results in a screen-oriented program where I can only get the links one at a time (and I have around 30k).
I found the regexp that urlview uses, but I'm very bad with regexps:
(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]
Unfortunately, when I try to use it, I get this error:
[awesome@Oslo /media/data/tmp 13M]\ $ cat images_list.txt | grep -e (((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]
bash: syntax error near unexpected token `('
Does anybody know a suitable tool, or what the problem with the regexp is?
Thank you very much.
Last edited by hiveNzin0 (2011-10-14 19:46:57)
The regex you gave is balanced as written: it has three opening and three closing parens, and the \) inside the final bracket expression is a literal character, not a group close.
(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]
The syntax error comes from bash, not grep: the pattern is unquoted, so the shell tries to interpret the ( and | characters itself.
You need to enclose it in single quotes:
$ cat images_list.txt | grep -e '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]'
Last edited by rockin turtle (2011-10-14 18:07:35)
Okay, I'll look for information about xmllint, thanks. I tried your last command, rockin turtle, but it doesn't give me any results (although I don't get the error anymore).
That's strange, because I found the regexp in /etc/urlview/system.urlview
Urlview finds around 30k links for me. Maybe the regexp is wrong and urlview doesn't actually use it, or maybe there is a way to select all the links in urlview.
Anyway, thanks for your help. I'll keep looking.
grep -E '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' <filename>
seems to work (note -E instead of -e):
[karol@black ~]$ cat filename
https://bbs.archlinux.org/viewtopic.php?pid=1003560#p1003560
fjkaljflkajd
[karol@black ~]$ grep -e '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' filename
[karol@black ~]$ grep -E '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' filename
https://bbs.archlinux.org/viewtopic.php?pid=1003560#p1003560
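Why -e finds nothing while -E does: plain grep (including grep -e) uses basic regular expressions, where ( ) and | are ordinary literal characters, so the pattern only matches text that contains actual parens and pipes. With -E they become grouping and alternation. Here is a minimal sketch of the difference; count_bre and count_ere are just helper names I made up, and the sample line is invented:

```shell
#!/bin/sh
# Count matching lines under basic (BRE) and extended (ERE) regex syntax.
# count_bre / count_ere are hypothetical helper names, not standard tools.
count_bre() { grep -c -e "$1" || true; }      # grep exits 1 when nothing matches
count_ere() { grep -c -E -e "$1" || true; }

line='see https://example.org/page for details'

# BRE: ( | ) are literals, so this looks for the text "(http|https)://".
printf '%s\n' "$line" | count_bre '(http|https)://'    # prints 0
# ERE: ( | ) group and alternate, so the URL scheme matches.
printf '%s\n' "$line" | count_ere '(http|https)://'    # prints 1
```

In basic syntax the grouping form is \( ... \), and GNU grep also accepts \| for alternation, but switching to -E is usually simpler than escaping everything.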
Last edited by karol (2011-10-14 18:29:07)
Okay thank you for your help.
This command returns the whole matching line, markup included. I think it will be easier for me to modify the urlview source code so that it writes the clean URLs to a file than to learn how to use sed and other tools to get a clean link.
Thanks again.
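For getting only the URL rather than the whole line, grep's -o option may be enough: it prints just the part of each line that matched, one match per line. The following is a rough sketch, not urlview's method. The pattern is a simplified one I am assuming for illustration (http, https and ftp links only, ending at whitespace, quotes or angle brackets), and extract_urls is a made-up function name. Note that -o is available in GNU and BSD grep but is not required by POSIX.

```shell
#!/bin/sh
# Print every URL found in the given files (or stdin), one per line, de-duplicated.
# -o prints only the matched text; -E enables extended regex syntax.
# The pattern is a simplified assumption, not the regexp shipped with urlview.
extract_urls() {
    grep -oE "(https?|ftp)://[^[:space:]\"'<>]+" "$@" | sort -u
}

# Sample line in the format shown in the first post.
printf '%s\n' './archive?before_time=1316002321.html: <img src="http://assets.tumblr.com/images/small_white_checkmark.png"' \
    | extract_urls
# prints: http://assets.tumblr.com/images/small_white_checkmark.png
```

With the file from the first post this would be run as extract_urls images_list.txt > urls.txt. The urlview pattern itself is less suited to -o, because inside a bracket expression grep treats \t as a backslash and the letter t rather than a tab, so matches would be cut short at the first t.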