
#1 2011-10-14 17:54:52

hiveNzin0
Member
Registered: 2011-10-02
Posts: 84

[Shell Script] extract links from file (solved)

Hello,

I'm looking for a tool to extract all links from a file. The file contains lines like this:

./archive?before_time=1316002321.html:                                <img src="http://assets.tumblr.com/images/small_white_checkmark.png"

I found a tool called urlview. It does a good job, except that it opens the results in a screen-oriented program where I can only get the links one at a time (and I have around 30k).

I found the regexp that urlview uses, but I'm very bad with regexps:

(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]

Unfortunately, when I try to use it, I get this error:

[awesome@Oslo /media/data/tmp 13M]\ $ cat images_list.txt | grep -e (((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]
bash: syntax error near unexpected token `('

Does somebody know a tool, or what the problem with the regexp is?

Thank you very much.

Last edited by hiveNzin0 (2011-10-14 19:46:57)


#2 2011-10-14 18:02:18

rockin turtle
Member
From: Montana, USA
Registered: 2009-10-22
Posts: 227

Re: [Shell Script] extract links from file (solved)

The regex you gave has an unmatched paren.

(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]

should probably be:

(((http|https|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]

Also, you probably need to enclose it in single quotes:

$ cat images_list.txt | grep -e '(((http|https|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]'
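On the bash error itself: unquoted, `(` and `|` are shell syntax, so bash rejects the command line before grep ever runs. As far as counting goes, the original pattern has three opening and three closing parentheses (the `\)` sits inside a bracket expression), whereas the variant with `mailto` moved inside the inner group has three opening and only two closing, which `grep -E` would likely reject. A minimal sketch of the quoting effect, using a made-up pattern `(a|x)` rather than the urlview one:

```shell
#!/bin/sh
# Unquoted pattern: the shell sees "(" and "|" as its own syntax and
# rejects the command line before grep is ever started.
if sh -c 'echo x | grep -E (a|x)' 2>/dev/null; then
    unquoted=ok
else
    unquoted=failed
fi

# Single-quoted pattern: the shell passes it to grep untouched.
if sh -c "echo x | grep -qE '(a|x)'"; then
    quoted=ok
else
    quoted=failed
fi

echo "unquoted: $unquoted, quoted: $quoted"
```

Single quotes are the safer choice here because the urlview pattern contains a double quote, which would end a double-quoted string early.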

Last edited by rockin turtle (2011-10-14 18:07:35)


#3 2011-10-14 18:03:24

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: [Shell Script] extract links from file (solved)


#4 2011-10-14 18:24:26

hiveNzin0
Member
Registered: 2011-10-02
Posts: 84

Re: [Shell Script] extract links from file (solved)

Okay, I'll look for information about xmllint, thanks. I tried your last command, rockin turtle, but it doesn't give me any results (though I don't get the error anymore).

That's strange because I found it in /etc/urlview/system.urlview

Urlview finds around 30k links with it. Maybe the regexp is wrong and urlview doesn't actually use it, or maybe there is a way to select all the links in urlview.

Anyway, thanks for your help. I'll keep looking.


#5 2011-10-14 18:27:52

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: [Shell Script] extract links from file (solved)

grep -E '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' <filename>

seems to work:

[karol@black ~]$ cat filename 
https://bbs.archlinux.org/viewtopic.php?pid=1003560#p1003560
fjkaljflkajd
[karol@black ~]$ grep -e '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' filename 
[karol@black ~]$ grep -E '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' filename 
https://bbs.archlinux.org/viewtopic.php?pid=1003560#p1003560
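The difference between the two runs comes down to regex dialect: `-e` only introduces the pattern, and grep defaults to basic regular expressions, where an unescaped `(`, `|` and `)` are literal characters. `-E` switches to extended syntax, where they group and alternate. A minimal sketch, assuming a simplified pattern and a throwaway temp file (neither is from the thread):

```shell
#!/bin/sh
# Sample input: one line with a URL, one without (made-up data).
tmp=$(mktemp)
printf '%s\n' 'see https://bbs.archlinux.org/ for details' 'fjkaljflkajd' > "$tmp"

# Basic regex (the default, also with -e): "(", "|" and ")" are literal,
# so this searches for the literal text "(http|https)://".
# "|| true" only hides grep's non-zero exit status when nothing matches.
bre_count=$(grep -c -e '(http|https)://' "$tmp" || true)

# Extended regex (-E): the same pattern is a group with alternation.
ere_count=$(grep -c -E '(http|https)://' "$tmp" || true)

echo "basic: $bre_count, extended: $ere_count"
rm -f "$tmp"
```

With this input the basic run counts 0 matching lines and the extended run counts 1.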

Last edited by karol (2011-10-14 18:29:07)


#6 2011-10-14 19:46:42

hiveNzin0
Member
Registered: 2011-10-02
Posts: 84

Re: [Shell Script] extract links from file (solved)

Okay thank you for your help.

This command line gets me the whole line, with all the markup. I think it will be easier for me to modify the urlview source code to write the clean URLs to a file than to learn how to use sed and other tools to get a clean link.

Thanks again.
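For getting only the URLs instead of the whole matching line, GNU and BSD grep offer `-o` (not part of POSIX), which prints just the matched text, one match per line. A minimal sketch, assuming a simplified http/https/ftp pattern rather than the urlview one, and a hypothetical helper name `extract_urls`:

```shell
#!/bin/sh
# extract_urls: hypothetical helper. Reads text on stdin and prints each
# URL-looking match on its own line. The pattern is a simplification:
# a match ends at whitespace, a double quote, or an angle bracket.
extract_urls() {
    grep -oE '(https?|ftp)://[^[:space:]"<>]+'
}

# Sample line in the shape shown in the first post.
line='./archive?before_time=1316002321.html:    <img src="http://assets.tumblr.com/images/small_white_checkmark.png"'
printf '%s\n' "$line" | extract_urls
```

On the real file, something like `extract_urls < images_list.txt | sort -u` should give a deduplicated list, though the simplified pattern may need adjusting for links it misses.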

