[Shell Script] extract links from file (solved)

hiveNzin0 · 2011-10-14 17:54:52

Hello,

I'm looking for a tool to extract all links from a file. The file contains lines like this:

./archive?before_time=1316002321.html: <img src="http://assets.tumblr.com/images/small_white_checkmark.png"

I found a tool called urlview. It does a good job, except that it opens the results in a screen-oriented program, so I can only get the links one at a time (and I have around 30k).

I found the regexp that urlview uses, but I'm very bad with regexps:

(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]

Unfortunately, when I try to use it, I get this error:

[awesome@Oslo /media/data/tmp 13M]\ $ cat images_list.txt | grep -e (((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]
bash: syntax error near unexpected token `('

Does anybody know of a tool, or what the problem with the regexp is?

Thank you very much.

Last edited by hiveNzin0 (2011-10-14 19:46:57)

rockin turtle · 2011-10-14 18:02:18

The regex you gave has an unmatched paren.

(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]

should probably be:

(((http|https|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]

Also, you probably need to enclose it in single quotes:

$ cat images_list.txt | grep -e '(((http|https|ftp|gopher|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]'

Last edited by rockin turtle (2011-10-14 18:07:35)

karol · 2011-10-14 18:03:24

Try xmllint https://bbs.archlinux.org/viewtopic.php … 18#p841718

hiveNzin0 · 2011-10-14 18:24:26

Okay, I'll look for information about xmllint, thanks. I tried your last command, rockin turtle, but it doesn't give me any results (though I no longer get the error).

That's strange, because I found the regexp in /etc/urlview/system.urlview.

urlview finds around 30k links with it (maybe the regexp is wrong and urlview doesn't actually use it; maybe there is a way to select all the links in urlview).

Anyway, thanks for your help. I'll keep looking.

karol · 2011-10-14 18:27:52

grep -E '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' <filename>

seems to work:

[karol@black ~]$ cat filename 
https://bbs.archlinux.org/viewtopic.php?pid=1003560#p1003560
fjkaljflkajd
[karol@black ~]$ grep -e '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' filename 
[karol@black ~]$ grep -E '(((http|https|ftp|gopher)|mailto)[.:][^ >"\t]*|www\.[-a-z0-9.]+)[^ .,;\t>">\):]' filename 
https://bbs.archlinux.org/viewtopic.php?pid=1003560#p1003560
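
For anyone wondering why the two runs above differ, here is a minimal sketch. With -e, grep still reads the pattern as a basic regular expression, in which ( ) and | are ordinary characters; -E switches to extended syntax, where they group and alternate. The sample line and the shortened pattern are made up for illustration and are not taken from urlview.

```shell
# Made-up sample line (an assumption for this demo, not from the thread's data)
sample='see https://example.org/page for details'

# Basic regex (the default, also with -e): ( ) and | are ordinary characters,
# so this looks for the literal text "(http|https)://"
bre_count=$(printf '%s\n' "$sample" | grep -c -e '(http|https)://' || true)

# Extended regex (-E): ( ) group and | separates alternatives
ere_count=$(printf '%s\n' "$sample" | grep -c -E '(http|https)://' || true)

echo "basic: $bre_count, extended: $ere_count"
```

With GNU grep this should report 0 matching lines for the basic run and 1 for the extended run, which mirrors the empty output of the -e command above.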

Last edited by karol (2011-10-14 18:29:07)

hiveNzin0 · 2011-10-14 19:46:42

Okay, thank you for your help.

This command gives me the whole matching line, markup included. I think it will be easier for me to modify the urlview source code to write the clean URLs to a file than to learn how to use sed and other tools to get a clean link.

Thanks again.
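
A possible alternative to patching urlview, as a rough sketch only: grep's -o option prints just the part of each line that matched, not the whole line. The function name extract_urls and the simplified pattern are assumptions for illustration, not what urlview uses; the pattern only catches http, https and ftp URLs and stops at a space, > or double quote. As far as I can tell, grep reads \t inside a bracket expression as a backslash plus the letter t, so the urlview pattern would cut URLs short at the first "t" when combined with -o, which is why it is not reused here.

```shell
# extract_urls: hypothetical helper, not part of urlview or grep.
# Prints every URL found in the given file(s), one per line, duplicates removed.
extract_urls() {
    # -o  print only the matched text, not the whole line
    # -h  never prefix output with the file name
    # -E  extended regex, so ( ) and | work as grouping and alternation
    # Simplified pattern (an assumption): scheme, "://", then everything
    # up to the next space, > or double quote.
    grep -ohE '(https?|ftp)://[^ >"]+' "$@" | sort -u
}

# Example call, using the file name from the first post:
# extract_urls images_list.txt > links.txt
```

The sort -u step is optional; it is there because the same asset URL tends to appear many times in a list like this.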

Arch Linux
