#1 2018-06-24 19:05:43

ss
Member
Registered: 2018-06-04
Posts: 89

How to write bash script which takes data from STDIN & user keypress?

I am trying to write a little bash script to parse the download links from my emails in mutt.
Since I want to invoke the download directly from mutt and have this line in ~/.muttrc

macro index,pager \ca "<pipe-message>getall<Enter>" "Download all links"

my script needs to take its data from STDIN (or does it?).

My script looks something like this

#!/bin/bash
# getall
    
    ## extract url from messages
    # 1. parse html
    # 2. remove '=' at the end of line and the newline
    # 3. replace space with newline
    # 4. remove duplicated links
    # 5. Show extracted URLs for check up
    URLs=$(w3m -I "utf-8" -T text/html |\
        sed -e ':a;N;$!ba;s/=\n//g;s/=0D//g' |\
        tr ' ' '\n' |\
        grep "some known hosts" |\
        sort | uniq
        )
    echo "$URLs"

    # user key input
    while true; do
        read -p "Download these links? [Y/n]" key
        case $key in
        [yY]* )
            # replace newline with space
            URLs=$(echo "$URLs" | tr '\n' ' ')
            # queue all links to pyLoad
            pyLoadCli add $URLs
            pyLoadCli status
            break;;
        [nN]* )
            echo "Download nothing"
            exit;;
        * )
            echo "Dude, Y or n?";;
        esac
    done

When I press ctrl-a in mutt, the URLs are extracted correctly. However, the part that accepts the user's key press doesn't work, because stdin is flooded with the email text and the script just keeps asking "Dude, Y or n?".

How should I change the program?

Last edited by ss (2018-06-24 19:08:35)

#2 2018-06-24 22:01:05

seth
Member
Registered: 2012-09-03
Posts: 51,046

Re: How to write bash script which takes data from STDIN & user keypress?

It won't accept the key because the terminal is not the stdin.
I assume you'll have to run two passes: one to dump the relevant data from pipe-message into a temporary file, and another one with an interactive shell that reads that file and asks you what to do with it.
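
Roughly like this, as a sketch (the dump file name and the split into two scripts are made up):

# pass 1 - the script called from the mutt macro: just dump the message
cat > /tmp/getall.msg

# pass 2 - a second script, run from an interactive shell: parse the dump and ask
URLs=$(your-parse-pipeline < /tmp/getall.msg)
echo "$URLs"
read -p "Download these links? [Y/n] " key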

#3 2018-06-24 22:33:10

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: How to write bash script which takes data from STDIN & user keypress?

No need for two passes; just reopen stdin to read from the controlling terminal:

# read from pipe here

exec < /dev/tty

# read from keyboard here
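
Applied to your script, that would look something like this (a sketch):

URLs=$(w3m -I "utf-8" -T text/html | ...)    # still reads the mail from the pipe
echo "$URLs"

exec < /dev/tty    # reattach stdin to the controlling terminal

read -p "Download these links? [Y/n] " key   # now reads the keyboard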

On a different note, that w3m pipeline looks far more complicated than it needs to be.  You should (almost) never need to pipe between all those text processing tools; just pick one.

There's definitely no need to later convert newlines to spaces; the unquoted variable expansion will do this anyway.
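
For example, a minimal demo of that word splitting:

URLs=$(printf 'http://a.example\nhttp://b.example\n')
pyLoadCli add $URLs    # unquoted: the shell splits on the newlines,
                       # so pyLoadCli receives two separate URL arguments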

Last edited by Trilby (2018-06-24 22:36:47)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#4 2018-06-24 23:25:22

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:

No need for two passes; just reopen stdin to read from the controlling terminal:

# read from pipe here

exec < /dev/tty

# read from keyboard here

This is great!


Trilby wrote:

On a different note, that w3m pipeline looks far more complicated than it needs to be.  You should (almost) never need to pipe between all those text processing tools; just pick one.

I'm glad you took up that part. I always felt I was overcomplicating this kind of text-processing command due to limited knowledge and skill. Can we talk about this example a little bit more?
The crazy HTML mails which contain the links I want to parse look, in a much simplified reduction, something like this:

=0D
=0D
https://bbs.archlinux.org https://bbs.=
archlin=
ux.org
=0D
=0D

Some of the links are separated across several lines, but the cut points are always '=' followed by a newline.

This part was taken from the internet. I need it to join the broken URLs. It also deals with the edge case of reaching the last line.

sed -e ':a;N;$!ba;s/=\n//g;s/=0D//g'
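
As far as I understand it: `:a` defines a label, `N` appends the next input line to the pattern space, and `$!ba` branches back to `a` on every line except the last, so the whole input ends up in one pattern space before the two substitutions strip the `=\n` soft breaks and the `=0D` remnants.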

While the `tr` and `grep` parts can clearly be folded into `sed`, how can I remove duplicate lines with sed as simply and clearly as with `sort | uniq`?

Basically, I also prefer not to mix too many tools that accomplish the same thing. But I also think too much `sed` is not very readable.
I think `sed` and `awk` are very powerful, but they are general-purpose tools. Sometimes, if there is another core tool that is more tailor-made for my purpose, maybe it is not a bad idea to use it?

Trilby wrote:

There's definitely no need to later convert newlines to spaces; the unquoted variable expansion will do this anyway.

You are right. I didn't know that.

Last edited by ss (2018-06-24 23:34:15)

#5 2018-06-25 00:24:01

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: How to write bash script which takes data from STDIN & user keypress?

Dedupping in sed is far from practical, but as you note, the tr can be done in sed, and the grep definitely can be too.  It's hard to write, though, as the initial grep looks like a placeholder - I'm not sure exactly what it needs to do.  But if you are using sed, I'd ditch the deletion of the uninteresting lines, use sed's -n flag, and then only print the lines you actually want.
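
Something in this direction, as a sketch only (the pattern is a placeholder since I don't know your real hosts):

sed -n 's/=0D//g; /some known host/p'

With -n nothing is printed by default, so the trailing p command doubles as the grep.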

But if you want dedupping in the same process, awk is very good at that.  This does not yet do what the grep filter would, but that'd be easy to add:

/=$/ {
	sub(/=$/, "", $0)
	url=$0
	while (getline && $1 ~ /^[^=]/) {
		sub(/=$/, "", $0)
		url=url$0
	}
	split(url, parts)
	for (i in parts)
		urls[parts[i]] = 1
}
END {
	for (url in urls)
		print url
}
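
Saved as, say, dedup.awk (name made up), it would slot into your pipeline as:

w3m -I "utf-8" -T text/html | awk -f dedup.awk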

But the big question would be why are you putting this through w3m to format it, just to try to undo and work around that formatting?  Raw HTML would be much, much easier to work with.

Or if you use w3m, get it to do the work for you:

w3m -o display_link_number=1 | awk '/^References/ { ON=1; getline; } ON { links[$2]=1; } END { for (link in links) print link; }'
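
For reference, with that option w3m appends a numbered link list roughly of this shape (illustrative, not your actual output):

References:

[1] https://bbs.archlinux.org
[2] https://www.archlinux.org

which is why the awk takes the second field of each line after the References heading.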

Last edited by Trilby (2018-06-25 00:46:50)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#6 2018-06-25 00:46:31

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:

But the big question would be why are you putting this through w3m to format it, just to try to undo and work around that formatting?  Raw HTML would be much, much easier to work with.

Believe it or not, the links I want to "parse" are actually all written in plain text in the messages. The other "real" hyperlinks don't interest me.
But I can only get the messages as HTML mail. The URLs I'm interested in are therefore sometimes broken across several lines, or duplicated because the literal link also gets a hyperlink ... Here again we can see how crazy the world has become.

Maybe I am doing something utterly stupid and there is a very simple way to extract those links.
I didn't think about it much before I gave urlscan a test run. It could detect every possible link except the ones I need. So I decided to try it myself.

#7 2018-06-25 00:48:12

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: How to write bash script which takes data from STDIN & user keypress?

ss wrote:

Believe it or not, the links I want to "parse" are actually all written in plain text in the messages.

In that case, why are you using w3m at all?  awk and sed parse text ... take out the middle man.


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#8 2018-06-25 00:52:09

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

I just found out that edbrowse could be a promising solution for my particular case. A brief test showed a much better result than the w3m parser. And apparently edbrowse is much more powerful and flexible in terms of post-processing. But I will need some time to learn how to use it, since I don't have any experience with ed.

#9 2018-06-25 00:55:57

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:
ss wrote:

Believe it or not, the links I want to "parse" are actually all written in plain text in the messages.

In that case, why are you using w3m at all?  awk and sed parse text ... take out the middle man.

Maybe you are right. I will give it a try.

My thought process was straightforward and naive: I saw the original HTML, which is full of garbage. Then I saw the autoview result in mutt, which is generated by w3m and looks very human-readable. Consequently, I thought it was a good idea to work with the w3m-parsed text instead of the original HTML.

#10 2018-06-25 01:21:27

jasonwryan
Anarchist
From: .nz
Registered: 2009-05-09
Posts: 30,424
Website

Re: How to write bash script which takes data from STDIN & user keypress?

Like I said in your other thread, forget about parsing and just use extract_url with the -l flag.


Arch + dwm   •   Mercurial repos  •   Surfraw

Registered Linux User #482438

#11 2018-06-25 09:15:08

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

jasonwryan wrote:

Like I said in your other thread, forget about parsing and just use extract_url with the -l flag.

Thank you for making me aware of extract_url again. I should have read the mutt article in the Arch wiki more thoroughly. The package available for Arch is aur/perl-extract-url.
It looks very promising at first glance, and seems to work better than urlscan. But in my particular case it cannot correctly parse the links I need, at least not in my first attempts.
Since I am extracting plain-text URLs from HTML mails, it produces better results with `extract_url -t -l`.

The main problem is that it doesn't handle URLs which are divided across several lines. For instance,

=0D<br>
 https:/=
 /archlinu=
 x.org http://bbs.archli=
 nux.org=0D<br>
=0D<br>

can only be detected as

http://bbs.archli%3D

The example above is constructed, but the nature of the problem is exactly the same in my mails: I get truncated links.
Maybe it just needs a minor change in the source or config to produce the right result for my case, but I don't know perl yet.

On the other hand, the little script I put together myself can at least parse the URLs correctly for me, although it is probably reinventing a wheel that others have built far better.

#12 2018-06-25 10:09:49

progandy
Member
Registered: 2012-05-17
Posts: 5,190

Re: How to write bash script which takes data from STDIN & user keypress?

That looks like quoted-printable encoding. A macro like this should make mutt decode it first:

macro index,pager \ca "<enter-command>set pipe_decode=yes<enter>\
<pipe-message>YOUR_EXTRACT_COMMAND<Enter>\
<enter-command>set pipe_decode=no<enter>" "Download all links"

Edit: extract_url should be able to do the quoted-printable decoding by itself, though. Do not use the -t option!
Edit: that only works if you give extract_url the whole pipe-message output, including the mail headers.

Last edited by progandy (2018-06-25 10:17:43)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

#13 2018-06-25 10:15:17

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:

But if you want dedupping in the same process, awk is very good at that.  This does not yet do what the grep filter would, but that'd be easy to add:

/=$/ {
	sub(/=$/, "", $0)
	url=$0
	while (getline && $1 ~ /^[^=]/) {
		sub(/=$/, "", $0)
		url=url$0
	}
	split(url, parts)
	for (i in parts)
		urls[parts[i]] = 1
}
END {
	for (url in urls)
		print url
}

This is much better than the sed part that handles `=$\n`. Way clearer and more readable, especially for me in 6 months. :)

But it turns out that I still need w3m or something similar as an HTML pre-filter before passing the text to awk and co.
Otherwise, feeding in the raw HTML alone, I get results like

https://bbs.archlinux.org
https://bbs.archlinux.org=0D<br>
https://bbs.archlinux.org<br>&gt;<br>&gt;<br>&gt;<br>&gt;<br><br>

Or I have to remove the `=0D<br>` with my own code.

#14 2018-06-25 10:36:59

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: How to write bash script which takes data from STDIN & user keypress?

ss wrote:

Or I have to remove the `=0D<br>` with my own code.

That's easy to do - but I can't provide exact code as you've not yet provided exact input.

But I agree with JWR and Progandy, who are both giving specific examples of what I've been getting at: get the original in the most useful format first.  Incremental munging and mangling like this only leads to headaches.  This is basically a subtle form of an X-Y problem: you need to extract (a subset of) urls from emails, and you think the best way to do that is to start with one flavor of output from mutt passed through one specific invocation of w3m into sed/awk/grep.  But rather than fine tuning the sed/awk/grep parts of this process, you should go back to the start and check those two very large assumptions.

I don't know what your actual emails look like, so I can't give specifics, but I am confident that these assumptions likely do not hold.

Last edited by Trilby (2018-06-25 10:37:13)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#15 2018-06-25 10:40:20

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

progandy wrote:

That looks like quoted-printable encoding. A macro like this should make mutt decode it first:

macro index,pager \ca "<enter-command>set pipe_decode=yes<enter>\
<pipe-message>YOUR_EXTRACT_COMMAND<Enter>\
<enter-command>set pipe_decode=no<enter>" "Download all links"

Edit: extract_url should be able to do the quoted-printable decoding by itself, though. Do not use the -t option!
Edit: that only works, if you give extract url the whole pipe-message output including mail headers.

The quoted-printable decoding is handled by the macro you suggested. But `extract_url -l` still can't handle the multiline URLs correctly.

To be more specific, the raw html contains links like this:

https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup/loose/=
xxx%xxx%xxx%xxx.7z https://objects-us-west-1.dre=
am.io/xxxxxxxxxx-dh-data-backup/loose/xxx%xxx%=
xxx%xxx.7z=0D<br>
=0D<br>
 =0D<br>
=0D<br>

The output I get with extract_url is

https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup/loose/

URLs broken over several lines are not repaired correctly.

#16 2018-06-25 10:45:12

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

progandy wrote:

Edit: that only works if you give extract_url the whole pipe-message output, including the mail headers.

~/.muttrc
macro index,pager \cb "<enter-command>set pipe_decode=yes<enter> <pipe-message>extract_url -l | grep object<Enter> <enter-command>set pipe_decode=no<enter>" "Download all links"

I think the mail headers are included? (I am really asking)

Last edited by ss (2018-06-25 10:55:43)

#17 2018-06-25 10:52:30

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

I would love to just attach the original raw HTML mails; that would certainly make it easier for others to help me. But those mails are from a private group and contain a lot of email addresses and personal information. Nothing really secret or any big deal, but out of respect for the other group members I would rather not post them.

I didn't know the mail headers were also relevant. Sorry.

#18 2018-06-25 10:56:12

progandy
Member
Registered: 2012-05-17
Posts: 5,190

Re: How to write bash script which takes data from STDIN & user keypress?

The edit was an alternative to pipe_decode; you should not use both. Remove the pipe_decode commands and let extract_url handle everything.

macro index,pager \cb "<pipe-message>extract_url -l<enter>" "Extract all links"

Here is a sample mail I used to test it:

From: me@example.com
MIME-Version: 1.0
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
To: you@example.com

<html><head><title>example.com mail</title></head>
<body>
https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup/loose/=
xxx%xxx%xxx%xxx.7z https://objects-us-west-1.dre=
am.io/xxxxxxxxxx-dh-data-backup/loose/xxx%xxx%=
xxx%xxx.7z=0D<br>
=0D<br>
 =0D<br>
=0D<br>
</body>
</html>
extract_url -l <test.mbox

Last edited by progandy (2018-06-25 10:59:21)


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

#19 2018-06-25 11:07:36

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:

 
This is basically a subtle form of an X-Y problem: you need to extract (a subset of) urls from emails, and you think the best way to do that is to start with one flavor of output from mutt passed through one specific invocation of w3m into sed/awk/grep.  But rather than fine tuning the sed/awk/grep parts of this process, you should go back to the start and check those two very large assumptions.

I certainly agree with you about trying to turn the right screw.
In fact, I didn't and don't *think* the w3m filter is the best. It's just the only one I know about. I asked in another thread about the HTML preprocessor part, so I certainly would like to get the format right in the first place and keep the customized awk/sed code to a minimum.

I don't actually use w3m. I'm using it here just like people use it for terminal image previews in ranger. I am more than happy to learn about alternative, better solutions.

#20 2018-06-25 11:43:18

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

progandy wrote:

The edit was an alternative to pipe_decode; you should not use both. Remove the pipe_decode commands and let extract_url handle everything.

Your example also works for me.

But with

macro index,pager \cb "<pipe-message>extract_url -l<enter>" "Extract all links"

I still get

https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup/loose/

So the problem is not extract_url, and the part of the URL I posted was not enough. It must have something to do with the mail format itself.

I don't pretend to know whether the following part is relevant. But the original message is 2500 lines and I don't have the capacity to censor all the personal information. Please ask me to provide other parts if you think that may help.

MIME-Version: 1.0
Precedence: bulk
From: 
To: 
Subject:
X-Yahoo-Newman-Property: groups-digest-ff-m
Reply-To: "No Reply"<notify@yahoo.com>
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/htm=
l4/strict.dtd">
<html>
<head>
<style type=3D"text/css">
<!--
#ygrp-mkp {
  border: 1px solid #d8d8d8;
  font-family: Arial;
  margin: 10px 0;
  padding: 0 10px;
}

#ygrp-mkp hr {
  border: 1px solid #d8d8d8;
}

#ygrp-mkp #hd {
  color: #628c2a;
  font-size: 85%;
  font-weight: 700;
  line-height: 122%;
  margin: 10px 0;
}

#ygrp-mkp #ads {
  margin-bottom: 10px;
}

#ygrp-mkp .ad {
  padding: 0 0;
}

#ygrp-mkp .ad p {
  margin: 0;
}

#ygrp-mkp .ad a {
  color: #0000ff;
  text-decoration: none;
}
#ygrp-sponsor #ygrp-lc {
  font-family: Arial;
}

#ygrp-sponsor #ygrp-lc #hd {
  margin: 10px 0px;
  font-weight: 700;
  font-size: 78%;
  line-height: 122%;
}

#ygrp-sponsor #ygrp-lc .ad {
  margin-bottom: 10px;
  padding: 0 0;
}
-->
</style>


<title>XXX</title>
</head>
<body style=3D"background-color: #ffffff;">

...

  <div style=3D"clear:both" > </div>
        <div style=3D"float:left; margin-top: 22px; margin-bottom: 6px; mar=
gin-left: 40px; font-family: arial, helvetica, clean, sans-serif; color: #3=
33333; font-size: 13px; font-style: normal; font-variant: normal; font-weig=
ht: normal; text-decoration: none; line-height:19px;">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit,  =0D<br>
=0D<br>
 =0D<br>
=0D<br>
 Lorem ipsum dolor sit amet, consectetur adipiscing elit, =0D<br>
  =0D<br>
Lorem ,  =E2=80=93 Lorem ipsum dolor sit amet, consectetur adipiscing elit, =0D<br>
=0D<br>
Lorem ipsum dolor sit amet,  va=
consectetur adipiscing elit, =0D<br>
=0D<br>
 .=0D<br>
 =0D<br>
=0D<br>
 =0D<br>
=0D<br>
adipiscing elit, .=0D<br>
 =0D<br>
=0D<br>
https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup/loose/=
xxx%xxx%xxx%xxx.7z https://objects-us-west-1.dre=
am.io/xxxxxxxxxx-dh-data-backup/loose/xxx%xxx%=
xxx%xxx.7z=0D<br>
=0D<br>
=0D<br>
 =0D<br>
=0D<br>
</div>

...

#21 2018-06-25 12:14:34

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

<enter-command>set pipe_decode=yes<enter>

will remove the trailing `=`, but it does not join the broken lines, which in my opinion makes subsequent URL parsing more difficult.

As far as I understand, `=\n` indicates an unnatural line break, which should be joined back together before further processing.

With progandy's example mail,

macro index,pager \cb "<enter-command>set pipe_decode=yes<enter> <pipe-message>grep -A 5 'object'<Enter> <enter-command>set pipe_decode=no<enter>"

produces

https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup/loose/
xxx%xxx%xxx%xxx.7z https://objects-us-west-1.dream.io/xxxxxxxxxx-dh-data-backup
/loose/xxx%xxx%xxx%xxx.7z

which is even harder to process than with `=\n`, if I'm not mistaken.

#22 2018-06-25 12:46:51

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: How to write bash script which takes data from STDIN & user keypress?

New awk script for these conditions ...

/=$/ {
   url=$0
   while (getline && $1 ~ /^[^=]/) url=url$0
   gsub(/=(0D.*)*/, "", url)
   # add another sub here if some urls don't end in =0D...
   #   this might match <br> or the like - this depends on what your input is
   if (url !~ /^http/) next
   split(url, parts)
   for (i in parts) urls[parts[i]] = 1
}
END {
   for (url in urls) print url
}
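
Hypothetically, saved as e.g. geturls.awk (name made up), this would replace the whole w3m/sed/tr/grep pipeline in your script:

URLs=$(awk -f ~/geturls.awk)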

This may not work as is for every possible format of url you get - but as the possible formats have not been defined very well I've just added a comment to adjust as needed.

Of course I suppose this also assumes that every url is broken (as anything with a first/only line not ending in a '=' is not parsed).  But solutions are hard to come by before the problem is clearly defined.

Last edited by Trilby (2018-06-25 12:48:10)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#23 2018-06-25 22:08:39

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:

New awk script for these conditions ...

/=$/ {
   url=$0
   while (getline && $1 ~ /^[^=]/) url=url$0
   gsub(/=(0D.*)*/, "", url)
   # add another sub here if some urls don't end in =0D...
   #   this might match <br> or the like - this depends on what your input is
   if (url !~ /^http/) next
   split(url, parts)
   for (i in parts) urls[parts[i]] = 1
}
END {
   for (url in urls) print url
}

This may not work as is for every possible format of url you get - but as the possible formats have not been defined very well I've just added a comment to adjust as needed.

Of course I suppose this also assumes that every url is broken (as anything with a first/only line not ending in a '=' is not parsed).  But solutions are hard to come by before the problem is clearly defined.

Hmm... I have to think about the whole thing again - also about how I can get help more effectively. I know it is really annoying for people who are willing to help but can only get second-hand information from the poster. However, for the reason stated in an earlier post, I am not able to present the original files at the moment.

Not all the links are broken across lines, nor are they all duplicated. Not every link even begins with http (although this could be fixed easily). Your scripts are really helpful to me, as are every other commenter's suggestions in this thread. However, I do feel that failing to provide the original data in unedited form prevents others from helping me in the most efficient way.

#24 2018-06-25 23:05:24

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,523
Website

Re: How to write bash script which takes data from STDIN & user keypress?

ss wrote:

Not all the links are broken across lines, nor are they all duplicated. Not every link even begins with http (although this could be fixed easily).

This is the kind of information I'd need to polish a sed/awk script.  Examples are certainly good, but an enumeration of the possible formats is what's really necessary: what can be assumed to be always true, what conditions never need to be dealt with, and what varies.

But as you have the real data to test on, hopefully you could polish the scripts yourself (maybe learning a bit of awk along the way).


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#25 2018-06-26 00:42:17

ss
Member
Registered: 2018-06-04
Posts: 89

Re: How to write bash script which takes data from STDIN & user keypress?

Trilby wrote:

But as you have the real data to test on, hopefully you could polish the scripts yourself (maybe learning a bit of awk along the way).

That's the best kind of help: helping the dude help himself.

Here is my next attempt, based on your input.

BEGIN {
    # list of known hosts
    hosts="(www.host1.com|www.host2.com|www.host3.com)"
    }

$0 ~ hosts {
    url=$0
    # join URLs broken over lines
    if ( url ~ /=$/ ) {
        while (getline && $1 ~ /^[^=]/) {
            sub(/=$/, "", $0)
            url=url$0
        }
    }
    # remove trailing =0D, <br> etc.
    gsub(/=(0D.*$)*/, "", url)
    gsub(/(<br>.*$)*/, "", url)

    # only keep the parts that are actual links,
    # to deal with lines like "the file on http://www.example.com will never expire"
    split(url, parts)
    for (i in parts) {
        if ( parts[i] ~ /http/ ) {
            urls[parts[i]] = 1
        }
    }
}

END {
    for (url in urls) print url
    }

Since the possible download hosts are limited and known beforehand, I think it is advisable to match them first, then join broken lines, then remove the garbage.

Now I have the following questions:

1. I want to put the regex pattern (list of hosts) in a variable. How can I get a nice indentation?
This doesn't work.

BEGIN {
    # list of known hosts
    hosts="(www.host1.com|\
                  www.host2.com|\
                  www.host3.com)"
    }
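
Perhaps awk's implicit string concatenation plus a backslash line continuation would do it? An untested guess:

BEGIN {
    # adjacent string literals are concatenated in awk,
    # and a trailing backslash continues the statement
    hosts="(www.host1.com|" \
          "www.host2.com|"  \
          "www.host3.com)"
    }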

2. This part is for joining the broken lines.

while (getline && $1 ~ /^[^=]/) {
    sub(/=$/, "", $0)
    url=url$0
}

But I don't know how to decipher the regex /^[^=]/.


It's not yet extensively tested, but with the few mails I have tried so far, it works as expected. I probably need to remove the garbage before the http part as well; that will be the next step. And needless to say, any advice and criticism on possible improvements is appreciated.

If extract_url worked for me, I would definitely write nothing myself and just use the existing working solution. However, I have now found that while extract_url is infinitely more comprehensive and powerful for general usage than this little script, it is also infinitely slower! I will eventually run the whole thing on a low-power headless server, so the difference will be very noticeable. Therefore, I think it's certainly worth the effort, apart from me learning a little awk and scripting along the way.
