#26 2018-06-26 01:00:33

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,550

Re: How to write bash script which takes data from STDIN & user keypress?

ss wrote:

But I don't know how to decipher the regex /^[^=]/

It's a workaround for my temporary amnesia: I'm better with sed than awk, and it wasn't obvious to me how to do a 'not' match.  Basically, I wanted to do this:

while (getline && ! $1 ~ /^=/)

which would mean getline succeeds and the line does not start with a '=', as the '^' here is the start-anchor.  But this did not work - presumably because the syntax for the '!' (invert or not) was incorrect.  So I just took the easy way out and inverted the pattern to match anything other than an '=' in the first position:

/^[^=]/   (the two slashes)
   enclose the pattern
/^[^=]/   (the first '^')
   match against the start of a line
/^[^=]/   (the brackets)
   define a character class: (normally) would match any character listed within the brackets
/^[^=]/   (the '^' inside the brackets)
   '^' as the first item of a character class inverts it: so this one will match any single character *other* than '='

So all together, the loop will continue as long as there is a single character at the start of a line that is not an '=', which has the same effect as breaking the loop on the first line that starts with an '='.
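
A minimal sketch of why the first attempt likely failed, assuming a POSIX awk: unary '!' binds tighter than '~', so `! $1 ~ /^=/` is read as `(!$1) ~ /^=/`, which compares "0" or "1" against the pattern and never matches. The function name compare_negations and the sample lines are made up for illustration.

```shell
# Compare four ways of writing "the first field does not start with '='".
compare_negations() {
    awk '{
        a = (! $1 ~ /^=/)     # parsed as (!$1) ~ /^=/, so never true
        b = ($1 !~ /^=/)      # the non-match operator
        c = (!($1 ~ /^=/))    # explicit parentheses around the match
        d = ($1 ~ /^[^=]/)    # inverted character class from the post
        print a, b, c, d
    }'
}

# Three sample lines: plain text, a line starting with "=", an empty line.
printf '%s\n' 'abc' '=def' '' | compare_negations
```

The last sample line shows one difference worth knowing: the inverted character class needs an actual character to match, so it is also false for an empty line, while the two working negations are true there.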

Last edited by Trilby (2018-06-26 01:02:10)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

Offline

#27 2018-06-26 06:32:15

seth
Member
Registered: 2012-09-03
Posts: 51,608

I think it would be advisable to match them first, then join broken lines, then remove garbage

I don't. Consider

http://foo=\n
.org http://=\n
bar.org http://=\n
foo.org

for filtering out bar.org…

Offline

#28 2018-06-26 09:43:58

ss
Member
Registered: 2018-06-04
Posts: 89

Trilby wrote:

So all together, the loop will continue as long as there is a single character at the start of a line that is not an '=', which has the same effect as breaking the loop on the first line that starts with an '='.

OK, now I see. The bigger problem was that I couldn't follow your thought. Sorry that my html sample might have misled you. It seems that you assume the links will always be followed by a line starting with `=0D`? That is not true. It can be just anything. I think the better approach would be to join every line ending with `=` with the lines that follow it, up to and including the first line that does not end with `=`.

random noise
known noisehttp://part1=\n
part2 random noise https://partA=\n
partBknown noise
random noise   

The goal is to get http://part1part2 as well as https://partApartB
where `known noise` can be <br>, =0D, etc. and `random noise` can be just anything: blank lines, plain text, random html tags and so on.

From all the mails I have seen so far, we can assume that the links will always be separated from random noise by a gap. The noise directly before and after a link follows a defined pattern.
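
The joining step described above can be sketched on its own, using the sample from this post as input. This is only an illustration: the function name join_soft_breaks and the variable names are made up, and it assumes a trailing `=` always marks a broken line.

```shell
# Join every line ending in "=" with the line(s) that follow it.
join_soft_breaks() {
    awk '{
        line = $0
        # keep appending while the text so far ends in "=";
        # stop at end of input so a final "=" cannot loop forever
        while (line ~ /=$/ && (getline nxt) > 0) {
            sub(/=$/, "", line)
            line = line nxt
        }
        print line
    }'
}

printf '%s\n' \
    'random noise' \
    'known noisehttp://part1=' \
    'part2 random noise https://partA=' \
    'partBknown noise' | join_soft_breaks
```

After this step both links sit whole on one line, still with the known noise attached, so matching hosts and stripping noise can happen afterwards.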

Last edited by ss (2018-06-26 10:00:11)

Offline

#29 2018-06-26 09:45:58

ss
Member
Registered: 2018-06-04
Posts: 89

seth wrote:

I think it would be advisable to match them first, then join broken lines, then remove garbage

I don't. Consider

http://foo=\n
.org http://=\n
bar.org http://=\n
foo.org

for filtering out bar.org…

Yes, you are right.
And the bigger problem would be

http://foo=\n
.org http://b=\n
ar.org http://=\n
foo.org

and we want to *keep* bar.org

Offline

#30 2018-06-26 09:53:01

ss
Member
Registered: 2018-06-04
Posts: 89

I modified the script according to the recent comments.
Three major changes:
1. logic of joining broken lines
2. first join lines, then match host
3. move garbage cleaning to the very last part so that it is executed only when necessary.
EDIT:
4. regex pattern of host with https*://

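BEGIN {
    # list of known hosts; note that an unescaped "." matches any character
    hosts = "https*://(host1.com|host2.com|host3.com)"
}

{
    # join URLs broken over lines: a trailing "=" marks a broken line
    url = $0
    # stop at end of input so a final line ending in "=" cannot loop forever
    while (url ~ /=$/ && (getline line) > 0) {
        sub(/=$/, "", url)
        url = url line
    }

    # split the joined text on whitespace and check each piece
    split(url, parts)
    for (i in parts) {
        part = parts[i]
        # only keep the parts with known hosts and http
        if (part ~ hosts) {
            # remove trailing =0D, <br> etc.; requiring the marker keeps
            # any "=" inside the URL itself (e.g. in a query string)
            sub(/(=0D|<br>).*$/, "", part)

            # dedup: array keys are unique
            urls[part] = 1
        }
    }
}

END {
    for (url in urls) print url
}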
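
One detail of the cleanup step is easy to check in isolation. In a pattern like /=(0D.*$)*/ the group is optional, so a bare '=' matches too and gsub removes every '=' in the link, including one in a query string. A small comparison, assuming the noise always starts with =0D or <br>; the function name cleanup_demo and the sample URL are made up:

```shell
# Show two ways of stripping trailing noise from the same input line.
cleanup_demo() {
    awk '{
        loose = $0
        gsub(/=(0D.*$)*/, "", loose)          # optional group: hits every "="
        strict = $0
        sub(/(=0D|<br>).*$/, "", strict)      # only from the noise marker on
        print loose
        print strict
    }'
}

printf '%s\n' 'http://host1.com/?a=b=0D' | cleanup_demo
```

The first printed line has lost the '=' of the query string, the second keeps it.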

Last edited by ss (2018-06-26 10:17:04)

Offline
