But I don't know how to decipher the regex /^[^=]/
It's a workaround for my temporary amnesia: I'm better with sed than awk, and it wasn't obvious to me how to do a 'not' match. Basically, I wanted to do this:
while (getline && ! $1 ~ /^=/)
which would mean "getline succeeds and the line does not start with a '='", as the '^' here is the start-anchor. But this did not work - presumably because my syntax for the '!' (invert, or "not") was incorrect. So I just took the easy way out and inverted the pattern to match anything other than an '=' in the first position:
/^[^=]/ - the slashes enclose the pattern
/^[^=]/ - the first '^' anchors the match to the start of the line
/^[^=]/ - the brackets define a character class: (normally) this would match any single character listed within the brackets
/^[^=]/ - a '^' as the first item of a character class inverts it, so this one matches any single character *other* than '='
So all together, the loop will continue as long as there is a single character at the start of a line that is not an '=', which has the same effect as breaking the loop on the first line that starts with an '='.
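A minimal sketch of both variants on fabricated input (the `=stop` line plays the role of the first `=`-prefixed line). The first uses awk's dedicated negated-match operator `!~`, which is likely what the plain `!` attempt was reaching for: unary `!` binds tighter than `~`, so `! $1 ~ /^=/` parses as `(!$1) ~ /^=/`. The second is the inverted-character-class form from the post. Both stop at the first line starting with '=':

```shell
input='start
line1
line2
=stop
after'

# variant 1: negated-match operator
printf '%s\n' "$input" | awk '
NR == 1 {
    # keep reading while the next line does NOT start with "="
    while ((getline) > 0 && $0 !~ /^=/)
        print "notmatch:", $0
}'

# variant 2: inverted character class, as used in the thread
printf '%s\n' "$input" | awk '
NR == 1 {
    while ((getline) > 0 && $0 ~ /^[^=]/)
        print "class:", $0
}'
```

One subtlety: an empty line has no first character, so it fails /^[^=]/ but passes `$0 !~ /^=/`; the two forms differ only on blank lines.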
Last edited by Trilby (2018-06-26 01:02:10)
"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" - Richard Stallman
I think it would be advisable to match them first, then join broken lines, then remove garbage
I don't. Consider
http://foo=\n
.org http://=\n
bar.org http://=\n
foo.org
for filtering out bar.org…
So all together, the loop will continue as long as there is a single character at the start of a line that is not an '=', which has the same effect as breaking the loop on the first line that starts with an '='.
OK, now I see. The bigger problem was that I couldn't follow your reasoning. Sorry that my HTML sample may have misled you. It seems you assume the links will always be followed by `=0D` and a newline? That is not the case: they can be followed by anything. I think the better approach would be to join every line ending with `=` to the line that follows it.
random noise
known noisehttp://part1=\n
part2 random noise https://partA=\n
partBknown noise
random noise
The goal is to get http://part1part2 as well as https://partApartB
where `known noise` can be <br>,=0D etc. and `random noise` can be just anything: blank line, plain text, random html labels and so on.
From all the mails I have seen so far, we can assume that the links are always separated from the random noise by whitespace, while the leading and trailing known noise follows a defined pattern.
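A quick sketch of the join-then-match idea on the sample above (fabricated input; the `https*://` pattern and the `known noise`/`random noise` placeholders come from the thread). Note that the known noise is still glued onto the extracted tokens, which is why the leading/trailing cleanup still has to run afterwards:

```shell
printf 'random noise\nknown noisehttp://part1=\npart2 random noise https://partA=\npartBknown noise\nrandom noise\n' | awk '
{
    # join every line ending with "=" to the line that follows it
    line = $0
    while (line ~ /=$/ && (getline) > 0) {
        sub(/=$/, "", line)
        line = line $0
    }
    # pull the http(s) tokens out of the joined line
    n = split(line, parts)
    for (i = 1; i <= n; i++)
        if (parts[i] ~ /https*:\/\//)
            print parts[i]
}'
```

This prints `noisehttp://part1part2` and `https://partApartBknown`: the broken URLs are whole again, with only the known noise left to strip.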
Last edited by ss (2018-06-26 10:00:11)
I think it would be advisable to match them first, then join broken lines, then remove garbage
I don't. Consider
http://foo=\n
.org http://=\n
bar.org http://=\n
foo.org
for filtering out bar.org…
Yes, you are right.
And the bigger problem would be
http://foo=\n
.org http://b=\n
ar.org http://=\n
foo.org
and we want to *keep* bar.org
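A quick check of this case (sketch, using a join loop of the kind discussed above in the thread): after the soft line breaks are joined, bar.org is a complete URL again, so a host match applied afterwards keeps it:

```shell
# Join lines ending in "="; each whitespace-separated token in the
# result is then a complete URL that can be matched on its own.
printf 'http://foo=\n.org http://b=\nar.org http://=\nfoo.org\n' | awk '
{
    line = $0
    while (line ~ /=$/ && (getline) > 0) {
        sub(/=$/, "", line)
        line = line $0
    }
    print line
}'
```

This prints `http://foo.org http://bar.org http://foo.org`, so bar.org survives and the duplicate foo.org can be deduplicated later.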
I modified the script according to the recent comments.
Three major changes:
1. logic of joining broken lines
2. first join lines, then match host
3. garbage cleaning moved to the very last step so that it is executed only when necessary
EDIT:
4. regex pattern of host with https*://
BEGIN {
    # list of known hosts (dots escaped so "." is matched literally)
    hosts = "https*://(host1\\.com|host2\\.com|host3\\.com)"
}
{
    # join URLs broken over lines if needed
    url = $0
    while ($0 ~ /=$/) {
        sub(/=$/, "", url)
        if ((getline) <= 0)   # stop at EOF instead of looping forever
            break
        url = url $0
    }
    split(url, parts)
    for (i in parts) {
        part = parts[i]
        # only keep the parts with known hosts and http(s), then dedup
        if (part ~ hosts) {
            # remove trailing =0D, <br> etc. (anchored with .*$ so a
            # literal "=" elsewhere in the URL is left alone)
            sub(/=0D.*$/, "", part)
            sub(/<br>.*$/, "", part)
            urls[part] = 1
        }
    }
}
END {
    for (url in urls) print url
}
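A usage sketch on fabricated input (host1.com and host2.com are the placeholder hosts from the thread). This version anchors the trailing-noise substitutions with `.*$`, so a single `sub()` per pattern suffices and any literal "=" inside a URL survives, and it guards `getline` against EOF so the join loop cannot spin at end of input:

```shell
printf 'noise http://host1.com/a=\nbc<br>\nplain text\nhttp://host2.com/x=0D\n' |
awk '
BEGIN {
    hosts = "https*://(host1\\.com|host2\\.com)"
}
{
    # join URLs broken over lines if needed
    url = $0
    while ($0 ~ /=$/) {
        sub(/=$/, "", url)
        if ((getline) <= 0)
            break
        url = url $0
    }
    split(url, parts)
    for (i in parts) {
        part = parts[i]
        if (part ~ hosts) {
            sub(/=0D.*$/, "", part)
            sub(/<br>.*$/, "", part)
            urls[part] = 1
        }
    }
}
END {
    for (u in urls) print u
}' | sort
```

This prints `http://host1.com/abc` and `http://host2.com/x`: the soft-broken URL is rejoined, the `<br>` and `=0D` tails are stripped, and the plain-text line is ignored.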
Last edited by ss (2018-06-26 10:17:04)